Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License
501 stars 86 forks source link

Pattern::predict_match_dfa() causes assertion errors on MVSC++ #65

Closed ghost closed 4 years ago

ghost commented 4 years ago

When using the current release version 1.5.5, my program regenerates debug assertion errors when building under Visual C++ (have not tested other compilers/platforms).

The issue is coming from isprint() which is used inside of this block of code below.

    for (Index i = 0; i < Const::HASH; ++i)
    {
      if (pmh_[i] != 0xFF)
      {
        if (isprint(i))
          DBGLOGN("pmh['%c'] = %2x\n", i, pmh_[i]);
        else
          DBGLOGN("pmh[%3d] = %2x\n", i, pmh_[i]);
      }
    }
    for (Index i = 0; i < Const::HASH; ++i)
    {
      if (pma_[i] != 0xFF)
      {
        if (isprint(i))
          DBGLOGN("pma['%c'] = %2x\n", i, pma_[i]);
        else
          DBGLOGN("pma[%3d] = %2x\n", i, pma_[i]);
      }
    }

The _chvalidator() function which I assume is called at some point inside of isprint() includes the following line: _ASSERTE(c >= -1 && c <= 255);.

I don't think isprint(i) was intended, since Const::HASH is 4096 and the loops would go over 255 regardless. I tried isprint(pmh_[i]) and isprint(pma_[i]) which does get rid of errors.

genivia-inc commented 4 years ago

Thanks for the feedback. Those calls should be isprint(pmh_[i]) and isprint(pma_[i]) as you point out. The problem happens when you're debugging the DFA construction, but not otherwise fortunately.

To be more precise, this part of the code constructs hash-based match predictors to speed up find by searching the input before a DFA match is attempted. The scan operation used in scanners doesn't use or need this feature.

I have experimented with several ways to speed up searching, by creating a set of tests to evaluate Boyer-Moore, Aho-Corasick, Bitap, and hashing-based algorithms. BM and AH are slow, but BM does beat memchr in some cases which is evaluated with a threshold function that picks memchr or BM. If memchr is picked, it uses the least frequently used character to search to speed this up further.

ghost commented 4 years ago

Ah I see, well thankfully it isn't a major issue and the fix is simple so I'll leave it at that.

Thank you.

genivia-inc commented 4 years ago

This problem will be fixed in the upcoming release as this issue shouldn't happen of course.

genivia-inc commented 4 years ago

Version 1.5.6 is available with the fix.