VectorCamp / vectorscan

A portable fork of the high-performance regular expression matching library
https://www.vectorcamp.gr/project/vectorscan/

Vectorscan does not support backreferences with indices >= 8 (HS_FLAG_PREFILTER is on) #209

Open apismensky opened 11 months ago

apismensky commented 11 months ago

The regex a([ -]?)a\\1a|b([ .-]?)b\\2b|c([ -]?)c\\3c|d([ -]?)d\\4d|e([ -]?)e\\5e|f([ -]?)f\\6f|g([ -]?)g\\7g|h([ -]?)h\\8h (shown with C++ string-literal escaping) should match all of the following strings: "a a a", "b b b", "c c c", "d d d", "e e e", "f f f", "g g g" and "h h h". However, it matches every one of them except "h h h".

Test to reproduce:

TEST(order, alexey1) {
    vector<pattern> patterns;
    patterns.push_back(pattern("a([ -]?)a\\1a|b([ .-]?)b\\2b|c([ -]?)c\\3c|d([ -]?)d\\4d|e([ -]?)e\\5e|f([ -]?)f\\6f|g([ -]?)g\\7g|h([ -]?)h\\8h", HS_FLAG_DOTALL | HS_FLAG_PREFILTER | HS_FLAG_MULTILINE | HS_FLAG_CASELESS | HS_FLAG_UCP | HS_FLAG_UTF8, 1));
    const char *data = "h h h";

    hs_database_t *db = buildDB(patterns, HS_MODE_NOSTREAM);
    ASSERT_NE(nullptr, db);

    hs_scratch_t *scratch = nullptr;
    hs_error_t err = hs_alloc_scratch(db, &scratch);
    ASSERT_EQ(HS_SUCCESS, err);

    CallBackContext c;
    err = hs_scan(db, data, strlen(data), 0, scratch, record_cb,
                  (void *)&c);
    ASSERT_EQ(HS_SUCCESS, err);

    EXPECT_EQ(1, countMatchesById(c.matches, 1));
    err = hs_free_scratch(scratch);
    ASSERT_EQ(HS_SUCCESS, err);
    hs_free_database(db);
}

There is a comment about 8 and 9 at https://github.com/VectorCamp/vectorscan/blob/master/src/parser/Parser.rl#L1503, but I'm not sure why 8 and 9 are special cases. Are we supposed to pass them as octal numbers?

markos commented 11 months ago

It might be that it expects octal; I will run some local tests on this, but it could just as well be a bug. I admit I'm not very familiar with this part of the code.

seanrohead commented 11 months ago

@markos It looks like Perl supports octal escapes, but it is supposed to interpret the escape as a backreference if enough capture groups have been seen to make it a valid backreference.

See https://perldoc.perl.org/perlrebackslash#Disambiguation-rules-between-old-style-octal-escapes-and-backreferences
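To make the disambiguation concrete, here is a minimal C++ sketch of the rules as summarized in that perldoc section. `classifyNumericEscape` and `EscapeKind` are hypothetical names for illustration only; they are not part of Vectorscan, PCRE, or Perl, and treating a leading `0` as octal before the other checks is my reading of the rules rather than something confirmed in this thread.

```cpp
#include <string>

// Hypothetical illustration only; not Vectorscan's or PCRE's parser.
enum class EscapeKind { Backreference, Octal };

// Classify "\<digits>" following Perl's documented disambiguation rules.
// Precondition: `digits` is a non-empty run of decimal digits.
// `groupsSeen` is the number of capture groups opened so far in the pattern.
EscapeKind classifyNumericEscape(const std::string &digits,
                                 unsigned long groupsSeen) {
    // Assumption: a leading zero always means an octal escape.
    if (digits[0] == '0') {
        return EscapeKind::Octal;
    }
    // A single non-zero digit is always a backreference, so \8 and \9
    // qualify even though 8 and 9 are not octal digits.
    if (digits.size() == 1) {
        return EscapeKind::Backreference;
    }
    // Multi-digit: a backreference only if that many groups already exist.
    const unsigned long n = std::stoul(digits);
    return n <= groupsSeen ? EscapeKind::Backreference : EscapeKind::Octal;
}
```

Under these rules, the `\8` in the pattern from this issue would be classified as a backreference regardless of how many groups precede it, which matches the behaviour the reporter expected.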

markos commented 10 months ago

@seanrohead @apismensky This is most likely related to PCRE and is one of the limitations/differences between PCRE and PCRE2:

https://stackoverflow.com/questions/70273084/regex-differences-between-pcre-and-pcre2/73767663#73767663

This will probably be resolved once #83 is fixed and we migrate to PCRE2.