SpiderLabs / owasp-modsecurity-crs

OWASP ModSecurity Core Rule Set (CRS) Project (Official Repository)
https://modsecurity.org/crs
Apache License 2.0

RE2 compatibility for 920120 #1663

Open allanrbo opened 4 years ago

allanrbo commented 4 years ago

Problem

Negative lookbehinds are not supported by many regex engines (RE2, Go's regexp, Hyperscan).

My motivation for avoiding this PCRE-specific construct is that I am working on an experimental WAF that uses CRS, but uses Hyperscan and Go's regexp package rather than PCRE. I'm hoping to be able to keep using vanilla CRS, so this is why I'm hoping to get this accepted in upstream. Others who are also doing similar experiments with non-PCRE regex engines will benefit too.

Suggested rewrite

The goal of the original regex was to disallow semicolons, except when preceded by strings that make them well-known HTML entities, such as &auml;, &acirc;, etc.

The approach of my solution is to write alternatives for each possible prefix that is not &[aAoOuUyY]uml;, &[aAeEiIoOuU]circ;, etc.

I have described the approach in detail here, and posted a Python script I used to generate the final regex: http://allanrbo.blogspot.com/2020/01/alternative-to-negative-lookbehinds-in.html
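As a toy illustration of the rewrite technique (using a made-up three-character prefix "abc" rather than the real entity list from rule 920120), the negative lookbehind and its alternation-based equivalent can be compared directly:

```python
import re

# PCRE-style original: match ";" unless it is preceded by "abc".
lookbehind = re.compile(r"(?<!abc);")

# Lookbehind-free rewrite: one alternative per way the preceding
# characters can fail to be "abc", including the too-short cases
# covered by the ^ anchors.
rewritten = re.compile(r"(?:^|[^c]);|(?:^|[^b])c;|(?:^|[^a])bc;")

for s in [";", "c;", "ab;", "abc;", "xbc;", "zabc;", "abcz;"]:
    assert bool(lookbehind.search(s)) == bool(rewritten.search(s))
```

Note that the rewritten pattern can match a longer span than the lookbehind version (e.g. `b;` instead of just `;`), which is the match-offset difference discussed later in this thread.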

This is a tough tradeoff in human readability of the rule. If anyone has any ideas for a more readable approach, I'd be very eager to hear.

One alternative approach I explored but failed at is to split the rule into multiple chained rules. The first rule in the chain could catch something like (&[^;]+)?;. The second rule in the chain could then be a negative check, something like SecRule TX:1 "!@rx (&[aAoOuUyY]uml;|&[aAeEiIoOuU]circ|...)" .... Unfortunately this fails when the input has multiple semicolons: the first semicolon may be preceded by &auml, so the second chained rule is only evaluated for that TX:1, ignoring subsequent semicolons which may contain the attack.
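The failure mode of the chained-rule approach can be sketched outside ModSecurity. Below is a hypothetical Python approximation of the two chained checks, showing that on an input with two semicolons only the first (benign) one ever reaches the negative check:

```python
import re

# Hypothetical stand-ins for the two chained rules sketched above.
payload = "&auml;attack;"  # the first ";" closes a known entity, the second does not

# Rule 1 of the chain: capture the first "(&[^;]+)?;" occurrence.
tx1 = re.search(r"(&[^;]+)?;", payload).group(0)
assert tx1 == "&auml;"

# Rule 2 of the chain: the negative check inspects only TX:1, which
# here is a well-known entity, so the chain never fires -- the
# second, suspicious ";" is never examined.
assert re.fullmatch(r"&[aAoOuUyY]uml;|&[aAeEiIoOuU]circ;", tx1)
```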

Testing

Ran all the FTW regression tests.

Manually tested via regex101.com:

file=.txt               shouldMatchTrue
fi=le                   shouldMatchTrue
fi'le                   shouldMatchTrue
file                    shouldMatchFalse
fi;le                   shouldMatchTrue
;zzzz                   shouldMatchTrue
zzz&amp;zzz             shouldMatchFalse
zzz&Auml;zzz            shouldMatchFalse
zzz&auml;zzz            shouldMatchFalse
&amp;zzz                shouldMatchFalse
amp;zzz                 shouldMatchTrue
mp;zzz                  shouldMatchTrue
p;zzz                   shouldMatchTrue
Zamp;zzz                shouldMatchTrue
Zmp;zzz                 shouldMatchTrue
Zp;zzz                  shouldMatchTrue
Z;zzz                   shouldMatchTrue
ZZZamp;zzz              shouldMatchTrue
ZZZmp;zzz               shouldMatchTrue
ZZZp;zzz                shouldMatchTrue
ZZZ;zzz                 shouldMatchTrue
mZ;zzz                  shouldMatchTrue

Additionally wrote a little Go program to test against a few different regex engines: https://gist.github.com/allanrbo/93fb64549151169ff3a8a107cce8ce1b

fgsch commented 4 years ago

I haven't looked at this in detail yet, but it occurs to me we could move it to util/regexp-assemble/regexp-920120.data to help readability, at least for the non-assembled version (assuming the generated version works the same way as the one in this PR).

fgsch commented 4 years ago

Having had another look, I realise the pattern presented is generated by a script. My bad, sorry.

Perhaps we should include your script in a similar way to how we include regexp-assemble.pl, and generate the final pattern so that the original pattern stays readable.

As for the pattern, I believe you could extract the common suffix as well (^|). I do wonder how this compares, performance wise, with the original pattern when using pcre. Have you done any testing?

Finally, as with previous cases of replacing lookarounds, the resulting match will be different. I don't think this is an issue, but it's worth keeping in mind.

dune73 commented 4 years ago

Awesome PR and very interesting blogpost. Thank you very much @allanrbo.

I second @fgsch's remark about inclusion of the generator into our repository.

I have also looked up 920120 in my logs to see if we can come up with additional FTW tests - the 3 we have are a bit thin - but no luck. However, I suggest adding tests to the PR based on Allan's work and documentation. This will be useful as we move forward.

allanrbo commented 4 years ago

Thanks for reading. Good idea to include the generator script in this repo so it doesn't get lost. Will update the PR. I'll also see if I can turn the manual tests I mentioned above into some more FTW tests.

fgsch commented 4 years ago

Thanks for updating the PR.

Have you seen this part of my comment?

As for the pattern, I believe you could extract the common suffix as well (^|). I do wonder how this compares, performance wise, with the original pattern when using pcre. Have you done any testing?

allanrbo commented 4 years ago

Sorry, overlooked those questions. Good questions.

Regarding extracting ^

I don't think it's possible to extract out the ^. Consider this example:

(
    (^|[^c])def
    |
    (^|[^b])cdef
    |
    (^|[^a])bcdef
)

And further consider just this part:

(^|[^c])def

The ^ meaning "beginning of string" is only in one of the alternatives. The ^ in [^c] instead means "a character that is not c".
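A quick Python check illustrates both meanings of ^ in that fragment:

```python
import re

pattern = re.compile(r"(^|[^c])def")

assert pattern.search("def")        # matches via "^": "def" at start of string
assert pattern.search("xdef")       # matches via "[^c]": "x" is not "c"
assert not pattern.search("cdef")   # "c" precedes "def" and it's not at the start
```

Because the anchor `^` appears per-alternative (and means something entirely different inside `[...]`), it cannot be hoisted out of the alternation.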

Regarding perf

With regards to performance, yes, it's worse, so it's a tradeoff - especially for WAFs running ModSecurity, which currently probably covers close to 100% of users. The reason I still propose it is that it paves the way to making CRS useful on non-ModSecurity WAFs too.

Here are some numbers I've generated from this test program, which evaluates the old and new regexes 1 million times in the input helloworld;helloworld and then divides the time by 1 million to get the time per evaluation:

PCRE with old regex: 1.632744578 microseconds
PCRE with new regex: 14.663659765 microseconds
Hyperscan with new regex: 0.414589807 microseconds
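This kind of per-evaluation measurement can be roughly reproduced in another engine. A minimal sketch with Python's re module, reusing the toy "abc" patterns rather than the real 920120 regexes (absolute numbers will of course differ from the PCRE and Hyperscan figures above):

```python
import re
import timeit

text = "helloworld;helloworld"
lookbehind = re.compile(r"(?<!abc);")                          # toy stand-in for the old regex
rewritten = re.compile(r"(?:^|[^c]);|(?:^|[^b])c;|(?:^|[^a])bc;")  # toy rewritten form

n = 100_000
for name, rx in [("lookbehind", lookbehind), ("rewritten", rewritten)]:
    # Time n evaluations, then divide to get the per-evaluation cost.
    per_call_us = timeit.timeit(lambda: rx.search(text), number=n) / n * 1e6
    print(f"{name}: {per_call_us:.3f} microseconds per evaluation")
```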

Regarding whether match is different

I've taken great care to ensure it's not different. That's what I've written about here: http://allanrbo.blogspot.com/2020/01/alternative-to-negative-lookbehinds-in.html . But sure, there could be a flaw in my reasoning that I haven't realized yet :-)

fgsch commented 4 years ago

Regarding extracting ^

Right - I understand the nuances of ^. I was a bit hasty and didn't analyse the pattern in detail before making the suggestion. You are entirely correct.

Regarding perf

Based on your tests we are talking about a ~9x performance penalty for 99.99% of users. I understand where you are coming from but this makes me a bit hesitant. The number of branches with backtracking makes me a bit uneasy as well. Perhaps both of these concerns are unwarranted.

FWIW, for Hyperscan, this is somewhat less of a problem since you can use Chimera. For RE2 unfortunately there is no alternative.

Regarding whether match is different

What I meant here is that while for the same input both patterns will match, what the patterns will match on will differ. Taking the input above and comparing the results before and after this change shows, for example:

 amp;zzz
- 0: ;
+ 0: amp;
 mp;zzz
- 0: ;
+ 0: mp;

fgsch commented 4 years ago

For a simpler test limited to pcre, you can run pcretest -TM <file> using the different patterns and inputs.

allanrbo commented 4 years ago

Hi Federico, thanks for the input. Yea I totally get if CRS doesn't want to merge this PR. It's fine if we just close it. I know it doesn't have any short term benefit for anyone besides me. The reason I thought I'd share it anyway, was because it may have a long term benefit, as it paves the way for CRS on future WAF engines that strictly avoid PCRE (like the experiment I'm currently prototyping).

I wish there was a way to achieve this without such a long, complex, and slow regex. I feel like there may be a way with chained rules, but just haven't been able to find one...

Perf discussion continued

Thinking of it as a 900% perf penalty feels kind of blunt. It's probably closer to 0.25%, and only on very specific requests. Here's why I believe so.

No requests will be 9x slower. But on those specific multipart/form-data POST requests that also contain a file name field, the one operation that scans the name and file name fields, which used to take ModSecurity around 0.001633 ms, is now 9x slower at around 0.01467 ms. On my computer with Nginx+ModSecurity I ran a couple of thousand runs of a simple 520-byte multipart/form-data POST non-keepalive request with a file field. It averaged around 5.189 ms. So the theoretical overhead of this change should be (0.01467 - 0.001633) / 5.189 ≈ 0.0025 = 0.25% on these requests. I reran with the new regex, and it averaged 5.171 ms - surprisingly lower, but close enough to be measurement error.
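Spelled out, the overhead estimate is just:

```python
# Per-evaluation regex cost (microseconds) and request latency (ms)
# from the measurements quoted above.
old_us = 1.633
new_us = 14.67
request_ms = 5.189

# Extra cost per request, as a fraction of total request latency.
overhead = (new_us - old_us) / 1000 / request_ms
print(f"{overhead:.4f}")  # ~0.0025, i.e. about 0.25% per affected request
```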

Regarding backtracking

RE2 will never backtrack. It achieves |-alternatives without backtracking, purely via state machines, I think: https://github.com/google/re2/wiki/WhyRE2 .

pcretest

I couldn't find pcretest, but found pcre2test on Ubuntu; assuming it's equivalent. I ran pcre2test -TM with the old and new regexes against the tests I wrote above, and got Total match time 0.0028 milliseconds for the old and Total match time 0.0924 milliseconds for the new. So yes, the new regex is definitely slower on PCRE - no question about that.

Whether match is different

I understand what you mean now. Yes, TX:0 will be slightly different with the new regex, but since TX:0 is not used I don't think this is a problem?

fgsch commented 4 years ago

It's too early to tell either way (whether this will be merged) but the more information we have, the more informed the decision will be.

Perf discussion continued

This will be the worst-case scenario based on your tests, but still something we should consider. My concerns are not limited to this, however: as I mentioned, the number of branches and the backtracking make me uneasy as well.

Regarding backtracking

I was referring to PCRE here, the only supported engine in ModSecurity at this time.

Whether match is different

TX.0 and MATCHED_VAR are basically the same, so it will affect logdata, but as I mentioned previously I don't think it's a problem - just something to keep in mind.

allanrbo commented 4 years ago

Whether match is different

I believe MATCHED_VAR will be the same before and after, because MATCHED_VAR contains the entire field that was matched, and not just the substring that the regex matched.

For example I just tested this rule

SecRule ARGS helloworld123 "id:100,deny,capture,logdata:'MATCHED_VAR: %{MATCHED_VAR}, TX.0: %{TX.0}'"

when given the URL

/?a=abchelloworld123abc

produces this logdata

[data "MATCHED_VAR: abchelloworld123abc, TX.0: helloworld123"]
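The same distinction in plain Python terms (the full field value versus the regex's own match):

```python
import re

field = "abchelloworld123abc"  # the value of ARGS:a in the example above
m = re.search(r"helloworld123", field)

# TX.0 corresponds to the regex's matched substring...
assert m.group(0) == "helloworld123"
# ...while MATCHED_VAR carries the entire variable value, so it is
# unaffected by where exactly the pattern matched inside it.
assert field == "abchelloworld123abc"
```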

dune73 commented 4 years ago

That's the correct and expected behavior for MATCHED_VAR, I think.

fgsch commented 4 years ago

I stand corrected. As you said, there will be no difference in logging with this change.

dune73 commented 4 years ago

Meeting decision: @dune73 will look into this, namely whether the performance impact is acceptable, and then we can decide whether to accept the PR (or drop it).

https://github.com/SpiderLabs/owasp-modsecurity-crs/issues/1671#issuecomment-584320407

dune73 commented 4 years ago

Adding the needs-action label to express the fact that we need hard performance data from production.