SpiderLabs / owasp-modsecurity-crs

OWASP ModSecurity Core Rule Set (CRS) Project (Official Repository)
https://modsecurity.org/crs
Apache License 2.0

Perf issue with regexes that start with repeating digits #1708

Open allanrbo opened 4 years ago

allanrbo commented 4 years ago

On the Slack channel on 2020-03-02, Airween pointed out some PCRE performance issues with requests containing long consecutive strings of digits. He suggested a fix using the (*SKIP) flag. Since this flag is PCRE-specific, I'd like to avoid it, in order to pave the way for ModSecurity alternatives that use regex libraries with guaranteed linear run time (RE2, Hyperscan, and Go's regexp, for example).

This PR is a suggestion to modify 941120, 942210, and 942260.

Here's my understanding of the issue: the regexes in these rules all have the option to begin matching with a repeated number of digits, such as [0-9]+. For example, the regex [0-9]+a would match 11111a, 111111111a, 11111111111111a, etc. The problem arises with a string like 11111 (without an a). In this example, PCRE has consumed 11111 and then looks for a, doesn't find it, and therefore backtracks to try with just 1111, then 111, then 11, and finally with 1. ~~To avoid this backtracking, we can instead anchor the regex before the [0-9]+, to either the beginning of the string or some character that is not a digit. The fixed example then becomes (?:^|[^0-9])[0-9]+a.~~
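The backtracking cost is easy to demonstrate. Here is a small sketch using Python's backtracking `re` engine as a stand-in for PCRE (the input sizes are made up for illustration):

```python
import re
import time

def time_search(pattern, subject):
    # Compile and time a single search; returns the match result and elapsed seconds.
    rx = re.compile(pattern)
    t0 = time.perf_counter()
    result = rx.search(subject)
    return result, time.perf_counter() - t0

digits = "1" * 3000

# Matching input: [0-9]+ consumes the digits, finds 'a' immediately; no backtracking.
m, t_match = time_search(r"[0-9]+a", digits + "a")

# Non-matching input: the engine backtracks through the digit run at every
# start position, giving quadratic behavior.
n, t_nomatch = time_search(r"[0-9]+a", digits)

print(f"match: {t_match:.6f}s, no match: {t_nomatch:.6f}s")
```

The non-matching case is dramatically slower, even though it is the common case for benign traffic.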

Mirko suggested below a much cleaner approach than my first one: simply don't bother with the quantifier. I've updated the PR with this cleaner approach.

Tested manually with a large request as suggested by Airween, and verified that the rules now perform many orders of magnitude faster.

There is also a perf issue with 942130, but I will send a separate PR with a suggested fix, as I have (by chance) been working on some changes to that rule for a few weeks.

mirkodziadzka-avi commented 4 years ago

If we are looking at [0-9]+a to detect an attack, we are not extracting part of the data; we only want to know whether this regex would match the input.

In this case, this regex is equivalent to [0-9]a

The former will match if and only if the latter will match. So we can simplify this without anchoring the regex, and we avoid the backtracking here.
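The equivalence is easy to spot-check. A sketch in Python's `re` (which, for a pure presence test, behaves like PCRE here), with made-up sample inputs:

```python
import re

full = re.compile(r"[0-9]+a")    # quantified form
single = re.compile(r"[0-9]a")   # simplified form

# For match-or-no-match the two agree: any match of "[0-9]+a" necessarily
# ends with a single digit directly before the 'a'.
for s in ["11111a", "1a", "xx123a", "11111", "a", "abc", ""]:
    assert bool(full.search(s)) == bool(single.search(s))

print("equivalent on all samples")
```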

So instead of

SecRule REQUEST_COOKIES|!REQUEST_COOKIES:/__utm/|REQUEST_COOKIES_NAMES|REQUEST_HEADERS:User-Agent|REQUEST_HEADERS:Referer|ARGS_NAMES|ARGS|XML:/* "@rx (?i)[\s\"'`;\/0-9=\x0B\x09\x0C\x3B\x2C\x28\x3B]+on[a-zA-Z]+[\s\x0B\x09\x0C\x3B\x2C\x28\x3B]*?=" \

we could write

SecRule REQUEST_COOKIES|!REQUEST_COOKIES:/__utm/|REQUEST_COOKIES_NAMES|REQUEST_HEADERS:User-Agent|REQUEST_HEADERS:Referer|ARGS_NAMES|ARGS|XML:/* "@rx (?i)[\s\"'`;\/0-9=\x0B\x09\x0C\x3B\x2C\x28\x3B]on[a-zA-Z]+[\s\x0B\x09\x0C\x3B\x2C\x28\x3B]*?=" \

dune73 commented 4 years ago

That makes sense to me.

allanrbo commented 4 years ago

I think that's a great suggestion, Mirko. Much simpler. I'll either update this PR or if Airween is sending one I'll close this one. Thanks!

airween commented 4 years ago

Let me make some performance check, then we can decide.

mirkodziadzka-avi commented 4 years ago

I do not claim to understand the patterns, but after staring at the 3 changes for some minutes, they look fine to me. In all 3 cases we are changing a repeated element at the beginning of the regex (the start of a top-level alternative) to a single occurrence, which is semantically equivalent for our use case.


A side remark:

If you are looking into more detail here, there may be room for making these regexes smaller.

For example, there is a repeating pattern

|--\s*?(?:(?:insert|update)\s*?\w{2,}|alter|drop)|#\s*?(?:(?:insert|update)\s*?\w{2,}|alter|drop)|;\s*?(?:(?:insert|update)\s*?\w{2,}|alter|drop)|

or, more readably,

|
--\s*?(?:(?:insert|update)\s*?\w{2,}|alter|drop)|
#\s*?(?:(?:insert|update)\s*?\w{2,}|alter|drop)|
;\s*?(?:(?:insert|update)\s*?\w{2,}|alter|drop)|

which seems to be the same as

|(?:--|#|;)\s*?(?:(?:insert|update)\s*?\w{2,}|alter|drop)|
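That factoring can be spot-checked the same way (a sketch in Python's `re`; the sample inputs are made up):

```python
import re

# The three near-identical alternatives from the original rule.
long_form = re.compile(
    r"--\s*?(?:(?:insert|update)\s*?\w{2,}|alter|drop)"
    r"|#\s*?(?:(?:insert|update)\s*?\w{2,}|alter|drop)"
    r"|;\s*?(?:(?:insert|update)\s*?\w{2,}|alter|drop)"
)

# The factored form: one alternation over the three prefixes.
short_form = re.compile(r"(?:--|#|;)\s*?(?:(?:insert|update)\s*?\w{2,}|alter|drop)")

samples = ["-- drop", "# insert xy", "; update tbl", "select 1", "--x", "; alter"]
for s in samples:
    assert bool(long_form.search(s)) == bool(short_form.search(s))

print("factored form agrees on all samples")
```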

Another side remark:

Seeing this, I think that \s* is better than \s*? here. We do not need the non-greedy quantifier, because we do not want to extract data, and for the question match-or-no-match, the difference between * and *? is irrelevant. As far as I understand the implementation, this holds even from a performance point of view: our default case is "no match", and in that case the backtracking engine has to check all alternatives anyway.
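A quick sanity check of the greedy/lazy equivalence for presence testing (Python `re` sketch with hypothetical inputs):

```python
import re

lazy = re.compile(r";\s*?drop")   # non-greedy whitespace
greedy = re.compile(r";\s*drop")  # greedy whitespace

# Greedy vs lazy only changes *which* characters a quantifier claims,
# never *whether* an overall match exists.
for s in [";drop", ";   drop", "; dro", "x;y", ""]:
    assert bool(lazy.search(s)) == bool(greedy.search(s))

print("greedy and lazy agree on match-or-no-match")
```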


Another side remark:

In a former company, we used macro expansion to generate regexes from smaller snippets, which made the whole thing more readable. I wonder if this would be useful here. Do we have repeating patterns between different rules?

allanrbo commented 4 years ago

Good suggestions for simplifications. Maybe let's keep them in a separate PR, so it's easier to understand the reasons for the changes?

We do not have a macro expansion step for the regexes in this project, but we do, as a matter of fact, have a regex compilation step that combines smaller regexes into bigger ones. I had forgotten about it until I saw your comment. I pushed a new iteration with the regexp data files updated. Somehow the regexp-assemble.pl script decided to move things around a bit, though. I don't know why; that regexp-assemble.pl script is fairly opaque to me.

allanrbo commented 4 years ago

Oddly, regexp-assemble.pl added an extra \ in front of one of the +s in 942260. This seems significant, and doesn't seem intentional (it's not there in the regexp-942260.data file). @franbuehler, do you happen to know what's going on? Should I just hand-edit the output of regexp-assemble.pl a little in this strange case?

dune73 commented 4 years ago

Please don't hand-edit.

@fgsch: Do you have an idea what's the matter here?

Also: there might be regex-snippet re-use across multiple rules. But even if it would be a simplification in some regards, I'm reluctant to welcome it, as it grows the complexity of the regex generation even more.

mirkodziadzka-avi commented 4 years ago

> Also: There might be regex-snippet re-use across multiple rules. But even if it would be a simplification in some regards, I'm reluctant to welcome this as it grows the complexity of the regex generation even more.

We did use it for things that should always be the same, for example:

SQLI_START = ("|'|;|`|--)\s*
SQL_COMMENT = ...

But you are right, it is a question of complexity. In our case back then, it reduced complexity.
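For concreteness, a minimal sketch of what such snippet-based assembly could look like. The snippet names and contents below are hypothetical, not the actual ones from that company:

```python
import re

# Hypothetical shared snippet definitions, written once and reused across rules.
snippets = {
    "SQLI_START": r"(?:\"|'|;|`|--)\s*",
    "SQL_VERB": r"(?:insert|update|alter|drop)",
}

def expand(template: str) -> str:
    # Substitute {NAME} placeholders with the shared snippet definitions.
    return template.format(**snippets)

# A rule author writes the readable template; the expanded regex is generated.
rule_rx = re.compile(expand("{SQLI_START}{SQL_VERB}"))
```

Changing a snippet in one place would then update every rule that uses it, which is where the readability and consistency win comes from.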

airween commented 4 years ago

A quick performance test result:

./demo mirko.txt 130knomatch.txt 
Time elapsed: 0.010099
Time elapsed: 0.004793
Time elapsed: 0.004677
Time elapsed: 0.005202
Time elapsed: 0.004632
Time elapsed: 0.005121
Time elapsed: 0.004704
Time elapsed: 0.004770
Time elapsed: 0.005645
Time elapsed: 0.004700

./demo allan.txt 130knomatch.txt 
Time elapsed: 0.013856
Time elapsed: 0.008231
Time elapsed: 0.008282
Time elapsed: 0.008459
Time elapsed: 0.008216
Time elapsed: 0.008523
Time elapsed: 0.008440
Time elapsed: 0.008185
Time elapsed: 0.008262
Time elapsed: 0.008202

The values are very similar in the case of a long matching pattern.

I think Mirko's suggestion is better.

Any other ideas (related to test, fix, ...) are welcome.

dune73 commented 4 years ago

Cool. Could you please share the code? Could we integrate this into utils?

allanrbo commented 4 years ago

Yea, I totally agree, don't use my initial suggestion. Removing code is much better than adding code to achieve the same result. This PR is already updated to use Mirko's approach.

dune73 commented 4 years ago

But the problem with the assembly script persists, doesn't it? We should not merge as long as we have not sorted that one out.

mirkodziadzka-avi commented 4 years ago

@airween do you have the timings for the original regex and the same input as comparison?

airween commented 4 years ago

Yes, I checked the original regex, but didn't list it here because the result was as expected: they all take a very loooong time... above 13 secs each.

But you're right; here is the full and fresh list:

./demo original.txt 130knomatch.txt 
Time elapsed: 13.471386
Time elapsed: 13.823770
Time elapsed: 13.337333
Time elapsed: 13.382550
Time elapsed: 13.457501
Time elapsed: 13.569796
Time elapsed: 13.536542
Time elapsed: 13.344020
Time elapsed: 13.200333
Time elapsed: 13.445126

./demo mirko.txt 130knomatch.txt 
Time elapsed: 0.009132
Time elapsed: 0.006383
Time elapsed: 0.005335
Time elapsed: 0.004669
Time elapsed: 0.004787
Time elapsed: 0.004659
Time elapsed: 0.004647
Time elapsed: 0.004612
Time elapsed: 0.004631
Time elapsed: 0.004917

./demo allan.txt 130knomatch.txt 
Time elapsed: 0.014432
Time elapsed: 0.008248
Time elapsed: 0.008228
Time elapsed: 0.008176
Time elapsed: 0.008149
Time elapsed: 0.008233
Time elapsed: 0.008141
Time elapsed: 0.008217
Time elapsed: 0.008158
Time elapsed: 0.008269
airween commented 4 years ago

> Cool. Could you please share the code?

Sure, but I have to review and clean the code first; right now it's a piece of junk :).

> Could we integrate this into utils?

I think we can, but we need to discuss how (I don't think there is any C source under utils/).

dune73 commented 4 years ago

I do not mind introducing C source there. And if we do not like it long-term, we can always replace it with Python code. Please start the cleanup process. :)

mirkodziadzka-avi commented 4 years ago

@dune73 @airween As far as I know, there are Python bindings for PCRE, so we may be able to write this in Python. That would mean we could use Python's timeit module, which gives stable results for this kind of test.
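For illustration, a sketch of what a timeit-based harness could look like. The stdlib `re` module stands in here for a PCRE binding, and the pattern and input are made-up stand-ins for the rule regexes and the 130knomatch.txt test file:

```python
import re
import timeit

# Hypothetical pattern and a long non-matching subject.
pattern = re.compile(r"[0-9]a")
subject = "1" * 130000

# timeit.repeat reruns the whole measurement several times; taking the
# minimum discards warm-up noise and gives a stable figure.
times = timeit.repeat(lambda: pattern.search(subject), number=10, repeat=5)
print(f"best of 5 runs (10 searches each): {min(times):.6f}s")
```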

Maybe we can talk about this later. I have some ideas about how I can contribute here and would like to discuss them.

dune73 commented 4 years ago

That sounds great. Want to contact me on Slack?

airween commented 4 years ago

Python's PCRE binding is a bit old (I just found this one), but PCRE itself is old too :), so it doesn't necessarily need more updates.

Or we can use Python's FFI module. I've never used it before, so I'm not clear about its performance.

The advantage of Python is that we can parse the whole rule set (with secrules_parsing or msc_pyparser) and collect the @rx arguments (those are the regexes we want to check), and run them as tests. I also made a regex checker (it builds an AST from the regex) which tries to detect infinite sequences; I introduced it here. We can continue that work too.

airween commented 4 years ago

Just a quick look at python-pcre, and the first results (original, Mirko's, and Allan's patterns):

Time elapsed: 17.679897
Time elapsed: 17.645132
Time elapsed: 17.548189
Time elapsed: 17.438846
Time elapsed: 17.513020
Time elapsed: 17.582721
Time elapsed: 17.630454
Time elapsed: 17.457768
Time elapsed: 17.510023
Time elapsed: 17.701713
=====
Time elapsed: 0.005044
Time elapsed: 0.005106
Time elapsed: 0.005306
Time elapsed: 0.005131
Time elapsed: 0.005197
Time elapsed: 0.005032
Time elapsed: 0.005181
Time elapsed: 0.005029
Time elapsed: 0.005245
Time elapsed: 0.005155
=====
Time elapsed: 0.009067
Time elapsed: 0.009174
Time elapsed: 0.009096
Time elapsed: 0.008877
Time elapsed: 0.009030
Time elapsed: 0.008947
Time elapsed: 0.009120
Time elapsed: 0.008731
Time elapsed: 0.008714
Time elapsed: 0.008739

As can be seen, it has much worse performance, but it seems useful anyway.

Any ideas?

dune73 commented 4 years ago

I do not think the performance difference between the C and Python bindings is that big. The original has a 30-40% difference, but with the faster regexes it's really quite low. I think that's perfectly acceptable, especially if we can avoid C that way. :)

airween commented 4 years ago

Well, my curiosity won: I tried the CFFI module, and it is something catastrophic... Here you can find the code.

The results (Mirko's, Allan's):

Time elapsed: 0.006856
Time elapsed: 0.006678
Time elapsed: 0.006633
Time elapsed: 0.006714
Time elapsed: 0.006630
Time elapsed: 0.006581
Time elapsed: 0.006755
Time elapsed: 0.006698
Time elapsed: 0.006680
Time elapsed: 0.006658
=======
Time elapsed: 0.012194
Time elapsed: 0.012084
Time elapsed: 0.012091
Time elapsed: 0.012331
Time elapsed: 0.012059
Time elapsed: 0.012057
Time elapsed: 0.012047
Time elapsed: 0.012062
Time elapsed: 0.012266
Time elapsed: 0.012164

Okay, they are just a bit worse than python-pcre. But see the original pattern:

Time elapsed: 273.886888
Time elapsed: 270.109169
Time elapsed: 271.969392
^C

(I had to send a KILL signal to the process...)

I think we should stay with python-pcre, if we choose the "native" Python check :).

airween commented 4 years ago

Let me summarize the experiences for this issue.

For better analysis, I've made a small tool which helps to check the arguments of the @rx operator for any rule. The tool tries to emulate the real behavior of both ModSecurity engines (v2 and v3).

The currently used rules take a long time to evaluate the regular expressions in the case of v2, but only if the engine doesn't use the PCRE JIT feature:

1708/941120_0.txt - time elapsed: 13.211421
1708/942210_0.txt - time elapsed: 8.617792
1708/942260_0.txt - time elapsed: 8.639488

With JIT (or with v3), the problem doesn't exist:

1708/941120_0.txt - time elapsed: 0.000532
1708/942210_0.txt - time elapsed: 0.009518
1708/942260_0.txt - time elapsed: 0.003248

v3:

1708/941120_0.txt - time elapsed: 0.000394
1708/942210_0.txt - time elapsed: 0.005053
1708/942260_0.txt - time elapsed: 0.001524

First @allanrbo modified these regexes, then @mirkodziadzka-avi helped him clean them up. Now I have reviewed the commits and measured all scenarios.

Here you can find the regexes.

There are 9 files in 3 groups: the 3 affected rules, with 3 regexes per rule, distinguished by a suffix. 0 means the original (currently used), 1 is the first commit from Allan, and 2 is Mirko's proposal.

Methodology: I measured all regexes for all rules 3 times, checking each regex 10 times (with the -n switch).

Then I removed the MIN and MAX values from the set, summed the remaining 8 values, and divided by 8. In the table below, you can see these values; bold marks the better result for the compared regexes (suffixes 1 and 2).

| regex name | engine | min value | max value | sum | avg value |
|------------|--------|-----------|-----------|-----|-----------|
| 941120_1.txt | v2 JIT | 0.008214 s | 0.009779 s | 0.067451 s | 0.008431375 s |
| 941120_2.txt | v2 JIT | 0.004878 s | 0.007466 s | 0.039632 s | **0.004954000 s** |
| 941120_1.txt | v3 | 0.001065 s | 0.002500 s | 0.011098 s | 0.00138725 s |
| 941120_2.txt | v3 | 0.000052 s | 0.000065 s | 0.000418 s | **0.00005225 s** |
| 941120_1.txt | v2 no JIT | 0.000866 s | 0.002111 s | 0.010093 s | 0.001261625 s |
| 941120_2.txt | v2 no JIT | 0.000032 s | 0.000040 s | 0.000293 s | **0.000036625 s** |
| 942210_1.txt | v2 JIT | 0.003827 s | 0.008222 s | 0.031062 s | **0.003882750 s** |
| 942210_2.txt | v2 JIT | 0.004782 s | 0.009547 s | 0.040991 s | 0.005123875 s |
| 942210_1.txt | v3 | 0.004066 s | 0.007368 s | 0.034355 s | **0.004294375 s** |
| 942210_2.txt | v3 | 0.004728 s | 0.005796 s | 0.038315 s | 0.004789375 s |
| 942210_1.txt | v2 no JIT | 0.073550 s | 0.080512 s | 0.592089 s | **0.074011125 s** |
| 942210_2.txt | v2 no JIT | 0.097047 s | 0.104927 s | 0.789010 s | 0.098626250 s |
| 942260_1.txt | v2 JIT | 0.001630 s | 0.003943 s | 0.014886 s | 0.001860750 s |
| 942260_2.txt | v2 JIT | 0.001633 s | 0.003868 s | 0.014176 s | **0.001772000 s** |
| 942260_1.txt | v3 | 0.001812 s | 0.004339 s | 0.016818 s | 0.002102250 s |
| 942260_2.txt | v3 | 0.001638 s | 0.001722 s | 0.013368 s | **0.001671000 s** |
| 942260_1.txt | v2 no JIT | 0.030980 s | 0.036878 s | 0.249281 s | 0.031160125 s |
| 942260_2.txt | v2 no JIT | 0.022523 s | 0.025677 s | 0.182497 s | **0.022812125 s** |

How you can check: download the tool above and compile it. Download the patterns above into a subfolder, e.g. 1708. Here are the commands:

src/pcre4msc2 -j -n 10 1708/941120_1.txt somuch1.txt
src/pcre4msc2 -j -n 10 1708/941120_2.txt somuch1.txt
src/pcre4msc3 -n 10 1708/941120_1.txt somuch1.txt
src/pcre4msc3 -n 10 1708/941120_2.txt somuch1.txt
src/pcre4msc2 -n 10 1708/941120_1.txt somuch1.txt
src/pcre4msc2 -n 10 1708/941120_2.txt somuch1.txt

Summary: As you can see, Mirko's proposals have better performance than Allan's, except in the case of 942210. For that rule, Allan's commit is better by nearly 25%, which is significant!

I think we should close this PR and open a new one which contains Mirko's fix for 941120 and 942260, and Allan's for 942210. And that PR should be cleaner (with fewer commits :)).

@allanrbo, @mirkodziadzka-avi - and others: any opinion?

mirkodziadzka-avi commented 4 years ago

@airween Thanks for the work!

I would like to understand the difference in the performance of 942210.

I noticed that the order of the 2nd and 3rd groups is switched between 942210_1.txt and 942210_2.txt. I assume regexp-assemble.pl did this? As a wild guess (no measurement involved), I would assume that this reordering is responsible for the performance difference here.

Will check this if I find time.

airween commented 4 years ago

> I assume, regexp-assemble.pl did this?

Maybe; I haven't checked the "raw" format of the regex yet, just the compiled one. Of course, we have to investigate it.

> Will check this if I find time.

Thanks!

airween commented 4 years ago

> I noticed that the order of the 2nd and 3rd group between 942210_1.txt and 942210_2.txt is switched. I assume, regexp-assemble.pl did this?

I didn't find any raw regex data for @allanrbo's first commit. I assume he just modified the argument of @rx directly.

allanrbo commented 4 years ago

Yea, I probably (accidentally) just modified the .conf file directly. I can see that in later commits I realized this mistake and modified the correct regex-assemble file, but only after I had moved to Mirko's approach. I still think Mirko's approach is much cleaner. I don't agree that a 25% perf increase is significant enough to justify a much more complex and hard-to-maintain regex in this context. If I remember right, these regexes are very much corner cases, and the perf only becomes an issue with very specifically crafted input (likely attack input). I think we don't need to optimize for high perf for attackers :-). The happy path (non-attacks) is where perf matters.