Open dvershinin opened 4 weeks ago
200 bytes only? That sounds awful. Do you have benchmarks that can be shared?
CC: @jptosso
Also, what is the connector? If this is custom code, any chance you could share a pprof profile?
@fzipi @jptosso actually, this data was obtained by running coraza-benchmark with the latest CRS ruleset:
Engine  Case                 Status  TraObj  p1      p2        p3      p4      p5     Overall   Req/s       CPU       Req/s per Core
coraza  0-0-1 Body str 10    200     9080    684712  1823303   127111  213760  60720  2918686   342.619932  4312082   231.906536
coraza  0-0-2 Body str 100   200     6440    481942  2584924   70512   109520  47500  3300838   302.953371  3956901   252.723027
coraza  0-0-3 Body str 500   200     5870    411450  5128468   66020   107691  41020  5760519   173.595469  6037103   165.642362
coraza  0-0-4 Body str 1000  200     5320    430591  9206006   68730   107650  39080  9857377   101.446866  10027654  99.724223
coraza  0-0-5 Body str 2000  200     5960    582632  24856610  73730   124501  42160  25685593  38.932331   25926389  38.570740
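For anyone reading the table: the Req/s and Req/s per Core columns look like they are derived from Overall and CPU, assuming both are nanoseconds per request. The units are not stated in the output; this is inferred from the numbers (1e9 / 2918686 ≈ 342.62). A minimal Go sketch of that arithmetic, where `perSecond` is a hypothetical helper name:

```go
package main

import "fmt"

// perSecond converts a per-request duration in nanoseconds into requests
// per second. Treating the benchmark columns as nanoseconds is an
// assumption inferred from the numbers, not something the tool output states.
func perSecond(ns float64) float64 {
	return 1e9 / ns
}

func main() {
	// Values taken from the "Body str 10" row: Overall=2918686, CPU=4312082.
	fmt.Printf("Req/s:          %.2f\n", perSecond(2918686)) // 342.62
	fmt.Printf("Req/s per Core: %.2f\n", perSecond(4312082)) // 231.91
}
```

Read this way, the p2 column accounts for most of the Overall time as the body grows.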
A test is structured as:
- test_id: 4
stages:
- description: Body str 1000
input:
method: POST
uri: /this/is?some=query&string=here
version: HTTP/1.1
headers:
Host: www.example.com
Content-Type: application/x-www-form-urlencoded
Accept: text/plain
Accept-Encoding: gzip
Accept-Language: en-US
Accept-Charset: utf-8
Connection: keep-alive
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0
Content-Length: 22
data: asdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwer
asasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwer
asasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwer
asasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwer
asasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqweras
response:
status: 200
body: OK
headers:
content-length: 2
Whether the Content-Length request header value matches the data size doesn't play a significant role in the results: the drop in request rate is always there.
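To reproduce the "Body str N" cases outside the benchmark tool, a request with a body of a chosen size can be built with the standard library. This is only a sketch: `buildBody` and `newPostRequest` are hypothetical helpers, and how coraza-benchmark itself generates its payloads is not shown in this thread. Building the request this way also sets Content-Length from the actual body size.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// buildBody returns a payload of exactly n bytes made by repeating the
// 10-byte pattern seen in the test case above.
func buildBody(n int) string {
	const pattern = "asdfqweras"
	reps := n/len(pattern) + 1
	return strings.Repeat(pattern, reps)[:n]
}

// newPostRequest builds a POST request with a form-encoded content type
// and a body of the requested size.
func newPostRequest(url string, size int) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(buildBody(size)))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	return req, nil
}

func main() {
	const target = "http://www.example.com/this/is?some=query&string=here"
	for _, size := range []int{10, 100, 500, 1000, 2000} {
		req, err := newPostRequest(target, size)
		if err != nil {
			panic(err)
		}
		// http.NewRequest fills ContentLength for a *strings.Reader body.
		fmt.Printf("body=%d bytes, Content-Length=%d\n", size, req.ContentLength)
	}
}
```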
Thanks for letting us know. We'll try to see what happens there.
That package is completely deprecated. I will try to bring the latest version back to life. Results are nowhere similar to that. In my experience, in production, coraza often runs at least 1.2 to 2 times faster. Also, I'm not adding the modsec comparison to benchmarks anymore; it's not a good representation.
That package is completely deprecated
Are you referring to coraza-benchmark?
Results are nowhere similar to that. In my experience, in production, coraza often runs at least 1.2 to 2 times faster.
Have you been testing with varying request body payloads (POST requests)?
Also, I'd like to know why the modsec comparison is not representative in your opinion.
Yup, coraza-benchmark. Please do not use that package. If you want to compare both projects, use a connector like the HTTP middleware or Caddy and an HTTP benchmark tool; then you can compare it against modsec and nginx. Static benchmarks are not the right way to compare both projects. It's not representative at all.
We should document this process and archive that repo
@jptosso @fzipi at present I'm still using coraza-benchmark to pinpoint why an increased request body causes a significant performance drop. I'm running only the tests with a request body payload, as in my previous message, and increasing its size. coraza-benchmark is instrumented with pprof:
Engine  Case                 Status  TraObj  p1       p2         p3      p4      p5      Overall    Req/s       CPU        Req/s per Core
coraza  0-0-1 Body str 10    200     13540   979962   2954445    187471  294481  123540  4553439    219.614230  5731072    174.487426
coraza  0-0-2 Body str 100   200     12600   1029881  6069029    205070  248991  117811  7683382    130.151019  8118583    123.174204
coraza  0-0-3 Body str 500   200     13220   1195531  18991599   208341  289220  122440  20820351   48.029930   21106477   47.378821
coraza  0-0-4 Body str 1000  200     13090   1168121  34530552   213141  288861  133191  36346956   27.512620   36568223   27.346147
coraza  0-0-5 Body str 2000  200     16020   1132193  58836789   217120  295761  133680  60631563   16.493060   62080670   16.108074
coraza  0-0-6 Body str 4000  200     15000   1299133  142376256  239691  313301  137720  144381101  6.926114    142723833  7.006538
coraza  0-0-7 Body str 8000  200     16290   1316482  286847844  249270  338241  140869  288908996  3.461298    278690927  3.588204
top in pprof shows:
(pprof) top
Showing nodes accounting for 47.40s, 78.42% of 60.44s total
Dropped 403 nodes (cum <= 0.30s)
Showing top 10 nodes out of 64
flat flat% sum% cum cum%
15.62s 25.84% 25.84% 19.01s 31.45% regexp.(*machine).add
7.83s 12.95% 38.80% 13.60s 22.50% regexp.(*machine).step
7.55s 12.49% 51.29% 16.75s 27.71% regexp.(*Regexp).tryBacktrack
3.13s 5.18% 56.47% 4.43s 7.33% regexp/syntax.(*Inst).MatchRunePos
2.95s 4.88% 61.35% 3.44s 5.69% regexp.(*bitState).push
2.78s 4.60% 65.95% 2.78s 4.60% regexp.(*bitState).shouldVisit
2.15s 3.56% 69.51% 2.17s 3.59% regexp.(*machine).alloc
2.01s 3.33% 72.83% 2.12s 3.51% strings.ToLower
1.96s 3.24% 76.08% 1.98s 3.28% regexp.(*inputString).step
1.42s 2.35% 78.42% 32.53s 53.82% regexp.(*machine).match
If I look at the profiling data's SVG file, it appears that the path leading up to the heavy regex processing is from FindStringSubmatch. I believe that is being invoked when processing rules with the capture flag. Is this the right direction to check?
Looking forward to your input.
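One way to check this hypothesis in isolation is to time `MatchString` (no capture groups needed) against `FindStringSubmatch` (capture groups tracked) on the same pattern and growing inputs. This is a sketch with a made-up pattern rather than a CRS rule, `timeIt` is a hypothetical helper, and whether the two calls differ at all depends on the pattern and the Go version, so no expected numbers are given.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

// timeIt runs fn n times and returns the average duration per call.
func timeIt(n int, fn func()) time.Duration {
	start := time.Now()
	for i := 0; i < n; i++ {
		fn()
	}
	return time.Since(start) / time.Duration(n)
}

func main() {
	// Hypothetical pattern with two capture groups.
	re := regexp.MustCompile(`(?i)(union)\s+(select)`)

	for _, size := range []int{100, 1000, 8000} {
		input := strings.Repeat("asdfqweras", size/10)
		match := timeIt(200, func() { re.MatchString(input) })
		sub := timeIt(200, func() { re.FindStringSubmatch(input) })
		fmt.Printf("size=%d MatchString=%v FindStringSubmatch=%v\n", size, match, sub)
	}
}
```

If the gap grows with the input size, that would support looking at the capture path first.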
As mentioned before, we will archive coraza-benchmark and provide a new benchmarking mechanism. You are looking at a well-known issue with a few solutions depending on the traffic type. I've found that go re2 is ok, but most enterprise users are replacing re2 with https://github.com/corazawaf/coraza-wasilibs, which leverages the re2 C library. Technically we can't do anything about the go re2 implementation; it's in a different layer. In the next few weeks we are meeting in London and we are going to work around this issue. In the meantime, rest assured that even if regexes seem slow in Coraza, using https://github.com/corazawaf/coraza-wasilibs has proven to be extremely performant, and in high-traffic environments it behaves pretty well. I've seen Coraza handling billions of transactions per day.
@jptosso thank you, indeed coraza-wasilibs solves this issue although at the cost of significantly increased memory usage. In our test it went up from max RSS usage of around 68 MB (bare Coraza) to 257 MB (Coraza with wasilibs), see below.
Is there a way to improve regexp performance without wasilibs, that is, while keeping memory usage low, or perhaps to improve wasilibs for a lower memory profile?
Benchmarks:
--- Running on: "AMD Ryzen 7 7700 8-Core Processor", CPU=16, Memory=66GB, Iterations=500, Percentil=0.95, CPU via cgroup=true
Engine  Case                      Status  TraObj  p1      p2        p3     p4     p5     Overall   Req/s        CPU       Req/s per Core
coraza  0-0-1 Body str 10         200     1913    134983  408627    19737  34445  13205  612910    1631.560914  1009117   990.965369
coraza  0-0-2 Body str 100        200     1773    125365  661972    17763  31269  10971  849113    1177.699552  1009758   990.336298
coraza  0-0-3 Body str 500        200     2314    126929  1932484   17784  32220  11812  2123543   470.911114   2994803   333.911780
coraza  0-0-4 Body str 1000       200     2344    127830  3371825   18334  32410  11742  3564485   280.545436   4007125   249.555479
coraza  0-0-5 Body str 2000       200     2205    130365  6738148   18014  32010  11872  6932614   144.245735   7005914   142.736551
coraza  0-0-6 Body str 4000       200     2335    139282  13320174  20498  34575  13395  13530259  73.908415    13997244  71.442635
coraza  0-0-7 Body str 8000       200     2134    140504  27274507  20268  35196  13275  27485884  36.382312    27981863  35.737435
coraza  0-0-1 Get str 20          200     1893    116569  228028    18996  32621  11762  409869    2439.803937  1009128   990.954567
coraza  0-0-2 Get str 100         200     2344    121728  760817    20238  31479  18845  955451    1046.626148  1016068   984.186098
coraza  0-0-3 Get str 500         200     2555    166282  3348300   26440  33453  18614  3595644   278.114296   15916490  62.827922
coraza  0-0-4 Get str 1000        200     2475    213040  5849721   32772  34114  25397  6157519   162.403072   19392666  51.565886
coraza  0-0-5 Get str 2000        200     2595    390623  10325758  49042  35537  39073  10842628  92.228563    25187205  39.702698
coraza  0-0-1 URlenc 20           200     1763    124673  438292    40646  31679  34485  671538    1489.119007  1009448   990.640429
coraza  0-0-2 URlenc 100          200     2334    127800  887074    41557  32791  36579  1128135   886.418735   2007607   498.105456
coraza  0-0-3 URlenc 500          200     2354    130675  3593831   39734  35607  33052  3835253   260.738992   16494527  60.626170
coraza  0-0-4 URlenc 1000         200     2465    132037  6028086   32251  33543  24386  6252768   159.929171   19566307  51.108265
coraza  0-0-5 URlenc 2000         200     2625    136997  10642552  46237  34044  36158  10898613  91.754795    25020652  39.966984
coraza  0-0-1 Method FOO          403     1913    128501  3487051   320    141    42500  3660426   273.192246   4008881   249.446167
coraza  0-0-1 Simple GET          200     1623    114024  304601    40115  31469  33933  525765    1901.990433  1009309   990.776858
coraza  0-0-2 Simple JSON         200     1894    128461  3743942   18284  32150  12403  3937134   253.991863   4010983   249.315442
coraza  0-0-3 Simple URL-encoded  200     1903    125435  428614    18134  31829  11131  617046    1620.624718  1009559   990.531509
coraza  0-0-4 Simple XML          200     1753    127098  2013767   18915  32231  11312  2205076   453.499108   3000685   333.257240
real 0m51.809s
user 1m0.453s
sys 0m0.475s
CPU: 60925050802
CPU usage: 60925050802
Max RSS usage: 69940 kB (68 MB)
Average RSS usage: 65286 kB (63 MB)
Max Cache usage: 12100 kB (11 MB)
Average CACHE usage: 12076 kB (11 MB)
--- Running on: "AMD Ryzen 7 7700 8-Core Processor", CPU=16, Memory=66GB, Iterations=500, Percentil=0.95, CPU via cgroup=true
Engine  Case                      Status  TraObj  p1      p2        p3     p4     p5     Overall   Req/s        CPU       Req/s per Core
coraza  0-0-1 Body str 10         200     2114    161453  532388    20348  50755  13525  780583    1281.093747  1008897   991.181459
coraza  0-0-2 Body str 100        200     2195    149420  521298    18625  47429  12613  751580    1330.530349  1008807   991.269886
coraza  0-0-3 Body str 500        200     1924    147367  711875    18685  46938  11601  938390    1065.655005  1009157   990.926090
coraza  0-0-4 Body str 1000       200     1713    146696  941797    17513  46427  11371  1165517   857.988343   2005202   498.702874
coraza  0-0-5 Body str 2000       200     2184    150733  1437857   18174  47359  12383  1668690   599.272483   2007254   498.193054
coraza  0-0-6 Body str 4000       200     2074    149089  2390483   18725  47459  11411  2619241   381.789992   3009770   332.251302
coraza  0-0-7 Body str 8000       200     1784    160000  4273966   19677  50134  12774  4518335   221.320464   5007219   199.711656
coraza  0-0-1 Get str 20          200     1583    127189  256080    17573  46968  11000  460393    2172.057351  1008909   991.169669
coraza  0-0-2 Get str 100         200     1813    145082  1041844   19476  48381  18044  1274640   784.535241   2006533   498.372068
coraza  0-0-3 Get str 500         200     2214    195727  4246135   24956  49313  18635  4536980   220.410934   5873533   170.255279
coraza  0-0-4 Get str 1000        200     2375    256932  8634936   31950  50866  24516  9001575   111.091670   13879031  72.051140
coraza  0-0-5 Get str 2000        200     2495    365055  15989199  45064  52629  36829  16491271  60.638140    23321215  42.879413
coraza  0-0-1 URlenc 20           200     2835    159098  640611    43061  52609  36538  934752    1069.802472  1010120   989.981388
coraza  0-0-2 URlenc 100          200     2244    151133  1249674   41338  48711  35527  1528627   654.181825   2007427   498.150120
coraza  0-0-3 URlenc 500          200     2334    155221  4331845   42570  50184  35967  4618121   216.538285   5015874   199.367049
coraza  0-0-4 URlenc 1000         200     2525    158106  8670874   41839  50675  35347  8959366   111.615041   14777208  67.671782
coraza  0-0-5 URlenc 2000         200     2405    161323  16317125  42821  52569  36388  16612631  60.195161    24097587  41.497931
coraza  0-0-1 Method FOO          403     1784    150663  972343    291    150    42310  1167541   856.500971   1997687   500.578920
coraza  0-0-1 Simple GET          200     1794    133691  383900    41478  48882  34976  644721    1551.058520  1009138   990.944747
coraza  0-0-2 Simple JSON         200     1884    149751  999154    40787  48521  35176  1275273   784.145826   2006522   498.374800
coraza  0-0-3 Simple URL-encoded  200     1763    147246  579788    41317  48000  34776  852890    1172.484142  1008626   991.447772
coraza  0-0-4 Simple XML          200     1853    149361  765836    41468  49433  35576  1043527   958.288573   1009528   990.561926
real 0m39.610s
user 0m40.966s
sys 0m0.451s
CPU: 41414978467
CPU usage: 41414978467
Max RSS usage: 263772 kB (257 MB)
Average RSS usage: 214583 kB (209 MB)
Max Cache usage: 12860 kB (12 MB)
Average CACHE usage: 12817 kB (12 MB)
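To put the two runs side by side, here is the arithmetic on a few of the reported figures. It assumes the first run is bare Coraza and the second is Coraza with wasilibs, which matches the 68 MB and 257 MB max RSS values quoted above; `ratio` is just a hypothetical helper.

```go
package main

import "fmt"

// ratio returns how many times larger after is than before.
func ratio(after, before float64) float64 {
	return after / before
}

func main() {
	// "Body str 8000" Req/s: 36.382312 (first run) vs 221.320464 (second run).
	fmt.Printf("Body str 8000 throughput: %.2fx\n", ratio(221.320464, 36.382312)) // 6.08x
	// "Get str 2000" Req/s: 92.228563 (first run) vs 60.638140 (second run).
	fmt.Printf("Get str 2000 throughput:  %.2fx\n", ratio(60.638140, 92.228563)) // 0.66x
	// Max RSS: 69940 kB (first run) vs 263772 kB (second run).
	fmt.Printf("Max RSS:                  %.2fx\n", ratio(263772, 69940)) // 3.77x
}
```

So on these numbers the large-body cases get roughly six times faster for about 3.8 times the memory, while the long query string case is slower in the second run.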
Good to hear. We have also tested redo and hyperscan, with different results for each one. As mentioned before, we will do some research in London during the next month. I believe there is a lot of potential in hyperscan.
Maybe use https://github.com/VectorCamp/vectorscan instead.
When you say you tested hyperscan, does it mean there is a prototype implementation of coraza with hyperscan? Are there particular challenges with a hyperscan implementation? Is there any documentation/explanation on how hyperscan would work with Coraza? My understanding is that one of the key benefits of hyperscan is that you can take a large number of regexes and very efficiently match all of them against the text in one go. I am not sure how well that maps to Coraza and the way it handles regex-based scanning.
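To make the "many patterns in one go" idea concrete without pulling in hyperscan, here is a stdlib-only approximation: the patterns are joined into one alternation that acts as a prefilter, and the individual regexes only run when the combined one matches. This is not how hyperscan works internally and not how Coraza evaluates rules; `ruleSet` and the three patterns are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// ruleSet holds the individual patterns plus one combined alternation.
type ruleSet struct {
	rules    []*regexp.Regexp
	combined *regexp.Regexp
}

// newRuleSet compiles each pattern and a single `(?:p1)|(?:p2)|...` regex.
func newRuleSet(patterns []string) (*ruleSet, error) {
	rs := &ruleSet{}
	parts := make([]string, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, err
		}
		rs.rules = append(rs.rules, re)
		// Inline flags such as (?i) stay scoped to their own group.
		parts = append(parts, "(?:"+p+")")
	}
	combined, err := regexp.Compile(strings.Join(parts, "|"))
	if err != nil {
		return nil, err
	}
	rs.combined = combined
	return rs, nil
}

// matches returns the indexes of the patterns that match input.
func (rs *ruleSet) matches(input string) []int {
	// If the alternation does not match, no individual pattern can match,
	// so a single scan rules out the whole set.
	if !rs.combined.MatchString(input) {
		return nil
	}
	var hits []int
	for i, re := range rs.rules {
		if re.MatchString(input) {
			hits = append(hits, i)
		}
	}
	return hits
}

func main() {
	rs, err := newRuleSet([]string{`(?i)union\s+select`, `<script\b`, `\.\./`})
	if err != nil {
		panic(err)
	}
	fmt.Println(rs.matches("asdfqwerasasdfqweras"))
	fmt.Println(rs.matches("id=1 UNION  SELECT x"))
}
```

This only helps when most inputs match nothing. In a rule engine, each rule may also apply different transformations or look at different variables, which is part of why the mapping is not obvious.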
We will post results about our research after the OWASP project summit this month. Right now it's just ideas and short experiments.
Using hyperscan would vendor-lock us to Intel, so it got discarded.
In the meantime, we will have to stick to go-re2
Any reason not to make it a switchable opt-in? Would it require architectural changes that would be hard to maintain, or is it just another pluggable thing?
The reason I'm asking this is that in our case it will not be a problem to simply ship multiple binaries
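To illustrate what "pluggable" could mean here, a minimal sketch of a regex engine registry follows. Everything in it (`Matcher`, `Compiler`, `Register`, `Compile`, the engine names) is hypothetical and is not Coraza's actual plugin API; only the stdlib engine is registered.

```go
package main

import (
	"fmt"
	"regexp"
)

// Matcher is a hypothetical minimal interface a regex backend would satisfy.
type Matcher interface {
	MatchString(s string) bool
}

// Compiler builds a Matcher from a pattern.
type Compiler func(pattern string) (Matcher, error)

// engines maps an engine name to its compiler.
var engines = map[string]Compiler{}

// Register makes a regex engine available under the given name.
func Register(name string, c Compiler) {
	engines[name] = c
}

// Compile builds a Matcher using the named engine.
func Compile(engine, pattern string) (Matcher, error) {
	c, ok := engines[engine]
	if !ok {
		return nil, fmt.Errorf("unknown regex engine %q", engine)
	}
	return c(pattern)
}

func init() {
	// The standard library engine is always available in this sketch.
	Register("stdlib", func(pattern string) (Matcher, error) {
		re, err := regexp.Compile(pattern)
		if err != nil {
			return nil, err
		}
		return re, nil
	})
}

func main() {
	m, err := Compile("stdlib", `(?i)union\s+select`)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.MatchString("1 UNION SELECT 2")) // true

	// An engine that was never registered is reported as an error.
	_, err = Compile("vectorscan", `a+b`)
	fmt.Println(err)
}
```

With a shape like this, a binary that links an extra engine would just call `Register` from its own package, which fits the "ship multiple binaries" approach.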
hyperscan works with Intel & AMD CPUs. Hyperscan's fork, vectorscan, works on and is optimized for Intel, AMD, ARM & POWER7+ IBM CPUs.
Does that change the decision in any way?
oh, and just in case, we have seen 3x performance improvement with hyperscan/vectorscan on regex matching against files (vs re2). So, it is a rather drastic change in both throughput & CPU usage
That looks interesting. I will run some tests
We are evaluating Coraza WAF and have observed significant performance degradation when processing larger request bodies. Specifically, starting from a request body size of around 200 bytes, Coraza’s throughput decreases by 2 to 3 times compared to ModSecurity using the same OWASP Core Rule Set (CRS) v4.
Is this a known issue with Coraza’s handling of larger request bodies?