Performance drop with larger request body

dvershinin commented 4 weeks ago

We are evaluating Coraza WAF and have observed significant performance degradation when processing larger request bodies. Specifically, starting from a request body size of around 200 bytes, Coraza’s throughput decreases by 2 to 3 times compared to ModSecurity using the same OWASP Core Rule Set (CRS) v4.

Is this a known issue with Coraza’s handling of larger request bodies?

fzipi commented 4 weeks ago

Only 200 bytes? That sounds awful. Do you have benchmarks you can share?

CC: @jptosso

jptosso commented 4 weeks ago

Also, what is the connector? If this is custom code, any chance you could share a pprof profile?

https://jvns.ca/blog/2017/09/24/profiling-go-with-pprof/

dvershinin commented 3 weeks ago

@fzipi @jptosso actually, this data was obtained by running coraza-benchmark with the latest CRS ruleset:

Engine  Case                      Status  TraObj  p1       p2        p3      p4      p5      Overall   Req/s       CPU       Req/s per Core
coraza  0-0-1 Body str 10         200     9080    684712   1823303   127111  213760  60720   2918686   342.619932  4312082   231.906536
coraza  0-0-2 Body str 100        200     6440    481942   2584924   70512   109520  47500   3300838   302.953371  3956901   252.723027
coraza  0-0-3 Body str 500        200     5870    411450   5128468   66020   107691  41020   5760519   173.595469  6037103   165.642362
coraza  0-0-4 Body str 1000       200     5320    430591   9206006   68730   107650  39080   9857377   101.446866  10027654  99.724223
coraza  0-0-5 Body str 2000       200     5960    582632   24856610  73730   124501  42160   25685593  38.932331   25926389  38.570740

A test is structured as:

- test_id: 4
  stages:
  - description: Body str 1000
    input:
      method: POST
      uri: /this/is?some=query&string=here
      version: HTTP/1.1
      headers:
        Host: www.example.com
        Content-Type: application/x-www-form-urlencoded
        Accept: text/plain
        Accept-Encoding: gzip
        Accept-Language: en-US
        Accept-Charset: utf-8
        Connection: keep-alive
        User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0
        Content-Length: 22
      data: asdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwer
asasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwer
asasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwer
asasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqwer
asasdfqwerasasdfqwerasasdfqwerasasdfqwerasasdfqweras
      response:
        status: 200
        body: OK
        headers:
          content-length: 2

Whether the Content-Length request header matches the actual data size does not significantly affect the results: the drop in request rate is always there.

fzipi commented 3 weeks ago

Thanks for letting us know. We'll try to see what happens there.

jptosso commented 3 weeks ago

That package is completely deprecated. I will try to bring the latest version back to life. Results are nowhere near that. In my experience, in production, Coraza often runs 1.2 to 2 times faster. Also, I'm not adding the modsec comparison to benchmarks anymore; it's not a good representation.

dvershinin commented 3 weeks ago

> That package is completely deprecated

Are you referring to coraza-benchmark?

> Results are nowhere similar to that. In my experience, in production, coraza often runs at least 1.2 to 2 times faster.

Have you tested with varying request body payload sizes (POST requests)?

Also, I'd like to know why, in your opinion, the modsec comparison is not representative.

jptosso commented 3 weeks ago

Yup, coraza-benchmark. Please do not use that package. If you want to compare both projects, use a connector such as the HTTP middleware or Caddy together with an HTTP benchmark tool; then you can compare it against modsec and nginx. Static benchmarks are not the right way to compare both projects. They are not representative at all.

We should document this process and archive that repo

dvershinin commented 3 weeks ago

@jptosso @fzipi for now I am still using coraza-benchmark to pinpoint why a larger request body causes a significant performance drop. I ran only the request body tests from my previous message, with increasing payload sizes, and instrumented coraza-benchmark with pprof:

Engine  Case                 Status  TraObj  p1       p2         p3      p4      p5      Overall    Req/s       CPU        Req/s per Core
coraza  0-0-1 Body str 10    200     13540   979962   2954445    187471  294481  123540  4553439    219.614230  5731072    174.487426
coraza  0-0-2 Body str 100   200     12600   1029881  6069029    205070  248991  117811  7683382    130.151019  8118583    123.174204
coraza  0-0-3 Body str 500   200     13220   1195531  18991599   208341  289220  122440  20820351   48.029930   21106477   47.378821
coraza  0-0-4 Body str 1000  200     13090   1168121  34530552   213141  288861  133191  36346956   27.512620   36568223   27.346147
coraza  0-0-5 Body str 2000  200     16020   1132193  58836789   217120  295761  133680  60631563   16.493060   62080670   16.108074
coraza  0-0-6 Body str 4000  200     15000   1299133  142376256  239691  313301  137720  144381101  6.926114    142723833  7.006538
coraza  0-0-7 Body str 8000  200     16290   1316482  286847844  249270  338241  140869  288908996  3.461298    278690927  3.588204

top in pprof shows:

(pprof) top
Showing nodes accounting for 47.40s, 78.42% of 60.44s total
Dropped 403 nodes (cum <= 0.30s)
Showing top 10 nodes out of 64
      flat  flat%   sum%        cum   cum%
    15.62s 25.84% 25.84%     19.01s 31.45%  regexp.(*machine).add
     7.83s 12.95% 38.80%     13.60s 22.50%  regexp.(*machine).step
     7.55s 12.49% 51.29%     16.75s 27.71%  regexp.(*Regexp).tryBacktrack
     3.13s  5.18% 56.47%      4.43s  7.33%  regexp/syntax.(*Inst).MatchRunePos
     2.95s  4.88% 61.35%      3.44s  5.69%  regexp.(*bitState).push
     2.78s  4.60% 65.95%      2.78s  4.60%  regexp.(*bitState).shouldVisit
     2.15s  3.56% 69.51%      2.17s  3.59%  regexp.(*machine).alloc
     2.01s  3.33% 72.83%      2.12s  3.51%  strings.ToLower
     1.96s  3.24% 76.08%      1.98s  3.28%  regexp.(*inputString).step
     1.42s  2.35% 78.42%     32.53s 53.82%  regexp.(*machine).match

[Image: profile001]

Looking at the SVG rendering of the profiling data, the path leading to the heavy regex processing goes through FindStringSubmatch. I believe that is invoked when processing rules with the capture flag. Is this the right direction to check?

Looking forward to your input.

jptosso commented 3 weeks ago

As mentioned before, we will archive coraza-benchmark and provide a new benchmarking mechanism. You are looking at a well-known issue with a few solutions, depending on the traffic type. I've found that go re2 is OK, but most enterprise users are replacing re2 with https://github.com/corazawaf/coraza-wasilibs, which leverages the re2 C library. Technically, we can't do anything about the go re2 implementation; it's in a different layer. In the next few weeks we are meeting in London and we are going to work around this issue. In the meantime, rest assured that even if regexes seem slow in Coraza, using https://github.com/corazawaf/coraza-wasilibs has proven to be extremely performant, and in high-traffic environments it behaves pretty well. I've seen Coraza handling billions of transactions per day.

dvershinin commented 3 weeks ago

@jptosso thank you, indeed coraza-wasilibs solves this issue although at the cost of significantly increased memory usage. In our test it went up from max RSS usage of around 68 MB (bare Coraza) to 257 MB (Coraza with wasilibs), see below.

Is there a way to improve regexp performance without wasilibs, that is, while keeping memory usage low? Or could wasilibs perhaps be improved to have a lower memory profile?

Benchmarks:

Bare Coraza

--- Running on: "AMD Ryzen 7 7700 8-Core Processor", CPU=16, Memory=66GB, Iterations=500, Percentil=0.95, CPU via cgroup=true

Engine  Case                      Status  TraObj  p1      p2        p3     p4     p5     Overall   Req/s        CPU       Req/s per Core
coraza  0-0-1 Body str 10         200     1913    134983  408627    19737  34445  13205  612910    1631.560914  1009117   990.965369
coraza  0-0-2 Body str 100        200     1773    125365  661972    17763  31269  10971  849113    1177.699552  1009758   990.336298
coraza  0-0-3 Body str 500        200     2314    126929  1932484   17784  32220  11812  2123543   470.911114   2994803   333.911780
coraza  0-0-4 Body str 1000       200     2344    127830  3371825   18334  32410  11742  3564485   280.545436   4007125   249.555479
coraza  0-0-5 Body str 2000       200     2205    130365  6738148   18014  32010  11872  6932614   144.245735   7005914   142.736551
coraza  0-0-6 Body str 4000       200     2335    139282  13320174  20498  34575  13395  13530259  73.908415    13997244  71.442635
coraza  0-0-7 Body str 8000       200     2134    140504  27274507  20268  35196  13275  27485884  36.382312    27981863  35.737435
coraza  0-0-1 Get str 20          200     1893    116569  228028    18996  32621  11762  409869    2439.803937  1009128   990.954567
coraza  0-0-2 Get str 100         200     2344    121728  760817    20238  31479  18845  955451    1046.626148  1016068   984.186098
coraza  0-0-3 Get str 500         200     2555    166282  3348300   26440  33453  18614  3595644   278.114296   15916490  62.827922
coraza  0-0-4 Get str 1000        200     2475    213040  5849721   32772  34114  25397  6157519   162.403072   19392666  51.565886
coraza  0-0-5 Get str 2000        200     2595    390623  10325758  49042  35537  39073  10842628  92.228563    25187205  39.702698
coraza  0-0-1 URlenc 20           200     1763    124673  438292    40646  31679  34485  671538    1489.119007  1009448   990.640429
coraza  0-0-2 URlenc 100          200     2334    127800  887074    41557  32791  36579  1128135   886.418735   2007607   498.105456
coraza  0-0-3 URlenc 500          200     2354    130675  3593831   39734  35607  33052  3835253   260.738992   16494527  60.626170
coraza  0-0-4 URlenc 1000         200     2465    132037  6028086   32251  33543  24386  6252768   159.929171   19566307  51.108265
coraza  0-0-5 URlenc 2000         200     2625    136997  10642552  46237  34044  36158  10898613  91.754795    25020652  39.966984
coraza  0-0-1 Method FOO          403     1913    128501  3487051   320    141    42500  3660426   273.192246   4008881   249.446167
coraza  0-0-1 Simple GET          200     1623    114024  304601    40115  31469  33933  525765    1901.990433  1009309   990.776858
coraza  0-0-2 Simple JSON         200     1894    128461  3743942   18284  32150  12403  3937134   253.991863   4010983   249.315442
coraza  0-0-3 Simple URL-encoded  200     1903    125435  428614    18134  31829  11131  617046    1620.624718  1009559   990.531509
coraza  0-0-4 Simple XML          200     1753    127098  2013767   18915  32231  11312  2205076   453.499108   3000685   333.257240

real    0m51.809s
user    1m0.453s
sys     0m0.475s
CPU:    60925050802
CPU usage:           60925050802
Max RSS usage:       69940 kB (68 MB)
Average RSS usage:   65286 kB (63 MB)
Max Cache usage:     12100 kB (11 MB)
Average CACHE usage: 12076 kB (11 MB)

With wasilibs

--- Running on: "AMD Ryzen 7 7700 8-Core Processor", CPU=16, Memory=66GB, Iterations=500, Percentil=0.95, CPU via cgroup=true

Engine  Case                      Status  TraObj  p1      p2        p3     p4     p5     Overall   Req/s        CPU       Req/s per Core
coraza  0-0-1 Body str 10         200     2114    161453  532388    20348  50755  13525  780583    1281.093747  1008897   991.181459
coraza  0-0-2 Body str 100        200     2195    149420  521298    18625  47429  12613  751580    1330.530349  1008807   991.269886
coraza  0-0-3 Body str 500        200     1924    147367  711875    18685  46938  11601  938390    1065.655005  1009157   990.926090
coraza  0-0-4 Body str 1000       200     1713    146696  941797    17513  46427  11371  1165517   857.988343   2005202   498.702874
coraza  0-0-5 Body str 2000       200     2184    150733  1437857   18174  47359  12383  1668690   599.272483   2007254   498.193054
coraza  0-0-6 Body str 4000       200     2074    149089  2390483   18725  47459  11411  2619241   381.789992   3009770   332.251302
coraza  0-0-7 Body str 8000       200     1784    160000  4273966   19677  50134  12774  4518335   221.320464   5007219   199.711656
coraza  0-0-1 Get str 20          200     1583    127189  256080    17573  46968  11000  460393    2172.057351  1008909   991.169669
coraza  0-0-2 Get str 100         200     1813    145082  1041844   19476  48381  18044  1274640   784.535241   2006533   498.372068
coraza  0-0-3 Get str 500         200     2214    195727  4246135   24956  49313  18635  4536980   220.410934   5873533   170.255279
coraza  0-0-4 Get str 1000        200     2375    256932  8634936   31950  50866  24516  9001575   111.091670   13879031  72.051140
coraza  0-0-5 Get str 2000        200     2495    365055  15989199  45064  52629  36829  16491271  60.638140    23321215  42.879413
coraza  0-0-1 URlenc 20           200     2835    159098  640611    43061  52609  36538  934752    1069.802472  1010120   989.981388
coraza  0-0-2 URlenc 100          200     2244    151133  1249674   41338  48711  35527  1528627   654.181825   2007427   498.150120
coraza  0-0-3 URlenc 500          200     2334    155221  4331845   42570  50184  35967  4618121   216.538285   5015874   199.367049
coraza  0-0-4 URlenc 1000         200     2525    158106  8670874   41839  50675  35347  8959366   111.615041   14777208  67.671782
coraza  0-0-5 URlenc 2000         200     2405    161323  16317125  42821  52569  36388  16612631  60.195161    24097587  41.497931
coraza  0-0-1 Method FOO          403     1784    150663  972343    291    150    42310  1167541   856.500971   1997687   500.578920
coraza  0-0-1 Simple GET          200     1794    133691  383900    41478  48882  34976  644721    1551.058520  1009138   990.944747
coraza  0-0-2 Simple JSON         200     1884    149751  999154    40787  48521  35176  1275273   784.145826   2006522   498.374800
coraza  0-0-3 Simple URL-encoded  200     1763    147246  579788    41317  48000  34776  852890    1172.484142  1008626   991.447772
coraza  0-0-4 Simple XML          200     1853    149361  765836    41468  49433  35576  1043527   958.288573   1009528   990.561926

real    0m39.610s
user    0m40.966s
sys     0m0.451s
CPU:    41414978467
CPU usage:           41414978467
Max RSS usage:       263772 kB (257 MB)
Average RSS usage:   214583 kB (209 MB)
Max Cache usage:     12860 kB (12 MB)
Average CACHE usage: 12817 kB (12 MB)
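
The trade-off in the two runs above can be put into numbers directly. The sketch below copies the `Req/s` values of the `Body str` cases and the `Max RSS usage` figures from the two outputs and computes the throughput ratio per body size and the memory ratio. The type and function names are illustrative.

```go
package main

import "fmt"

// row holds the Req/s column for one "Body str" case in both runs.
type row struct {
	bodySize    int
	bareRPS     float64 // bare Coraza
	wasilibsRPS float64 // Coraza with wasilibs
}

// Values copied from the benchmark outputs above.
var bodyStr = []row{
	{10, 1631.560914, 1281.093747},
	{100, 1177.699552, 1330.530349},
	{500, 470.911114, 1065.655005},
	{1000, 280.545436, 857.988343},
	{2000, 144.245735, 599.272483},
	{4000, 73.908415, 381.789992},
	{8000, 36.382312, 221.320464},
}

// Max RSS usage in kB, as reported at the end of each run.
const (
	bareMaxRSSkB     = 69940
	wasilibsMaxRSSkB = 263772
)

// speedup returns how many times faster the wasilibs run was for a case.
// Values below 1 mean the bare run was faster.
func speedup(r row) float64 {
	return r.wasilibsRPS / r.bareRPS
}

// rssRatio returns how many times more peak memory the wasilibs run used.
func rssRatio() float64 {
	return float64(wasilibsMaxRSSkB) / float64(bareMaxRSSkB)
}

func main() {
	for _, r := range bodyStr {
		fmt.Printf("Body str %-5d throughput ratio: %.2fx\n", r.bodySize, speedup(r))
	}
	fmt.Printf("Max RSS ratio: %.2fx\n", rssRatio())
}
```

For these cases the throughput gain grows with body size, while the peak memory cost is a fixed multiple for the whole run.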

jptosso commented 3 weeks ago

Good to hear. We have also tested redo and hyperscan, with different results for each one. As mentioned before, we will do some research in London next month. I believe there is a lot of potential in hyperscan.

fzipi commented 3 weeks ago

Maybe use https://github.com/VectorCamp/vectorscan instead.

cloudlinuxadmin commented 2 weeks ago

When you say you tested hyperscan, does that mean there is a prototype implementation of Coraza with hyperscan? Are there particular challenges with a hyperscan implementation? Is there any documentation or explanation of how hyperscan would work with Coraza? My understanding is that one of the key benefits of hyperscan is that you can take a large number of regexes and very efficiently match all of them against the text in one go. I am not sure how well that maps onto the way Coraza handles regex-based scanning.

jptosso commented 2 weeks ago

We will post the results of our research after the OWASP project summit this month. Right now it's just ideas and short experiments.

jptosso commented 1 week ago

Using hyperscan would vendor-lock us to Intel, so it got discarded.

In the meantime, we will have to stick to go-re2

ssergiienko commented 1 week ago

Is there any reason not to make it a switchable opt-in? Would it require architectural changes that would be hard to maintain, or is it just another pluggable component?

The reason I'm asking is that, in our case, simply shipping multiple binaries would not be a problem.
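
To make the "switchable opt-in" idea concrete, here is a minimal sketch of a regex engine registry in Go. Everything in it is hypothetical: `Matcher`, `Compiler`, `RegisterEngine`, and `Compile` are names invented for this illustration and do not describe Coraza's actual plugin mechanism. Only the standard library engine is registered; an alternative engine would register itself under another name, possibly behind a build tag.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// Matcher is a HYPOTHETICAL minimal interface that any regex engine
// would have to satisfy to be pluggable.
type Matcher interface {
	MatchString(s string) bool
}

// Compiler builds a Matcher from a pattern.
type Compiler func(pattern string) (Matcher, error)

// engines maps an engine name to its compiler.
var engines = map[string]Compiler{}

// RegisterEngine makes an engine selectable by name.
func RegisterEngine(name string, c Compiler) {
	engines[name] = c
}

// engineNames returns the registered names in sorted order.
func engineNames() []string {
	names := make([]string, 0, len(engines))
	for n := range engines {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// Compile compiles pattern with the engine selected by name.
func Compile(engine, pattern string) (Matcher, error) {
	c, ok := engines[engine]
	if !ok {
		return nil, fmt.Errorf("unknown regex engine %q (available: %v)", engine, engineNames())
	}
	return c(pattern)
}

func init() {
	// Default engine: Go's standard library regexp package.
	RegisterEngine("stdlib", func(p string) (Matcher, error) {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, err
		}
		return re, nil
	})
}

func main() {
	m, err := Compile("stdlib", `a+b`)
	if err != nil {
		fmt.Println("compile error:", err)
		return
	}
	fmt.Println(m.MatchString("caab"))

	// An engine that was not compiled into this binary is rejected.
	_, err = Compile("vectorscan", `a+b`)
	fmt.Println(err)
}
```

With this shape, shipping multiple binaries amounts to choosing which engine packages are linked in and which name is selected at startup.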

cloudlinuxadmin commented 1 week ago

Hyperscan works with Intel and AMD CPUs. Its fork, vectorscan, works on and is optimized for Intel, AMD, ARM, and IBM POWER7+ CPUs.

Does that change the decision in any way?

cloudlinuxadmin commented 1 week ago

Oh, and just in case: we have seen a 3x performance improvement with hyperscan/vectorscan on regex matching against files (vs re2). So it is a rather drastic change in both throughput and CPU usage.

jptosso commented 1 week ago

That looks interesting. I will run some tests

corazawaf / coraza

Performance drop with larger request body #1176
