Closed fionera closed 2 weeks ago
Sorry that I didn't update this again, but I am currently working on getting a talk done. I will try to get the other points done by the end of next week.
I think this can be reviewed and tested now. There is still an fd leak when reloading and changing the logger; also, when a reload changes the logger and a later reload reverts that change, the old file can be raced. There should probably be a small wrapper around the logger to prevent multiple writers to the same file, but I guess this can wait.
@fionera Can you solve the conflicts here?
Sure will do that
Updated :)
And now without missing to remove an old file 🗡️
Yeah, can we ensure this gets approved/merged?
FYI: This version is working fine for me. Any reason for the delay?
@jcchavezs Can you engage here?
@sts @jcchavezs Can you do a review? I would love to have this merged to start working on fixing/cleaning up the other issues
I'm in no way competent to judge any of this, but as someone who's wanting to switch from modsecurity to Coraza on my HAProxy setup I'm really glad to see someone making the effort to work on this.
So yeah if you guys could take some time to review this PR 😅
Since we're having stability issues with the current main branch (See: https://github.com/corazawaf/coraza-spoa/issues/97) I've been following this rewrite closely hoping it would get merged and fix our issues.
But since merging has been stalled for a while, I got curious and went ahead with deploying the rewrite branch anyway, putting some of our production traffic against it as a test. Unfortunately, this didn't go well: I couldn't draw any conclusions on stability vs. main because the rewrite branch has a massive memory leak, so I couldn't keep it running for long. After less than two hours, the coraza-spoa daemon's memory use grew above 4 GB, and it also used a lot more CPU (about twice as much or more) than the main branch version. So I ended the experiment.
My setup uses the CRS v4.7.0 plus a few custom exclusion rules to deal with false positives. Response checking is disabled. Config options and rules were identical for this comparison, so the only difference is whether the coraza-spoa daemon binary was built from the rewrite branch or from the main branch.
First, here's a graph of how the overall system's memory consumption developed after enabling the rewrite branch coraza-spoa daemon. There's an obvious memory leak, which forced me to stop the experiment and downgrade to the main branch version after ~2 hrs. After switching back to main, memory use instantly became stable again:
There's also a clear difference in CPU usage; the graph below depicts the CPU consumption of the coraza-spoa daemon process. While running the rewrite branch version, CPU consumption is consistently much higher, and after downgrading to the main branch version it drops back to a much lower level:
What I further noticed is that I very frequently got messages like the following in our /var/log/messages when running the rewrite branch version:
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection
coraza-spoa[3659835]: 2024/09/28 16:13:02 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection
coraza-spoa[3659835]: 2024/09/28 16:13:02 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection
With the main branch coraza-spoa daemon we also sometimes see such "broken pipe" log messages, but they are very rare, happening just every couple of days. While running the rewrite branch version, this happened all the time; some such messages were logged almost every minute. So I'm wondering if there are also functional issues (beyond the resource consumption problems), or if maybe these "broken pipe" errors also cause the memory leak (perhaps memory isn't properly released under these error conditions?).
Unfortunately, I have no clue what the root cause of these issues might be. It might not even be the code in the rewrite branch itself: the rewrite also brings in a lot of newer dependencies, so the memory leak could be caused by a dependency (though that's less likely). So I can only share these observations from our first real-world test, and unfortunately it seems the rewrite branch isn't quite ready for general use yet. But hopefully these remaining issues can still be solved now...
Or if maybe these "broken pipe" errors also cause the memory leak.
Perhaps this could help you out to find the root cause of the issue:
"A "broken pipe" error typically indicates that a process was attempting to write data to another process or socket, but the receiving process was closed or became unavailable before the data transfer was complete. This can occur for various reasons, including:
Network Issues: If the communication was happening over a network, a drop in network connectivity or a closed connection can result in a broken pipe.
Process Termination: If a receiving process crashes, exits, or is terminated unexpectedly, any ongoing write operation to that process will fail with a broken pipe error.
Resource Limits: Resource limitations like memory or file descriptors reaching their maximum capacity can cause one process to close, resulting in a broken pipe.
Timeouts: The receiving process may have a timeout set and, upon expiration, could close the connection while data is still being sent."
Did you test with the code from before the fixes I pushed a couple of days ago? I found the resource leak, which was caused by wrong usage of defer statements. The broken pipe issues happen when haproxy decides to close the connection because of either too-long evaluation times or too many requests. I am currently investigating why exactly haproxy does that and will keep you updated.
Did you test with the code from before the fixes I pushed a couple of days ago?
Yes, I used the latest version available (commit 6ea3e43)
I cannot reproduce this behaviour on my side. Can you run the newest commit with the memory profile flag?
Can you try with the newest master? I fixed the disconnect handling and added a memprofile flag. If the memory leak is still happening I would love to get a pprof to debug it further. Thank you :)
I've tried the latest rewrite branch code and I have partially good news: the memory leak is gone; I couldn't reproduce it anymore. CPU usage was also much better, at around the same level as the current main branch version. So it looks like the memory leak was also eating CPU cycles, which is why CPU usage was higher before. The resource usage issues seem to be solved now, no worse than the main branch anymore.
However, I'm still frequently seeing "broken pipe" and similar output in /var/log/messages, which doesn't happen when running the code from the main branch:
2024-09-30T17:26:00.221389+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:26:00 disconnect frame with code 3: frame is too big
2024-09-30T17:26:00.221466+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:26:00 disconnect frame with code 3: frame is too big
2024-09-30T17:27:46.019387+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:27:46 write tcp 127.0.0.1:9000->127.0.0.1:46636: write: broken pipe
2024-09-30T17:27:46.019485+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:27:46 disconnect frame with code 3: frame is too big
2024-09-30T17:27:46.019564+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:27:46 write tcp 127.0.0.1:9000->127.0.0.1:46636: use of closed network connection
2024-09-30T17:29:05.046278+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:29:05 disconnect frame with code 3: frame is too big
2024-09-30T17:29:05.046511+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:29:05 disconnect frame with code 3: frame is too big
2024-09-30T17:29:37.683271+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:29:37 disconnect frame with code 3: frame is too big
2024-09-30T17:29:37.683348+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:29:37 write tcp 127.0.0.1:9000->127.0.0.1:59350: write: broken pipe
This is just a short snippet from the log, but it's the same three errors ("frame is too big", "use of closed network connection" and "broken pipe") that repeat again and again. "frame is too big" seems to be new; I don't think I saw it when I tested the earlier version that still had the memory leak.
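For what it's worth, an SPOP "frame is too big" disconnect usually means one peer tried to send a frame larger than the maximum size negotiated during the hello handshake; large request or response headers/bodies are a typical trigger. One thing that may be worth experimenting with is HAProxy's max-frame-size directive in the spoe-agent section (the value below is only an example, and whether it helps here depends on what the agent advertises during negotiation):

```
spoe-agent coraza-agent
    # example only: raise the maximum SPOP frame size offered to the
    # agent; the effective value is bounded by tune.bufsize minus the
    # frame header overhead, and by what the agent itself accepts
    max-frame-size 65536
```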
NOTE: I still tested all of this with go 1.21 because I don't have a newer version on AlmaLinux 9 and was too lazy to build a newer one from source. So I reverted your latest commit d0ba9f8ba405a8aada589de7fb2e4d6472a3843d and also changed the go version in go.mod back to 1.21 to make it work. I don't know if that's a problem, so let me know if there's a reason go >= 1.23.1 is really required...
Had it running for the last 20 days without any major issues (with go 1.22.3):
I get an error when result processing is enabled (response_check: true):
9:49PM PNC Error handling request error="response id is empty"
spoe-agent coraza-agent
    messages coraza-req coraza-res
    option var-prefix coraza
    option set-on-error error
    timeout hello 2s
    timeout idle 2m
    timeout processing 500ms
    use-backend coraza-spoa
    log global

spoe-message coraza-req
    args app=str(test_app) src-ip=src src-port=src_port dst-ip=dst dst-port=dst_port method=method path=path query=query version=req.ver headers=req.hdrs body=req.body
    event on-frontend-http-request

spoe-message coraza-res
    args app=str(test_app) id=var(txn.e2e.id) version=res.ver status=status headers=res.hdrs body=res.body
    event on-http-response
The ID argument is using the wrong var. It would have to be txn.coraza.id
Thank you, it works correctly with the txn.coraza.id parameter. I got confused by the sample configuration example/haproxy/coraza.cfg from this branch.
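For anyone hitting the same "response id is empty" error, the corrected coraza-res message from the config above looks like this (only the id argument changes; the variable prefix must match the agent's option var-prefix coraza):

```
spoe-message coraza-res
    # txn.e2e.id is wrong here: with "option var-prefix coraza" the
    # transaction id variable is txn.coraza.id
    args app=str(test_app) id=var(txn.coraza.id) version=res.ver status=status headers=res.hdrs body=res.body
    event on-http-response
```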
Merged. Work will continue from here; v0.1 was created to keep the old version. This will be tagged as 1.0.0-experimental
Another go at updating coraza-spoa to the up-to-date coraza version and a faster spop library. It also uses upstream e2e tests.
Q: Why rewrite instead of incremental changes? A: I didn't have the time or motivation to figure out why some things are the way they are. It was just easier to do it like this.
TODOs:
Other PRs that are similar:
#90
#89
Closes #84 #85 #90 #88 #75