Closed fionera closed 2 weeks ago
Sorry that I didn't update this again, but I am currently working on getting a talk done. I will try to get the other points done by the end of next week.
I think this can be reviewed and tested now. There is still an fd leak when reloading and changing the logger; also, when a reload changes the logger and a later reload reverts that change, the old file can be raced. There should probably be a small wrapper around the logger to prevent multiple writers to the same file, but I guess this can wait.
@fionera Can you solve the conflicts here?
Sure will do that
Updated :)
And now without missing to remove an old file 🗡️
Yeah, can we ensure this gets approved/merged?
FYI: This version is working fine for me. Any reason for the delay?
@jcchavezs Can you engage here?
@sts @jcchavezs Can you do a review? I would love to have this merged to start working on fixing/cleaning up the other issues
I'm in no way competent to judge any of this, but as someone who's wanting to switch from modsecurity to Coraza on my HAProxy setup I'm really glad to see someone making the effort to work on this.
So yeah if you guys could take some time to review this PR 😅
Since we're having stability issues with the current main branch (See: https://github.com/corazawaf/coraza-spoa/issues/97) I've been following this rewrite closely hoping it would get merged and fix our issues.
But since merging has been stalled for a while, I got curious and went ahead with deploying the rewrite branch anyway, putting some of our production traffic against it as a test. Unfortunately, this didn't go well: I couldn't draw any conclusions on stability vs. main because the rewrite branch has a massive memory leak, so I couldn't keep it running for long. After less than two hours, the coraza-spoa daemon's memory use grew above 4 GB, and it also used a lot more CPU (about twice as much or more) than the main branch version. So I ended the experiment.
My setup uses the CRS v4.7.0 plus a few custom exclusion rules to deal with false positives. Response checking is disabled. Config options and rules were identical for this comparison, so the only difference is whether the coraza-spoa daemon binary was built from the rewrite branch or from the main branch.
First, here's a graph of how the overall system's memory consumption developed after enabling the rewrite branch coraza-spoa daemon. There's an obvious memory leak, which forced me to stop the experiment and downgrade to the main branch version after ~2 hrs. After switching back to main, memory use instantly became stable again:
There's also a clear difference in CPU usage; the graph below depicts the CPU consumption of the coraza-spoa daemon process. While running the rewrite branch version, CPU consumption is consistently much higher, and after downgrading to the main branch version it drops back to a much lower level:
What I further noticed is that I very frequently got messages like the following in our /var/log/messages when running the rewrite branch version:
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection
coraza-spoa[3659835]: 2024/09/28 16:13:02 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection
coraza-spoa[3659835]: 2024/09/28 16:13:02 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection
With the main branch coraza-spoa daemon we also sometimes see such "broken pipe" log messages, but they are very rare, happening just every couple of days. While running the rewrite branch version, this happened all the time; some such messages were logged almost every minute. So I'm wondering if there are also functional issues (beyond the resource consumption problems), or if maybe these "broken pipe" errors also cause the memory leak (perhaps memory isn't properly released under these error conditions?).
Unfortunately, I have no clue what the root cause of these issues might be. It might not even be the code in the rewrite branch itself: the rewrite also brings in a lot of newer dependencies, so the memory leak could be caused by a dependency (though that's less likely). So I can only share these observations from our first real-world test, and unfortunately it seems the rewrite branch isn't quite ready for general use yet. But hopefully these remaining issues can still be solved now...
Or if maybe these "broken pipe" errors also cause the memory leak.
Perhaps this could help you out to find the root cause of the issue:
"A "broken pipe" error typically indicates that a process was attempting to write data to another process or socket, but the receiving process was closed or became unavailable before the data transfer was complete. This can occur for various reasons, including:
Network Issues: If the communication was happening over a network, a drop in network connectivity or a closed connection can result in a broken pipe.
Process Termination: If a receiving process crashes, exits, or is terminated unexpectedly, any ongoing write operation to that process will fail with a broken pipe error.
Resource Limits: Resource limitations like memory or file descriptors reaching their maximum capacity can cause one process to close, resulting in a broken pipe.
Timeouts: The receiving process may have a timeout set and, upon expiration, could close the connection while data is still being sent."
Did you test with the code from before the fixes I pushed a couple of days ago? I found the resource leak, which was caused by wrong usage of defer statements. The broken pipe issues happen when haproxy decides to close the connection because of either too-long evaluation times or too many requests. I am currently investigating why exactly haproxy does that and will keep you updated.
Did you test with the code from before the fixes I pushed a couple of days ago?
Yes, I used the latest version available (commit 6ea3e43)
I cannot reproduce this behaviour on my side. Can you run the newest commit with the memory profile flag?
Can you try with the newest master? I fixed the disconnect handling and added a memprofile flag. If the memory leak is still happening I would love to get a pprof to debug it further. Thank you :)
I've tried the latest rewrite branch code and I have partially good news: the memory leak is gone; I couldn't reproduce it anymore. CPU usage was also much better, at around the same level as the current main branch version. So it looks like the memory leak was also eating CPU cycles, which is why CPU usage was higher before. The resource usage issues seem to be solved now, no worse than the main branch anymore.
However, I'm still frequently seeing "broken pipe" and similar output in /var/log/messages, which doesn't happen when running the code from the main branch:
2024-09-30T17:26:00.221389+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:26:00 disconnect frame with code 3: frame is too big
2024-09-30T17:26:00.221466+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:26:00 disconnect frame with code 3: frame is too big
2024-09-30T17:27:46.019387+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:27:46 write tcp 127.0.0.1:9000->127.0.0.1:46636: write: broken pipe
2024-09-30T17:27:46.019485+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:27:46 disconnect frame with code 3: frame is too big
2024-09-30T17:27:46.019564+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:27:46 write tcp 127.0.0.1:9000->127.0.0.1:46636: use of closed network connection
2024-09-30T17:29:05.046278+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:29:05 disconnect frame with code 3: frame is too big
2024-09-30T17:29:05.046511+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:29:05 disconnect frame with code 3: frame is too big
2024-09-30T17:29:37.683271+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:29:37 disconnect frame with code 3: frame is too big
2024-09-30T17:29:37.683348+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:29:37 write tcp 127.0.0.1:9000->127.0.0.1:59350: write: broken pipe
This is just a short snippet from the log, but it's the same three errors ("frame is too big", "use of closed network connection" and "broken pipe") that repeat again and again. "frame is too big" seems to be new; I don't think I saw it when I tested the earlier version that still had the memory leak.
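For what it's worth, an SPOP "frame is too big" disconnect usually means one peer tried to send a frame larger than the maximum size negotiated during the hello handshake; large request or response headers/bodies are a typical trigger. One thing that may be worth experimenting with is HAProxy's max-frame-size directive in the spoe-agent section (the value below is only an example, and whether it helps here depends on what the agent advertises during negotiation):

```
spoe-agent coraza-agent
    # example only: raise the maximum SPOP frame size offered to the
    # agent; the effective value is bounded by tune.bufsize minus the
    # frame header overhead, and by what the agent itself accepts
    max-frame-size 65536
```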
NOTE: I still tested all of this with go 1.21 because I don't have a newer version on AlmaLinux 9 and was too lazy to build a newer one from source. So I reverted your latest commit d0ba9f8ba405a8aada589de7fb2e4d6472a3843d and also changed the go version in go.mod back to 1.21 to make it work. I don't know if that's a problem, so let me know if there's a reason go >= 1.23.1 is really required...
Had it running for the last 20 days without any major issues (with go 1.22.3):
I get an error when result processing is enabled (response_check: true):
9:49PM PNC Error handling request error="response id is empty"
spoe-agent coraza-agent
    messages coraza-req coraza-res
    option var-prefix coraza
    option set-on-error error
    timeout hello 2s
    timeout idle 2m
    timeout processing 500ms
    use-backend coraza-spoa
    log global

spoe-message coraza-req
    args app=str(test_app) src-ip=src src-port=src_port dst-ip=dst dst-port=dst_port method=method path=path query=query version=req.ver headers=req.hdrs body=req.body
    event on-frontend-http-request

spoe-message coraza-res
    args app=str(test_app) id=var(txn.e2e.id) version=res.ver status=status headers=res.hdrs body=res.body
    event on-http-response
The ID argument is using the wrong var. It would have to be txn.coraza.id
Thank you, it works correctly with the txn.coraza.id parameter. I got confused by the sample configuration example/haproxy/coraza.cfg from this branch.
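For anyone hitting the same "response id is empty" error, the corrected coraza-res message from the config above looks like this (only the id argument changes; the variable prefix must match the agent's option var-prefix coraza):

```
spoe-message coraza-res
    # txn.e2e.id is wrong here: with "option var-prefix coraza" the
    # transaction id variable is txn.coraza.id
    args app=str(test_app) id=var(txn.coraza.id) version=res.ver status=status headers=res.hdrs body=res.body
    event on-http-response
```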
Merged. Work will continue from here; v0.1 was created to keep the old version. This will be tagged as 1.0.0-experimental
Another go at updating coraza-spoa to the up-to-date coraza version and a faster spop library. It also uses upstream e2e tests.
Q: Why rewrite instead of incremental changes? A: I didn't have the time or motivation to figure out why some things are the way they are. It was just easier to do it like this.
TODOs:
Other PRs that are similar:
#90
#89
Closes #84 #85 #90 #88 #75