Avoid Double Buffering bodies, relying on Coraza buffer.

M4tteoP commented 1 year ago

Currently, for both request and response bodies, we are relying on two different buffers:

Envoy buffer: Calling return types.ActionPause, we are buffering the body and stop sending it upstream. It is intended because before sending data upstream Coraza has to prove that no malicious patterns are found.
Coraza Buffer: With tx.RequestBodyWriter().Write(body) the body is added to Coraza buffer that will later process it.

The biggest downside of this solution is Envoy's buffer default limit of 1MiB, ending with a 413: Payload Too Large if exceeded. Example request: base64 /dev/urandom | head -c 1200000 > /tmp/file.txt && curl -i 'localhost:8080/anything' -d @/tmp/file.txt

Possible solutions:

Let Coraza buffer the body: Don't buffer Envoy side, but only Coraza side. After the process completes restore the body buffer from the Coraza buffer with RequestBodyReader and append it to the Envoy buffer only if it is okay to send it upstream (AppendHttpRequestBody). In order to implement it we have to:
- Retrieve with GetHttpRequestBody each body frame and add it to the BodyWriter.
- Discard the received frame (Like types.ActionPause but without buffering)
- Populate it back if no interrupt is triggered.
Streaming body analysis: Evaluate upstream (Coraza) the feasibility of implementing a streaming analysis without buffering the whole body.

The same considerations apply to the Response body.

thekief commented 1 year ago

Has there been any progress on this? The streaming body analysis sounds rather awesome and probably addresses a whole another array of problems that is caused by the buffering (e.g. timeouts) for bigger files.

jcchavezs commented 1 year ago

We are changing how we are dealing with buffers. From now on we will only buffer up to Coraza limit (around 100kb) and then let the bodies go straight to upstream. In terms of buffering that is good enough as the buffer is going to be deterministic. Is that good enough for you or you re looking for analysing huge payloads as well?

In any case it would be cool to know if you are using this extension already and what sort of issues you are experiencing besides the timeouts.

On Fri, 20 Jan 2023, 11:52 thekief, @.***> wrote:

Has there been any progress on this? The streaming body analysis sounds rather awesome and probably addresses a whole another array of problems that is caused by the buffering (e.g. timeouts) for bigger files.

— Reply to this email directly, view it on GitHub https://github.com/corazawaf/coraza-proxy-wasm/issues/97#issuecomment-1398215263, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAXOYARW7UTM4GJPKJYXQL3WTJU5LANCNFSM6AAAAAASG2NPFI . You are receiving this because you are subscribed to this thread.Message ID: @.***>

jcchavezs commented 1 year ago

Going forward with truly stream analysis I think we are far away from that. Many rules will return false positive/negative depending on where you split the body, e.g. sending "eva" in one chunk and "l(rm -rf)" in the next one where both have been analysed separately, the only way to get it right is to still evaluate the entire body. I think we just need to do best effort to avoid huge buffering and that would be good enough.

On Fri, 20 Jan 2023, 11:55 José Carlos Chávez, @.***> wrote:

We are changing how we are dealing with buffers. From now on we will only buffer up to Coraza limit (around 100kb) and then let the bodies go straight to upstream. In terms of buffering that is good enough as the buffer is going to be deterministic. Is that good enough for you or you re looking for analysing huge payloads as well?

In any case it would be cool to know if you are using this extension already and what sort of issues you are experiencing besides the timeouts.

On Fri, 20 Jan 2023, 11:52 thekief, @.***> wrote:

Has there been any progress on this? The streaming body analysis sounds rather awesome and probably addresses a whole another array of problems that is caused by the buffering (e.g. timeouts) for bigger files.

— Reply to this email directly, view it on GitHub https://github.com/corazawaf/coraza-proxy-wasm/issues/97#issuecomment-1398215263, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAXOYARW7UTM4GJPKJYXQL3WTJU5LANCNFSM6AAAAAASG2NPFI . You are receiving this because you are subscribed to this thread.Message ID: @.***>

thekief commented 1 year ago

I am currently working on a PoC, so I do not have some concrete data yet but I will update on it in the future.

Regarding the payload size we are currently internally discussing what the best approach would be. I have have some payloads up to 30 MB. Buffering them completely for analysis is they way I am used by ModSecurity but this also has some rather big drawbacks.

Preferably I would use a streaming solution as it would allow to prevent downstream services to run into timeout due to not receiving data until everything is received. Especially with slow upload speed this tends to be a problem.

thekief commented 1 year ago

Could you go into some details of this sentence:

From now on we will only buffer up to Coraza limit (around 100kb) and then let the bodies go straight to upstream.

Does this mean Coraza repeatedly buffers 100 KB, analyses the data, eventually sending it upstream until the processing has finished?

jcchavezs commented 1 year ago

Sorry I meant 10mb. See https://github.com/corazawaf/coraza/blob/9d77c5a2ef50b45bf042ad249817b4c6d2272313/internal/corazawaf/waf.go#L274.

The directive SecRequestBodyLimit "Configures the maximum request body size Coraza will accept for buffering.", hence beyond that limit we won't do any buffering and beyond that limit, you can either reject the request or process partially up to that limit (see SecRequestBodyLimitAction). The rest of the payload should go straight to upstream without anymore buffering.

thekief commented 1 year ago

Alright, so basically the same handling as with ModSecurity.

Regarding a stream approach:

The naive approach would probably be to copy some few kB at the end of one data blob and attach it to the start of the next data blob. Would that be something worthwhile to evaluate? If so, I will check on my team and maybe we can look into adding an initial PR.

jcchavezs commented 1 year ago

Shall we keep this issue open @M4tteoP? I think with current tuning having two buffers does not seem to be a big deal, also not sure if this is feasible from proxy-wasm side.

corazawaf / coraza-proxy-wasm

Avoid Double Buffering bodies, relying on Coraza buffer. #97