Closed hishamhm closed 4 months ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 87.73912%. Comparing base (b19d405) to head (f3134f2).
I spent some time on this; the fix is clear, but I don't think I can have it ready by tomorrow. I initially thought it would be doable, so I wrote it on main, but it is larger than I expected. In the fix I changed the dispatch behavior to not stop the proxy_wasm runloop as before on error, and let an individual dispatch fail while others are still running (and may succeed). We can't really return 500 after only one call failed, that isn't very consistent. We also can't return 500 for "all dispatch failed", that's pretty hard to track and also very arbitrary... I may need to check Envoy behavior. Overall, this behavior change to keep calls running needs more tests, which I haven't written, and it also causes a couple of new failures in the FFI. At that point it makes me want to write the fix against the Lua bridge refactor instead, or else rebasing it will be a big waste of time.
> In the fix I changed the dispatch behavior to not stop the proxy_wasm runloop as before on error, and let an individual dispatch fail while others are still running (and may succeed). We can't really return 500 after only one call failed, that isn't very consistent. We also can't return 500 for "all dispatch failed", that's pretty hard to track and also very arbitrary... I may need to check Envoy behavior.
Yes, regardless of this bug I was going to discuss this behavior with you as well. In principle, a failed dispatch shouldn't fail the whole request; it's something that I'd definitely want to catch and handle in Datakit (one could think of flows that go, "try this API URL, if that fails try this other one", etc.)
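To make the two behaviors discussed above concrete, here is a minimal sketch in plain Rust. It is only a model: `DispatchResult`, `run_all`, `first_success`, and `fake_dispatch` are hypothetical names made up for illustration, and none of this uses the actual proxy-wasm SDK or ngx_wasm_module APIs. It assumes a dispatch can be treated as a fallible call that yields either a response body or an error string.

```rust
// Hypothetical model of a dispatch outcome: a response body or an error message.
// These names are illustrative only; they are not part of any real SDK.
type DispatchResult = Result<String, String>;

/// Run one dispatch per URL and collect every outcome independently.
/// A failed call is recorded as `Err` and does not stop the remaining calls.
fn run_all<F>(urls: &[&str], dispatch: F) -> Vec<DispatchResult>
where
    F: Fn(&str) -> DispatchResult,
{
    urls.iter().map(|&url| dispatch(url)).collect()
}

/// "Try this API URL, if that fails try this other one":
/// return the first successful response, or every error if all attempts fail,
/// leaving it to the caller to decide what "all failed" should mean.
fn first_success<F>(urls: &[&str], dispatch: F) -> Result<String, Vec<String>>
where
    F: Fn(&str) -> DispatchResult,
{
    let mut errors = Vec::new();
    for &url in urls {
        match dispatch(url) {
            Ok(body) => return Ok(body),
            Err(e) => errors.push(format!("{url}: {e}")),
        }
    }
    Err(errors)
}

/// Stand-in for a real network call: URLs containing "down" refuse the connection.
fn fake_dispatch(url: &str) -> DispatchResult {
    if url.contains("down") {
        Err("tcp socket - Connection refused".to_string())
    } else {
        Ok(format!("200 OK from {url}"))
    }
}
```

With this model, `run_all(&["http://down.example", "http://up.example"], fake_dispatch)` yields one `Err` and one `Ok` rather than aborting on the first failure, and `first_success` over the same list falls through to the second URL. Whether the filter or the host should own that fallback policy is exactly the open question in this thread.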
Replaced with #546. @hishamhm I would like the datakit filter to be tested with #546 before merging it, but the FFI breaking change must also be applied to the Gateway (`proxy_wasm.start()` calls are to be removed, but that should be all). cc @flrgh for a heads-up on the upcoming FFI change as well.
This is a minimal reproducer for a segfault I originally triggered with Datakit, where you can get a crash when performing multiple dispatch calls with failing connections. The first dispatch is terminated with `dispatch failed: tcp socket - Connection refused`, and this seems to free resources that are still being used by the second dispatch, leading to memory corruption and a segfault.

This looks similar to #528 (both include multiple dispatches and the Valgrind errors refer to accessing data freed by `ngx_http_finalize_connection`, etc.; both might have the same root cause), but in that case I didn't get `dispatch failed: tcp socket` in the full log.