Kong / ngx_wasm_module

Nginx + WebAssembly
Apache License 2.0
79 stars 7 forks source link

tests(proxy_wasm) check parallel calls with failing connections #545

Closed hishamhm closed 4 months ago

hishamhm commented 4 months ago

This is a minimal reproducer to a segfault I originally triggered with Datakit, where you can get a crash when performing multiple dispatch calls with failing connections. The first dispatch is terminated with dispatch failed: tcp socket - Connection refused, and this seems to free resources that are still being used by the second dispatch, leading to memory corruption and a segfault.

This looks similar to #528 (both include multiple dispatches and the Valgrind errors refer to accessing data freed by ngx_http_finalize_connection, etc.; both might have the same root cause), but in that case I didn't get dispatch failed: tcp socket in the full log.

codecov[bot] commented 4 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 87.73912%. Comparing base (b19d405) to head (f3134f2).

Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/Kong/ngx_wasm_module/pull/545/graphs/tree.svg?width=650&height=150&src=pr&token=T0PT2Q9IAN&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Kong)](https://app.codecov.io/gh/Kong/ngx_wasm_module/pull/545?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Kong) ```diff @@ Coverage Diff @@ ## main #545 +/- ## =================================================== - Coverage 90.05848% 87.73912% -2.31937% =================================================== Files 47 47 Lines 10089 10089 =================================================== - Hits 9086 8852 -234 - Misses 1003 1237 +234 ``` [see 24 files with indirect coverage changes](https://app.codecov.io/gh/Kong/ngx_wasm_module/pull/545/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Kong) | [Flag](https://app.codecov.io/gh/Kong/ngx_wasm_module/pull/545/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Kong) | Coverage Δ | | |---|---|---| | [unit](https://app.codecov.io/gh/Kong/ngx_wasm_module/pull/545/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Kong) | `90.15117% <ø> (ø)` | | | [valgrind](https://app.codecov.io/gh/Kong/ngx_wasm_module/pull/545/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Kong) | `57.87645% <ø> (-20.13898%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Kong#carryforward-flags-in-the-pull-request-comment) to find out more.
thibaultcha commented 4 months ago

I spent some time on this, the fix is clear but I don't think I can have it ready by tomorrow. I initially thought it would be doable so I wrote it on main, but it is larger than I expected. In the fix I changed the dispatch behavior to not stop the proxy_wasm runloop as before on error, and let an individual dispatch fail while others are still running (and may succeed). We can't really return 500 after only one call failed, that isn't very consistent. We also can't return 500 for "all dispatch failed", that's pretty hard to track and also very arbitrary... I may need to check Envoy behavior. Overall this behavior change to keep calls running needs more tests which I haven't written, but it also causes a couple new failures in the FFI as well, at that point it makes me want to write the fix against the Lua bridge refactor instead, or else it will be a big waste of time to rebase it.

hishamhm commented 4 months ago

In the fix I changed the dispatch behavior to not stop the proxy_wasm runloop as before on error, and let an individual dispatch fail while others are still running (and may succeed). We can't really return 500 after only one call failed, that isn't very consistent. We also can't return 500 for "all dispatch failed", that's pretty hard to try and also very arbitrary... I may need to check Envoy behavior.

Yes, regardless of this bug I was going to discuss this behavior with you as well. In principle, failing a dispatch shouldn't trigger a failure, it's something that I'd definitely want to catch and handle in Datakit (one could think of flows that go, "try this API URL, if that fails try this other one", etc.)

thibaultcha commented 4 months ago

Replaced with #546. @hishamhm I would like for the datakit filter to be tested with #546 before merging it, but the FFI breaking change must also be applied to the Gateway (proxy_wasm.start() calls are to be removed, but that should be all). cc @flrgh for heads-up on the upcoming FFI change as well.