Closed thibaultcha closed 3 months ago
Attention: Patch coverage is 88.99614% with 57 lines in your changes missing coverage. Please review.
Project coverage is 90.36951%. Comparing base (277fac6) to head (b9bb644).
@hishamhm I would like for the datakit filter to be tested with https://github.com/Kong/ngx_wasm_module/pull/546 before merging it
@thibaultcha I've just tested it and it's still crashing. Running it enough times I do get segfaults, but every run is reporting memory access errors consistently.
These are the logs from a Gateway run with Valgrind. This particular run didn't crash, but Valgrind reported reads on data that was freed by ngx_proxy_wasm_dispatch_calls_cancel:
pr-546-kong-error.log
pr-546-kong-valgrind.log
This was running a datakit configuration with 2 parallel calls that fail (I didn't spin up the other localhost upstream).
@hishamhm What is the specific reproducible example for this? It does not trigger in the test given with this PR, and seems to trigger when send_local_response is used at some point (as per the Valgrind report). I tried updating that test with:
location /t {
    proxy_wasm hostcalls 'on=request_body \
                          test=/t/dispatch_http_call \
                          host=127.0.0.1:1 \
                          ncalls=2';
    proxy_wasm hostcalls 'on=request_body \
                          test=/t/echo/body';
    echo ok;
}
But it does not trigger any memory issues.
@thibaultcha Sure, let me try to get the same filter crash without the gateway.
Took me a while but I got a consistent repro running without the gateway: https://github.com/Kong/ngx_wasm_module/actions/runs/9133572211/job/25117356096
This branch/commit was intended only for sharing the test case: https://github.com/Kong/ngx_wasm_module/commit/495c13dcd2aaf5273d9e75a121cfec4d8c7793bd (Loading Datakit on Wasmtime with Valgrind takes a couple of minutes, so I enabled the cache in the testcase as a quality-of-life tweak for speeding up debugging runs...)
I think we can eventually isolate the offending proxy-wasm calls and produce a hostcalls-based testcase, but to go one step at a time I wanted to get the error reproduced without the Gateway on Datakit first.
Looking at the logs, I think the triggering condition is to dispatch a call, then in the same handler (in this case, on_request_headers), trigger a local response, before the dispatch callback gets a chance to run.
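To make that ordering concrete, here is a minimal Python model of it. This is only a sketch, not the module's C code: `Request`, `dispatch_call`, `deliver_response` and the `cancelled` flag are hypothetical stand-ins chosen for illustration. It shows a handler that dispatches a call and then sends a local response before the dispatch callback runs, and why the host must never resume a call that was cancelled.

```python
# Toy model of the triggering sequence; NOT ngx_wasm_module code.
# All class and function names here are hypothetical stand-ins.

class Request:
    def __init__(self):
        self.pending = []           # dispatch calls awaiting a response
        self.local_response = None  # status of the local response, if any
        self.log = []

    def dispatch_call(self, name, callback):
        call = {"name": name, "callback": callback, "cancelled": False}
        self.pending.append(call)
        return call

    def send_local_response(self, status):
        # A local response short-circuits the request: every pending call
        # is cancelled so its callback can never touch released state.
        self.local_response = status
        for call in self.pending:
            call["cancelled"] = True
        self.pending.clear()


def on_request_headers(req):
    # Dispatch first, then produce a local response in the same handler.
    call = req.dispatch_call("upstream", lambda: req.log.append("callback ran"))
    req.send_local_response(503)
    return call


def deliver_response(req, call):
    # Later, the event loop sees the (failed) upstream response arrive.
    # A safe host checks the cancelled flag before resuming the filter.
    if call["cancelled"]:
        req.log.append("skipped cancelled call")
        return False
    call["callback"]()
    return True


req = Request()
call = on_request_headers(req)
ran = deliver_response(req, call)
print(req.local_response, ran, req.log)
```

In this model the unsafe variant would be a `deliver_response` that invokes the callback without checking the flag, which corresponds to the reads on freed data that Valgrind reported.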
I think I got a minimized test case. Branch tests/interrupted-dispatch on top of this PR, with this test: https://github.com/Kong/ngx_wasm_module/commit/6cb6899c75b70370fac89f5dc47e246ef5c19ab7
Still waiting for the CI outcome here, but it produces the Valgrind reports locally.
@hishamhm Thanks for the test! This turned out to be yet another thing!
@hishamhm Would you give it another try with the current state of this branch? Except for a small failure in dynamic builds that I'm still investigating, it all runs green, so I hope it fixes everything now.
Ok I fixed the last problem but @flrgh found another problem in the Kong PR which I also am looking at now.
@flrgh Ok, the latest state of this branch should also take care of the Gateway issue. I have the spec/02-integration/20-wasm/ suite passing locally.
Nice! Looking good with 75e101780f09ade9a4bada0aa8340c0f6eb74f97 at my end too.
I'll give this branch a spin later today!
I have found one more bug that I am trying to get rid of before merging this.
@flrgh @hishamhm Merged! I think in time for the Gateway patch release, hopefully.
Major refactor of the Lua bridge to support multiple concurrent yielding Lua threads, and refactor of dispatch calls failures to continue executing the request. Replaces #539, #545, #523.
The new implementation "tricks" OpenResty by scheduling uthreads via C and passing these threads to the OpenResty runloop as if they were created from Lua (via ngx.thread). Because all uthreads must resume their "parent thread" when finished (as per OpenResty's implementation), we schedule a stub "entry thread" whenever we are trying to use the Lua bridge. This entry thread itself does nothing and is collected at request pool cleanup.

List of significant changes for this refactor:

- proxy_wasm.start() FFI function is removed. Only proxy_wasm.attach() is now necessary, and the filter chain is only resumed once the ngx_http_wasm_module rewrite or access phases are entered. Prior, proxy_wasm.start() would resume the filter chain during the ngx_http_lua_module phase handlers, which was incompatible with Lua threads yielding.
- ngx.semaphore API can be used in the Lua bridge. The default Lua resolver now has synchronization enabled.
- sock->env member is now a pointer to the request's env instead of a copy so as to manipulate the env->state control variable.
- wasm_call directive can now yield, which allows for sanity testing of the Lua bridge yielding functionality.
- rctx->resume_handler pointer holds the resume entry point back from yielding facilities into ngx_http_core_run_phases. For now, only the Lua bridge uses it, but other yielding facilities should be refactored to use it so as to factorize our resuming code.

Fix #524
Fix #528
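As a rough illustration of the scheduling scheme in this description, the following Python sketch models a stub entry thread, several yielding user threads, and a resume handler. It is not the module's implementation: Python generators stand in for Lua coroutines, and every name (`RequestCtx`, `spawn`, `run_loop`, `run_phases`, `cleanup`) is hypothetical. In particular, calling the resume handler once all threads have finished is an assumption made for the sketch; the description only states that the handler is the entry point back into phase processing.

```python
# Toy model of the scheduling scheme; NOT the module's code.
# Generators stand in for Lua coroutines; all names are hypothetical.

class RequestCtx:
    def __init__(self):
        self.entry_thread = None    # stub parent for all bridge threads
        self.uthreads = []          # runnable user threads
        self.resume_handler = None  # entry point back into phase processing
        self.log = []


def entry_thread():
    # Does nothing on its own; it exists only so that finished uthreads
    # have a "parent thread" to resume.
    while True:
        _finished = yield


def spawn(rctx, name, steps):
    if rctx.entry_thread is None:
        rctx.entry_thread = entry_thread()
        next(rctx.entry_thread)     # prime the stub so it can be resumed

    def body():
        for i in range(steps):
            rctx.log.append(f"{name} yields ({i})")
            yield                   # e.g. waiting on I/O
        rctx.log.append(f"{name} done")

    rctx.uthreads.append((name, body()))


def run_phases(rctx):
    rctx.log.append("phases resumed")


def run_loop(rctx):
    # Round-robin over runnable uthreads until all have finished.
    while rctx.uthreads:
        name, co = rctx.uthreads.pop(0)
        try:
            next(co)
            rctx.uthreads.append((name, co))  # yielded: run again later
        except StopIteration:
            rctx.entry_thread.send(name)      # resume the parent stub
    # Assumption: hand control back once every thread is done.
    rctx.resume_handler(rctx)


def cleanup(rctx):
    # Models collecting the entry thread at request pool cleanup.
    rctx.entry_thread.close()
    rctx.entry_thread = None


rctx = RequestCtx()
rctx.resume_handler = run_phases
spawn(rctx, "a", 1)
spawn(rctx, "b", 2)
run_loop(rctx)
cleanup(rctx)
print(rctx.log)
```

Keeping a single `resume_handler` slot on the request context means any yielding facility can return to phase processing through the same entry point, which is the factorization the last list item argues for.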