Closed: pchaseh closed this issue 1 month ago
@ravenblackx for comments
I'll look into it.
Does the xds server in question deliver updates as individual changes, or does it work more like "here's a whole new config"?
Can you include the section of the config that relates to the cache filter? (Or at least the file system cache extension part?)
The xDS server re-sends the entire config on each change. I've attached an example of the configuration the control plane serves to Envoy when a single website sits behind the edge proxy (fetched from the config_dump admin endpoint).
configdump-envoy.txt
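For anyone reading along without the attachment: the relevant section is the cache filter entry in the HTTP filter chain. A minimal sketch of what that part of such a config typically looks like (illustrative only; the cache_path and thread_count here are placeholders, not values from the attached dump):

```yaml
http_filters:
- name: envoy.filters.http.cache
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.cache.v3.CacheConfig
    typed_config:
      # The file system cache extension implicated in the backtraces below.
      "@type": type.googleapis.com/envoy.extensions.http.cache.file_system_http_cache.v3.FileSystemHttpCacheConfig
      manager_config:
        thread_pool:
          thread_count: 1   # worker threads for async file operations
      cache_path: /var/cache/envoy
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```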
Awesome, thanks! The fact that it's a full config being pushed means it's probably something I can make sense of rather than something mysterious.
I think this should be fixed now (once you're running a build that includes those two changes). If the issue still happens, please reopen and I'll have another shot!
@ravenblackx After having the cache filter out of my filter chain for 2 weeks, I had no crashes. I upgraded to a96e0f0c2d504a1a0f1780779e9f7cd1b8ad6ae6 today, re-enabled caching, and the problem has returned.
{"log":"[2023-08-29 20:36:39.752][185][debug][cache_filter] [source/extensions/filters/http/cache/cache_filter.cc:98] [Tags: \"ConnectionId\":\"19\",\"StreamId\":\"11794344170990964655\"] CacheFilter::decodeHeaders starting lookup\n","stream":"stdout","time":"2023-08-29T20:36:49.749245541Z"}
{"log":"[2023-08-29 20:36:39.752][90][critical][backtrace] [./source/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x55819d8bd282\n","stream":"stdout","time":"2023-08-29T20:36:49.749248401Z"}
{"log":"[2023-08-29 20:36:39.752][90][critical][backtrace] [./source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):\n","stream":"stdout","time":"2023-08-29T20:36:49.749250742Z"}
{"log":"[2023-08-29 20:36:39.752][90][critical][backtrace] [./source/server/backtrace.h:92] Envoy version: a96e0f0c2d504a1a0f1780779e9f7cd1b8ad6ae6/1.28.0-dev/Clean/RELEASE/BoringSSL\n","stream":"stdout","time":"2023-08-29T20:36:49.749253052Z"}
{"log":"[2023-08-29 20:36:39.754][90][critical][backtrace] [./source/server/backtrace.h:96] #0: __restore_rt [0x7fbfa1237420]-\u003e[0x2a3e02d4e420] ??:0\n","stream":"stdout","time":"2023-08-29T20:36:49.749255402Z"}
{"log":"[2023-08-29 20:36:39.763][7][debug][main] [source/server/server.cc:263] flushing stats\n","stream":"stdout","time":"2023-08-29T20:36:49.749257822Z"}
{"log":"[2023-08-29 20:36:39.763][90][critical][backtrace] [./source/server/backtrace.h:96] #1: Envoy::Extensions::Common::AsyncFiles::AsyncFileActionWithResult\u003c\u003e::execute() [0x55819eeb97fa]-\u003e[0x9d07fa] snapshot.cc:?\n","stream":"stdout","time":"2023-08-29T20:36:49.749260082Z"}
{"log":"[2023-08-29 20:36:39.771][90][critical][backtrace] [./source/server/backtrace.h:96] #2: Envoy::Extensions::Common::AsyncFiles::AsyncFileManagerThreadPool::resolveActions() [0x55819eebb219]-\u003e[0x9d2219] snapshot.cc:?\n","stream":"stdout","time":"2023-08-29T20:36:49.749262782Z"}
{"log":"[2023-08-29 20:36:39.779][90][critical][backtrace] [./source/server/backtrace.h:96] #3: Envoy::Extensions::Common::AsyncFiles::AsyncFileManagerThreadPool::worker() [0x55819eebb11e]-\u003e[0x9d211e] snapshot.cc:?\n","stream":"stdout","time":"2023-08-29T20:36:49.749265412Z"}
{"log":"[2023-08-29 20:36:39.786][90][critical][backtrace] [./source/server/backtrace.h:96] #4: std::__1::__thread_proxy\u003c\u003e() [0x55819eebc3a1]-\u003e[0x9d33a1] snapshot.cc:?\n","stream":"stdout","time":"2023-08-29T20:36:49.749267902Z"}
{"log":"[2023-08-29 20:36:39.786][90][critical][backtrace] [./source/server/backtrace.h:96] #5: start_thread [0x7fbfa122b609]-\u003e[0x2a3e02d42609] ??:0\n","stream":"stdout","time":"2023-08-29T20:36:49.749270402Z"}
I unfortunately don't have permission to re-open this, so if someone else could, that would be great.
Curses, and thanks.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
Still in my queue, but super busy.
Keep on nagging, stalebot.
Still in my queue. Your call is important to us, please wait for the next available operator.
You are [buzzing sound]th in the queue. Please continue to hold for the next available PR.
Your business is important to us and our next available advisor will be with you shortly. Your estimated wait time is [error, number out of bounds].
Might actually have a moment to investigate this today.
Did not turn out to have time to investigate it after all. Still in the queue though!
I don't want to set it no-stalebot because I want to be reminded, but also apologies to anyone else who is subscribed to this thread and just keeps getting this noise. :(
Beep beep.
This might be getting addressed by some of the incidental fixes in https://github.com/envoyproxy/envoy/pull/34595 - there was a lot of bad filter handling. But probably not, because the config-push relationship seems unlikely to be tied into those races.
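To make the suspected class of race concrete, here is a minimal self-contained C++ sketch (my illustration, not Envoy's actual code; WorkerPool and Filter are invented names) of how an action queued on an async-file worker thread can outlive the filter that queued it, which matches the AsyncFileManagerThreadPool frames in the backtrace above:

```cpp
// Sketch of the hazard class under discussion: an action queued on a
// file-worker thread captures a raw pointer to a filter. If the filter
// is destroyed first (e.g. by a config push replacing the filter chain),
// the worker dereferences freed memory inside its action-resolution loop.
#include <condition_variable>
#include <functional>
#include <iostream>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

class WorkerPool {
public:
  ~WorkerPool() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      done_ = true;
    }
    cv_.notify_one();
    thread_.join();
  }
  // Queue an action for the worker thread; returns immediately.
  void post(std::function<void()> action) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      actions_.push(std::move(action));
    }
    cv_.notify_one();
  }

private:
  void run() {
    std::unique_lock<std::mutex> lock(mu_);
    while (true) {
      cv_.wait(lock, [this] { return done_ || !actions_.empty(); });
      if (actions_.empty()) return; // done_ set and queue drained.
      auto action = std::move(actions_.front());
      actions_.pop();
      lock.unlock();
      action(); // By now, whatever the action points at may be gone.
      lock.lock();
    }
  }
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> actions_;
  bool done_ = false;
  std::thread thread_{[this] { run(); }}; // Declared last: members above
                                          // are initialized before it runs.
};

struct Filter {
  void onLookupComplete() { std::cout << "cache lookup done\n"; }
};

int main() {
  WorkerPool pool;
  auto filter = std::make_unique<Filter>();
  // Unsafe capture: a raw pointer with no lifetime guarantee.
  pool.post([f = filter.get()] { f->onLookupComplete(); });
  filter.reset(); // Simulates the config push tearing the filter down.
  // The worker may now run the queued action against freed memory.
}
```

Capturing a std::weak_ptr to the owner, or cancelling queued actions when the filter is destroyed, are the standard ways to close this kind of race; whether that is what actually happens here is exactly what's unconfirmed.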
https://github.com/envoyproxy/envoy/pull/35571 did just the incidental fixes from #34595, and is now on main, which means there's again a chance this is fixed. But not great odds.
Thanks for your efforts. We've unfortunately phased Envoy out of our software stack, so I won't be able to confirm. Hopefully someone with a similar use case will eventually be able to put more recent versions to the test.
I'm now dramatically revamping the whole thing, and in the process I've seen several bugs that will no longer exist after the revamp, so this might be remedied. Since I was never able to repro (short of creating a whole test environment, which I didn't have time to do) and the original reporter no longer cares, I guess this can just be closed until someone else cares whether it still reproduces after the refactor.
Title: FileSystemHTTPCache crash
Note: Envoy's security team was already contacted, and permission to file an issue for a crash was granted:
Description:
Repro steps:
Admin and Stats Output:
Config:
Logs:
Call Stack: