envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

FileSystemHTTPCache crash #29064

Closed pchaseh closed 1 month ago

pchaseh commented 1 year ago

Title: FileSystemHTTPCache crash

Note: Envoy's security team was contacted beforehand, and permission to file a public issue for this crash was granted:

Since the HTTP Cache filter is still a work in progress and not ready for production use, the issue can be fixed in the open. Please open an issue for the problem described above in https://github.com/envoyproxy/envoy.

Description:

We use an xDS management server to dynamically generate configurations and have Envoy (which we use as an edge proxy) pull them. Since incorporating this filter, we have noticed that the Envoy process crashes outright whenever a new configuration is being loaded and a response for an HTTP request arrives at around the same time. The "new configuration" can be as simple as a changed route, and therefore an RDS update, as sketched below.
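
For illustration only, a route change of the kind that produces such an RDS update can be as small as the sketch below; the resource, virtual host, domain, and cluster names are hypothetical placeholders, not values from our real configuration:

# Hypothetical RouteConfiguration resource served over RDS. Editing anything in
# it (e.g. the prefix or the target cluster) causes the management server to
# re-send the route configuration, which is the kind of update that coincides
# with the crashes described here.
name: edge_routes
virtual_hosts:
- name: example_site
  domains: ["example.com"]
  routes:
  - match:
      prefix: "/"
    route:
      cluster: example_backend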

Repro steps:

We don't have clear reproduction steps for this issue; however, as mentioned, it appears to be some kind of race condition in the HTTP cache filter when a response passes through it while Envoy is simultaneously updating its configuration. The docker image debug-dev-9c1dcef98fe3982e26ec9e24ccd19de39862949d is being used, but it should be possible to reproduce in both earlier and later versions.

Admin and Stats Output:

N/A

Config:

Not included because this would be over 20 MB, and the only relevant details should be:

  • I'm using the FileSystemHTTPCache filter for all responses passing through Envoy (a minimal sketch of such a filter section is included after this list)
  • Envoy gets its configuration via an xDS management server
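
For reference, here is a minimal sketch of the relevant http_filters section of the HttpConnectionManager. The type URLs and field names are written against recent upstream docs and should be verified for the Envoy version in use (older builds exposed these protos under v3alpha); the thread count and cache path are placeholder values, not our production settings:

http_filters:
- name: envoy.filters.http.cache
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.cache.v3.CacheConfig
    typed_config:
      # Selects the file system cache implementation whose async file actions
      # show up in the backtrace below.
      "@type": type.googleapis.com/envoy.extensions.http.cache.file_system_http_cache.v3.FileSystemHttpCacheConfig
      manager_config:
        thread_pool:
          thread_count: 1    # placeholder; sizes the AsyncFileManagerThreadPool worker pool
      cache_path: /var/cache/envoy    # placeholder path
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router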

Logs:

crash-redacted-envoy.zip

Call Stack:

{"log":"addr2line: DWARF error: could not find abbrev number 190518\n","stream":"stderr","time":"2023-08-14T19:26:54.500126192Z"}
{"log":"addr2line: DWARF error: could not find abbrev number 190518\n","stream":"stderr","time":"2023-08-14T19:26:54.550418586Z"}
{"log":"addr2line: DWARF error: could not find abbrev number 190518\n","stream":"stderr","time":"2023-08-14T19:26:54.601255321Z"}
{"log":"addr2line: DWARF error: could not find abbrev number 190518\n","stream":"stderr","time":"2023-08-14T19:26:54.653538255Z"}
{"log":"[2023-08-14 19:26:54.379][123][debug][cache_filter] [source/extensions/filters/http/cache/cache_filter.cc:98] [Tags: \"ConnectionId\":\"5\",\"StreamId\":\"510665027642276531\"] CacheFilter::decodeHeaders starting lookup\n","stream":"stdout","time":"2023-08-14T19:27:02.778483393Z"}
{"log":"[2023-08-14 19:26:54.379][38][critical][backtrace] [./source/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x55f0d43f9570\n","stream":"stdout","time":"2023-08-14T19:27:02.778499784Z"}
{"log":"[2023-08-14 19:26:54.379][38][critical][backtrace] [./source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):\n","stream":"stdout","time":"2023-08-14T19:27:02.778502874Z"}
{"log":"[2023-08-14 19:26:54.379][38][critical][backtrace] [./source/server/backtrace.h:92] Envoy version: 9c1dcef98fe3982e26ec9e24ccd19de39862949d/1.28.0-dev/Clean/RELEASE/BoringSSL\n","stream":"stdout","time":"2023-08-14T19:27:02.778505314Z"}
{"log":"[2023-08-14 19:26:54.380][38][critical][backtrace] [./source/server/backtrace.h:96] #0: __restore_rt [0x7fd40395b420]-\u003e[0x29e32e5c7420] ??:0\n","stream":"stdout","time":"2023-08-14T19:27:02.778507784Z"}
{"log":"[2023-08-14 19:26:54.387][38][critical][backtrace] [./source/server/backtrace.h:96] #1: Envoy::Extensions::Common::AsyncFiles::AsyncFileActionWithResult\u003c\u003e::execute() [0x55f0d5d5610a]-\u003e[0x9c210a] snapshot.cc:?\n","stream":"stdout","time":"2023-08-14T19:27:02.778510524Z"}
{"log":"[2023-08-14 19:26:54.393][38][critical][backtrace] [./source/server/backtrace.h:96] #2: Envoy::Extensions::Common::AsyncFiles::AsyncFileManagerThreadPool::resolveActions() [0x55f0d5d57b29]-\u003e[0x9c3b29] snapshot.cc:?\n","stream":"stdout","time":"2023-08-14T19:27:02.778518394Z"}
{"log":"[2023-08-14 19:26:54.402][38][critical][backtrace] [./source/server/backtrace.h:96] #3: Envoy::Extensions::Common::AsyncFiles::AsyncFileManagerThreadPool::worker() [0x55f0d5d57a2e]-\u003e[0x9c3a2e] snapshot.cc:?\n","stream":"stdout","time":"2023-08-14T19:27:02.778521044Z"}
{"log":"[2023-08-14 19:26:54.412][38][critical][backtrace] [./source/server/backtrace.h:96] #4: std::__1::__thread_proxy\u003c\u003e() [0x55f0d5d58cb1]-\u003e[0x9c4cb1] snapshot.cc:?\n","stream":"stdout","time":"2023-08-14T19:27:02.778523694Z"}
{"log":"[2023-08-14 19:26:54.413][38][critical][backtrace] [./source/server/backtrace.h:96] #5: start_thread [0x7fd40394f609]-\u003e[0x29e32e5bb609] ??:0\n","stream":"stdout","time":"2023-08-14T19:27:02.778526254Z"}
yanavlasov commented 1 year ago

@ravenblackx for comments

ravenblackx commented 1 year ago

I'll look into it.

Does the xds server in question deliver updates as individual changes, or does it work more like "here's a whole new config"?

Can you include the section of the config that relates to the cache filter? (Or at least the file system cache extension part?)

pchaseh commented 1 year ago

The xDS server re-sends the entire config on each change. I've attached an example configuration that the control plane serves to Envoy when a single website is behind the edge proxy (fetched from the config_dump admin endpoint): configdump-envoy.txt

ravenblackx commented 1 year ago

Awesome, thanks! The fact that a full config is being pushed means it's probably something I can make sense of rather than something mysterious.

ravenblackx commented 1 year ago

I think this should be fixed now (once merged past those two changes). If the issue still happens, please reopen and I'll have another shot!

pchaseh commented 1 year ago

@ravenblackx After having the cache filter out of my filter chain for 2 weeks, I had no crashes. I upgraded to a96e0f0c2d504a1a0f1780779e9f7cd1b8ad6ae6 today, re-enabled caching, and the problem has returned.

{"log":"[2023-08-29 20:36:39.752][185][debug][cache_filter] [source/extensions/filters/http/cache/cache_filter.cc:98] [Tags: \"ConnectionId\":\"19\",\"StreamId\":\"11794344170990964655\"] CacheFilter::decodeHeaders starting lookup\n","stream":"stdout","time":"2023-08-29T20:36:49.749245541Z"}
{"log":"[2023-08-29 20:36:39.752][90][critical][backtrace] [./source/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x55819d8bd282\n","stream":"stdout","time":"2023-08-29T20:36:49.749248401Z"}
{"log":"[2023-08-29 20:36:39.752][90][critical][backtrace] [./source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):\n","stream":"stdout","time":"2023-08-29T20:36:49.749250742Z"}
{"log":"[2023-08-29 20:36:39.752][90][critical][backtrace] [./source/server/backtrace.h:92] Envoy version: a96e0f0c2d504a1a0f1780779e9f7cd1b8ad6ae6/1.28.0-dev/Clean/RELEASE/BoringSSL\n","stream":"stdout","time":"2023-08-29T20:36:49.749253052Z"}
{"log":"[2023-08-29 20:36:39.754][90][critical][backtrace] [./source/server/backtrace.h:96] #0: __restore_rt [0x7fbfa1237420]-\u003e[0x2a3e02d4e420] ??:0\n","stream":"stdout","time":"2023-08-29T20:36:49.749255402Z"}
{"log":"[2023-08-29 20:36:39.763][7][debug][main] [source/server/server.cc:263] flushing stats\n","stream":"stdout","time":"2023-08-29T20:36:49.749257822Z"}
{"log":"[2023-08-29 20:36:39.763][90][critical][backtrace] [./source/server/backtrace.h:96] #1: Envoy::Extensions::Common::AsyncFiles::AsyncFileActionWithResult\u003c\u003e::execute() [0x55819eeb97fa]-\u003e[0x9d07fa] snapshot.cc:?\n","stream":"stdout","time":"2023-08-29T20:36:49.749260082Z"}
{"log":"[2023-08-29 20:36:39.771][90][critical][backtrace] [./source/server/backtrace.h:96] #2: Envoy::Extensions::Common::AsyncFiles::AsyncFileManagerThreadPool::resolveActions() [0x55819eebb219]-\u003e[0x9d2219] snapshot.cc:?\n","stream":"stdout","time":"2023-08-29T20:36:49.749262782Z"}
{"log":"[2023-08-29 20:36:39.779][90][critical][backtrace] [./source/server/backtrace.h:96] #3: Envoy::Extensions::Common::AsyncFiles::AsyncFileManagerThreadPool::worker() [0x55819eebb11e]-\u003e[0x9d211e] snapshot.cc:?\n","stream":"stdout","time":"2023-08-29T20:36:49.749265412Z"}
{"log":"[2023-08-29 20:36:39.786][90][critical][backtrace] [./source/server/backtrace.h:96] #4: std::__1::__thread_proxy\u003c\u003e() [0x55819eebc3a1]-\u003e[0x9d33a1] snapshot.cc:?\n","stream":"stdout","time":"2023-08-29T20:36:49.749267902Z"}
{"log":"[2023-08-29 20:36:39.786][90][critical][backtrace] [./source/server/backtrace.h:96] #5: start_thread [0x7fbfa122b609]-\u003e[0x2a3e02d42609] ??:0\n","stream":"stdout","time":"2023-08-29T20:36:49.749270402Z"}

Unfortunately I don't have permission to re-open this, so if someone else could, that would be great.

ravenblackx commented 1 year ago

Curses, and thanks.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

ravenblackx commented 1 year ago

Still in my queue, but super busy.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

ravenblackx commented 1 year ago

Keep on nagging, stalebot.

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

ravenblackx commented 11 months ago

Still in my queue. Your call is important to us, please wait for the next available operator.

github-actions[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

ravenblackx commented 10 months ago

you are [buzzing sound]th in the queue. Please continue to hold for the next available PR.

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

ravenblackx commented 9 months ago

Your business is important to us and our next available advisor will be with you shortly. Your estimated wait time is [error, number out of bounds].

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

ravenblackx commented 8 months ago

Might actually have a moment to investigate this today.

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

ravenblackx commented 7 months ago

Did not turn out to have time to investigate it after all. Still in the queue though!

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

ravenblackx commented 6 months ago

I don't want to set it no-stalebot because I want to be reminded, but also apologies to anyone else who is subscribed to this thread and just keeps getting this noise. :(

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

ravenblackx commented 5 months ago

Beep beep.

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

ravenblackx commented 4 months ago

This might be getting addressed by some of the incidental fixes in https://github.com/envoyproxy/envoy/pull/34595 - there was a lot of bad filter handling. But probably not, because the config-push relationship seems unlikely to be tied into those races.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

ravenblackx commented 3 months ago

https://github.com/envoyproxy/envoy/pull/35571 did just the incidental fixes from #34595, and is now on main, which means there's again a chance this is fixed. But not great odds.

pchaseh commented 2 months ago

Thanks for your efforts. We've unfortunately phased Envoy out of our software stack, so I won't be able to confirm. Hopefully someone with a similar use case will eventually be able to put more recent versions to the test.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

ravenblackx commented 1 month ago

I'm now dramatically revamping the whole thing, and in the process I've seen several bugs that will no longer exist after the revamp, so this might be remedied. Since I was never able to repro it (without creating a whole test environment, which I didn't have time to do), and the original reporter no longer needs it, I guess this can be closed until someone else cares whether it still reproduces after the refactor.