eoregua opened this issue 2 years ago
For starters, you had to click through multiple screens that say not to open crash-related issues on GitHub and to email the security list instead. Please do not do this! Thank you.
Given this is unlikely to be a security issue, we can leave this here. Please use an Envoy build with symbols (envoy-debug, etc.) so we can get a better idea of what is going on here.
Hi,
Here are the debug logs:
[2022-05-20 10:01:07.014][18][debug][upstream] [source/common/upstream/strict_dns_cluster.cc:178] DNS refresh rate reset for stub-server-1.test-1, refresh rate 5000 ms
[2022-05-20 10:01:07.630][18][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:72] notification: fd: 1 mask: 80 file: ..data
[2022-05-20 10:01:07.630][18][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:88] matched callback: directory: ..data
[2022-05-20 10:01:07.631][18][debug][config] [source/extensions/transport_sockets/tls/ssl_socket.cc:435] Secret is updated.
[2022-05-20 10:01:07.631][18][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:72] notification: fd: 1 mask: 80 file: ..data
[2022-05-20 10:01:07.631][18][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:88] matched callback: directory: ..data
[2022-05-20 10:01:07.631][18][critical][main] [source/exe/terminate_handler.cc:12] std::terminate called! (possible uncaught exception, see trace)
[2022-05-20 10:01:07.631][18][critical][backtrace] [./source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2022-05-20 10:01:07.631][18][critical][backtrace] [./source/server/backtrace.h:92] Envoy version: af50070ee60866874b0a9383daf9364e884ded22/1.21.1/Clean/RELEASE/BoringSSL
[2022-05-20 10:01:07.641][18][critical][backtrace] [./source/server/backtrace.h:96] #0: Envoy::TerminateHandler::logOnTerminate()::$_0::operator()() [0x56049742748e]
[2022-05-20 10:01:07.649][18][critical][backtrace] [./source/server/backtrace.h:98] #1: [0x560497427359]
[2022-05-20 10:01:07.658][18][critical][backtrace] [./source/server/backtrace.h:96] #2: std::__terminate() [0x5604978d79b3]
[2022-05-20 10:01:07.666][18][critical][backtrace] [./source/server/backtrace.h:98] #3: [0x5604971d20fd]
[2022-05-20 10:01:07.674][18][critical][backtrace] [./source/server/backtrace.h:96] #4: Envoy::Filesystem::WatcherImpl::onInotifyEvent() [0x5604971ba1ac]
[2022-05-20 10:01:07.682][18][critical][backtrace] [./source/server/backtrace.h:96] #5: std::__1::__function::__func<>::operator()() [0x5604971b4ef1]
[2022-05-20 10:01:07.690][18][critical][backtrace] [./source/server/backtrace.h:96] #6: Envoy::Event::FileEventImpl::assignEvents()::$_1::__invoke() [0x5604971b63fc]
[2022-05-20 10:01:07.699][18][critical][backtrace] [./source/server/backtrace.h:96] #7: event_process_active_single_queue [0x5604972c7a90]
[2022-05-20 10:01:07.757][18][critical][backtrace] [./source/server/backtrace.h:96] #8: event_base_loop [0x5604972c64a1]
[2022-05-20 10:01:07.765][18][critical][backtrace] [./source/server/backtrace.h:96] #9: Envoy::Server::InstanceImpl::run() [0x560496b3379c]
[2022-05-20 10:01:07.774][18][critical][backtrace] [./source/server/backtrace.h:96] #10: Envoy::MainCommonBase::run() [0x56049570dea4]
[2022-05-20 10:01:07.782][18][critical][backtrace] [./source/server/backtrace.h:96] #11: Envoy::MainCommon::main() [0x56049570e716]
[2022-05-20 10:01:07.790][18][critical][backtrace] [./source/server/backtrace.h:96] #12: main [0x56049570a8dc]
[2022-05-20 10:01:07.791][18][critical][backtrace] [./source/server/backtrace.h:96] #13: __libc_start_main [0x7f60e01d12bd]
[2022-05-20 10:01:07.791][18][critical][backtrace] [./source/server/backtrace.h:104] Caught Aborted, suspect faulting address 0x38aba00000012
[2022-05-20 10:01:07.791][18][critical][backtrace] [./source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2022-05-20 10:01:07.791][18][critical][backtrace] [./source/server/backtrace.h:92] Envoy version: af50070ee60866874b0a9383daf9364e884ded22/1.21.1/Clean/RELEASE/BoringSSL
[2022-05-20 10:01:07.791][18][critical][backtrace] [./source/server/backtrace.h:96] #0: __restore_rt [0x7f60e05a78c0]
[2022-05-20 10:01:07.799][18][critical][backtrace] [./source/server/backtrace.h:98] #1: [0x560497427359]
[2022-05-20 10:01:07.799][18][critical][backtrace] [./source/server/backtrace.h:96] #2: std::__terminate() [0x5604978d79b3]
[2022-05-20 10:01:07.857][18][critical][backtrace] [./source/server/backtrace.h:98] #3: [0x5604971d20fd]
[2022-05-20 10:01:07.857][18][critical][backtrace] [./source/server/backtrace.h:96] #4: Envoy::Filesystem::WatcherImpl::onInotifyEvent() [0x5604971ba1ac]
[2022-05-20 10:01:07.857][18][critical][backtrace] [./source/server/backtrace.h:96] #5: std::__1::__function::__func<>::operator()() [0x5604971b4ef1]
[2022-05-20 10:01:07.857][18][critical][backtrace] [./source/server/backtrace.h:96] #6: Envoy::Event::FileEventImpl::assignEvents()::$_1::__invoke() [0x5604971b63fc]
[2022-05-20 10:01:07.857][18][critical][backtrace] [./source/server/backtrace.h:96] #7: event_process_active_single_queue [0x5604972c7a90]
[2022-05-20 10:01:07.857][18][critical][backtrace] [./source/server/backtrace.h:96] #8: event_base_loop [0x5604972c64a1]
[2022-05-20 10:01:07.857][18][critical][backtrace] [./source/server/backtrace.h:96] #9: Envoy::Server::InstanceImpl::run() [0x560496b3379c]
[2022-05-20 10:01:07.857][18][critical][backtrace] [./source/server/backtrace.h:96] #10: Envoy::MainCommonBase::run() [0x56049570dea4]
[2022-05-20 10:01:07.857][18][critical][backtrace] [./source/server/backtrace.h:96] #11: Envoy::MainCommon::main() [0x56049570e716]
[2022-05-20 10:01:07.857][18][critical][backtrace] [./source/server/backtrace.h:96] #12: main [0x56049570a8dc]
[2022-05-20 10:01:07.857][18][critical][backtrace] [./source/server/backtrace.h:96] #13: __libc_start_main [0x7f60e01d12bd]
Thanks in advance for the support!
It looks like an uncaught exception, but it's hard to tell what is going on here. If you can reproduce this, can you run it under a debugger and see exactly where the exception is being thrown?
Yes, I found a way to reproduce the issue systematically, so if you share a new binary for debugging I can run it.
You don't need a new binary, just use the existing binary with symbols.
Hi,
I'm able to reproduce it systematically, but only at deployment time.
I believe the faulty scenario is that, at first startup, not all of the watched directories monitored via SDS are available (see the attached logs).
Then something goes wrong when the secrets that were missing at the beginning are later updated correctly:
[2022-05-23 12:45:43.912][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/etc/envoy' file: 'tls-ingress-1-sip-certificate-sds-secret.yaml' fd: 1
[2022-05-23 12:45:43.912][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/etc/envoy' file: 'validation-ingress-1-sip-context-sds-secret.yaml' fd: 1
[2022-05-23 12:45:43.914][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/etc/envoy' file: 'tls-egress-1-ext-certificate-sds-secret.yaml' fd: 1
[2022-05-23 12:45:43.915][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/etc/envoy' file: 'validation-egress-1-ext-context-sds-secret.yaml' fd: 1
[2022-05-23 12:45:43.961][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/etc/envoy' file: 'tls-ingress-1-ext-certificate-sds-secret.yaml' fd: 1
[2022-05-23 12:45:43.962][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/etc/envoy' file: 'validation-ingress-1-ext-context-sds-secret.yaml' fd: 1
[2022-05-23 12:45:43.980][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/etc/envoy' file: 'tls-egress-1-sip-certificate-sds-secret.yaml' fd: 1
[2022-05-23 12:45:43.980][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/etc/envoy' file: 'validation-egress-1-sip-context-sds-secret.yaml' fd: 1
[2022-05-23 12:45:43.953][16][debug][config] [source/common/config/filesystem_subscription_impl.cc:61] Filesystem config refresh for /etc/envoy/tls-ingress-1-sip-certificate-sds-secret.yaml
[2022-05-23 12:45:43.954][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/certs/sipTLScerts-ingress-1' file: '' fd: 1
[2022-05-23 12:45:43.954][16][debug][misc] [source/common/filesystem/posix/filesystem_impl.cc:158] Unable to determine canonical path for /certs/sipTLScerts-ingress-1/tls.crt: No such file or directory
[2022-05-23 12:45:43.954][16][warning][config] [source/common/config/filesystem_subscription_impl.cc:43] Filesystem config update rejected: Invalid path: /certs/sipTLScerts-ingress-1/tls.cr
[2022-05-23 12:45:43.982][16][debug][config] [source/common/config/filesystem_subscription_impl.cc:61] Filesystem config refresh for /etc/envoy/tls-egress-1-sip-certificate-sds-secret.yaml
[2022-05-23 12:45:43.982][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/certs/sipTLScerts-egress-1' file: '' fd: 1
[2022-05-23 12:45:43.982][16][debug][misc] [source/common/filesystem/posix/filesystem_impl.cc:158] Unable to determine canonical path for /certs/sipTLScerts-egress-1/tls.crt: No such file or directory
[2022-05-23 12:45:43.982][16][warning][config] [source/common/config/filesystem_subscription_impl.cc:43] Filesystem config update rejected: Invalid path: /certs/sipTLScerts-egress-1/tls.crt
[2022-05-23 12:45:43.992][16][debug][config] [source/common/config/filesystem_subscription_impl.cc:71] Filesystem config update accepted for /etc/envoy/validation-ingress-1-sip-context-sds-secret.yaml: resources {
[type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret] {
name: "validation-ingress-1-context-sip-sds"
validation_context {
trusted_ca {
filename: "/certs/sipTLSca-ingress-1/tls.crt"
}
watched_directory {
path: "/certs/sipTLSca-ingress-1"
}
}
}
}
[2022-05-23 12:46:48.568][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:72] notification: fd: 1 mask: 80 file: ..data
[2022-05-23 12:46:48.568][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:88] matched callback: directory: ..data
[2022-05-23 12:46:48.568][16][debug][config] [source/extensions/transport_sockets/tls/ssl_socket.cc:389] Secret is updated.
[2022-05-23 12:46:48.568][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:72] notification: fd: 1 mask: 80 file: ..data
[2022-05-23 12:46:48.568][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:88] matched callback: directory: ..data
[2022-05-23 12:46:48.568][16][debug][config] [source/extensions/transport_sockets/tls/ssl_socket.cc:435] Secret is updated.

[log_crash.txt](https://github.com/envoyproxy/envoy/files/8755113/log_crash.txt)
Sorry I don't immediately know what is going on. The problem with the trace is that you don't see where the exception was thrown from, so you might need to attach a debugger or get a core dump.
Issue reproduced running the envoy process in a debugger!
Please do your best to narrow down this issue; it is systematically reproducible in the scenario described in my previous comment. Thanks a lot!
[2022-05-26 19:37:18.196][34][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:72] notification: fd: 1 mask: 80 file: ..data
[2022-05-26 19:37:18.196][34][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:88] matched callback: directory: ..data
[2022-05-26 19:37:18.196][34][critical][main] [source/exe/terminate_handler.cc:12] std::terminate called! (possible uncaught exception, see trace)
[2022-05-26 19:37:18.196][34][critical][backtrace] [./source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2022-05-26 19:37:18.196][34][critical][backtrace] [./source/server/backtrace.h:92] Envoy version: af50070ee60866874b0a9383daf9364e884ded22/1.21.1/Clean/RELEASE/BoringSSL
[2022-05-26 19:37:18.206][34][critical][backtrace] [./source/server/backtrace.h:96] #0: Envoy::TerminateHandler::logOnTerminate()::$_0::operator()() [0x55555826048e]
[2022-05-26 19:37:18.214][34][critical][backtrace] [./source/server/backtrace.h:98] #1: [0x555558260359]
[2022-05-26 19:37:18.222][34][critical][backtrace] [./source/server/backtrace.h:96] #2: std::__terminate() [0x5555587109b3]
[2022-05-26 19:37:18.231][34][critical][backtrace] [./source/server/backtrace.h:98] #3: [0x55555800b0fd]
[2022-05-26 19:37:18.239][34][critical][backtrace] [./source/server/backtrace.h:96] #4: Envoy::Filesystem::WatcherImpl::onInotifyEvent() [0x555557ff31ac]
[2022-05-26 19:37:18.247][34][critical][backtrace] [./source/server/backtrace.h:96] #5: std::__1::__function::__func<>::operator()() [0x555557fedef1]
[2022-05-26 19:37:18.256][34][critical][backtrace] [./source/server/backtrace.h:96] #6: Envoy::Event::FileEventImpl::assignEvents()::$_1::__invoke() [0x555557fef3fc]
[2022-05-26 19:37:18.264][34][critical][backtrace] [./source/server/backtrace.h:96] #7: event_process_active_single_queue [0x555558100a90]
[2022-05-26 19:37:18.273][34][critical][backtrace] [./source/server/backtrace.h:96] #8: event_base_loop [0x5555580ff4a1]
[2022-05-26 19:37:18.282][34][critical][backtrace] [./source/server/backtrace.h:96] #9: Envoy::Server::InstanceImpl::run() [0x55555796c79c]
[2022-05-26 19:37:18.341][34][critical][backtrace] [./source/server/backtrace.h:96] #10: Envoy::MainCommonBase::run() [0x555556546ea4]
[2022-05-26 19:37:18.349][34][critical][backtrace] [./source/server/backtrace.h:96] #11: Envoy::MainCommon::main() [0x555556547716]
[2022-05-26 19:37:18.357][34][critical][backtrace] [./source/server/backtrace.h:96] #12: main [0x5555565438dc]
[2022-05-26 19:37:18.358][34][critical][backtrace] [./source/server/backtrace.h:96] #13: __libc_start_main [0x7ffff70972bd]
Thread 1 "envoy" received signal SIGABRT, Aborted.
__GI_raise (sig=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) quit
A debugging session is active.
Inferior 1 [process 34] will be killed.
I already told you I can't tell what is going on from the stack trace. So either provide a repro that I can run, or you will need to debug further. You can tell gdb to break when an exception is thrown via something like `catch throw` or something along those lines.
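For reference, a minimal sequence with standard gdb commands (the arguments after run are whatever your deployment passes to envoy):

(gdb) catch throw
(gdb) run -c /etc/envoy/envoy-custom.yaml <your other flags>
(gdb) bt

The catchpoint stops the process at the throw site, so bt then shows exactly where the exception originates rather than where std::terminate is reached.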
bash-4.4# gdb envoy core.envoy.17
GNU gdb (GDB; SUSE Linux Enterprise 15) 11.1
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://bugs.opensuse.org/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from envoy...
warning: Could not find DWO CU bazel-out/k8-opt/bin/source/exe/_objs/envoy_main_entry_lib/main.dwo(0xcfea67a77d0e98b3) referenced by CU at offset 0x0 [in module /usr/local/bin/envoy]
[New LWP 17]
[New LWP 18]
[New LWP 19]
[New LWP 24]
[New LWP 32]
[New LWP 40]
[New LWP 21]
[New LWP 29]
[New LWP 25]
[New LWP 50]
[New LWP 45]
[New LWP 33]
[New LWP 47]
[New LWP 51]
[New LWP 27]
[New LWP 28]
[New LWP 46]
[New LWP 35]
[New LWP 49]
[New LWP 39]
[New LWP 37]
[New LWP 30]
[New LWP 20]
[New LWP 34]
[New LWP 48]
[New LWP 44]
[New LWP 43]
[New LWP 55]
[New LWP 54]
[New LWP 23]
[New LWP 31]
[New LWP 38]
[New LWP 53]
[New LWP 36]
[New LWP 26]
[New LWP 41]
[New LWP 42]
[New LWP 52]
[New LWP 22]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--
Core was generated by ` -c /etc/envoy/envoy-custom.yaml --service-cluster test --service-node id_1 --l'.
Program terminated with signal SIGABRT, Aborted.
#0 raise (
warning: Could not find DWO CU bazel-out/k8-opt/bin/source/common/signal/_objs/sigaction_lib/signal_action.dwo(0xedf9c1b8e22d486f) referenced by CU at offset 0x19bfb92 [in module /usr/local/bin/envoy]
sig=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7fb8f476fd00 (LWP 17))]
(gdb) bt
#0 raise (sig=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x000056349637cb56 in Envoy::SignalAction::sigHandler(int, siginfo_t*, void*) () at source/common/signal/signal_action.cc:53
#2 <signal handler called>
#3 __GI_raise (sig=6) at ../sysdeps/unix/sysv/linux/raise.c:51
warning: Could not find DWO CU bazel-out/k8-opt/bin/source/exe/_objs/terminate_handler_lib/terminate_handler.dwo(0xd1a2167592af9679) referenced by CU at offset 0x19c4fc8 [in module /usr/local/bin/envoy]
#4 0x00007fb8f382b375 in __GI_abort () at abort.c:79
#5 0x00005634964d049b in Envoy::TerminateHandler::logOnTerminate() const::$_0::operator()() const () at source/exe/terminate_handler.cc:14
#6 0x00005634964d0359 in Envoy::TerminateHandler::logOnTerminate() const::$_0::__invoke() () at source/exe/terminate_handler.cc:11
#7 0x00005634969809b3 in std::__terminate(void (*)()) ()
#8 0x0000563496982746 in __cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) ()
warning: Could not find DWO CU bazel-out/k8-opt/bin/source/exe/_objs/main_common_lib/main_common.dwo(0x6f7a243009a35142) referenced by CU at offset 0x30 [in module /usr/local/bin/envoy]
#9 0x00005634969826e0 in __cxa_throw ()
warning: Could not find DWO CU bazel-out/k8-opt/bin/source/common/config/_objs/watched_directory_lib/watched_directory.dwo(0x3c33374e07e1fa8) referenced by CU at offset 0x19bd1f2 [in module /usr/local/bin/envoy]
#10 0x00005634947c4c52 in std::__1::__throw_bad_function_call() () at /opt/llvm/bin/../include/c++/v1/functional:1436
warning: Could not find DWO CU bazel-out/k8-opt/bin/source/common/filesystem/_objs/watcher_lib/watcher_impl.dwo(0x23f2f24caae5f096) referenced by CU at offset 0x19bcfea [in module /usr/local/bin/envoy]
#11 0x000056349627b0fd in std::__1::__function::__func<Envoy::Config::WatchedDirectory::WatchedDirectory(envoy::config::core::v3::WatchedDirectory const&, Envoy::Event::Dispatcher&)::$_0, std::__1::allocator<Envoy::Config::WatchedDirectory::WatchedDirectory(envoy::config::core::v3::WatchedDirectory const&, Envoy::Event::Dispatcher&)::$_0>, void (unsigned int)>::operator()(unsigned int&&) ()
at /opt/llvm/bin/../include/c++/v1/functional:1884
warning: Could not find DWO CU bazel-out/k8-opt/bin/source/common/event/_objs/dispatcher_lib/dispatcher_impl.dwo(0x30077016b56e988d) referenced by CU at offset 0x19bceee [in module /usr/local/bin/envoy]
#12 0x00005634962631ac in Envoy::Filesystem::WatcherImpl::onInotifyEvent() () at /opt/llvm/bin/../include/c++/v1/functional:1885
warning: Could not find DWO CU bazel-out/k8-opt/bin/source/common/event/_objs/dispatcher_lib/file_event_impl.dwo(0xc981e749aa6fde53) referenced by CU at offset 0x19bcf22 [in module /usr/local/bin/envoy]
#13 0x000056349625def1 in std::__1::__function::__func<Envoy::Event::DispatcherImpl::createFileEvent(int, std::__1::function<void (unsigned int)>, Envoy::Event::FileTriggerType, unsigned int)::$_5, std::__1::allocator<Envoy::Event::DispatcherImpl::createFileEvent(int, std::__1::function<void (unsigned int)>, Envoy::Event::FileTriggerType, unsigned int)::$_5>, void (unsigned int)>::operator()(unsigned int&&) () at /opt/llvm/bin/../include/c++/v1/functional:1885
#14 0x000056349625f3fc in Envoy::Event::FileEventImpl::assignEvents(unsigned int, event_base*)::$_1::__invoke(int, short, void*) () at /opt/llvm/bin/../include/c++/v1/functional:1885
#15 0x0000563496370a90 in event_process_active_single_queue ()
warning: Could not find DWO CU bazel-out/k8-opt/bin/source/server/_objs/server_lib/server.dwo(0xd79a665334c5065d) referenced by CU at offset 0x19b12d6 [in module /usr/local/bin/envoy]
#16 0x000056349636f4a1 in event_base_loop ()
#17 0x0000563495bdc79c in Envoy::Server::InstanceImpl::run() () at source/server/server.cc:872
#18 0x00005634947b6ea4 in Envoy::MainCommonBase::run() () at source/exe/main_common.cc:173
#19 0x00005634947b7716 in Envoy::MainCommon::main(int, char**, std::__1::function<void (Envoy::Server::Instance&)>) () at ./source/exe/main_common.h:100
#20 0x00005634947b38dc in main () at source/exe/main.cc:24
It seems there is some case in which the inotify event fires before the callback is set.
I'm not sure if we just need a guard around calling the callback or if there is something more fundamentally wrong.
cc @htuch who wrote this code originally as well as @kyessenov and @lambdai who are familiar with the secret loading code.
The watched directory is constructed by setSecret(). The only place we call setSecret() is in sds_api.cc, which leads to setCallback() in the same function / call stack. I'm inclined to conclude there is something more fundamentally wrong if we have a null callback (or at least in my understanding here). Maybe another set of eyes can spot what is up.
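To make the failure mode concrete: in the core dump above, frame #10 is std::__1::__throw_bad_function_call(), reached from the WatchedDirectory lambda in frame #11. Below is a minimal standalone sketch of that mechanism, using illustrative names rather than Envoy's actual classes: a watch handler that unconditionally invokes a std::function member which is still default-constructed when the event fires.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>

// Illustrative stand-in for WatchedDirectory: the handler registered for
// filesystem events captures `this` and calls cb_ unconditionally, but cb_
// stays empty until setCallback() runs. An early event therefore throws
// std::bad_function_call; in Envoy that exception escapes the libevent loop
// and ends in std::terminate, matching the backtraces in this issue.
class WatchedDirectorySketch {
public:
  WatchedDirectorySketch() {
    watch_handler_ = [this](uint32_t /*events*/) { cb_(); };
  }
  void setCallback(std::function<void()> cb) { cb_ = std::move(cb); }
  void simulateInotifyEvent() { watch_handler_(0x80); } // mask 80, as in the logs

private:
  std::function<void(uint32_t)> watch_handler_;
  std::function<void()> cb_; // empty until setCallback()
};

int main() {
  WatchedDirectorySketch dir;
  try {
    dir.simulateInotifyEvent(); // event fires before setCallback(): throws
  } catch (const std::bad_function_call&) {
    std::cout << "bad_function_call: callback invoked before it was set\n";
  }
  dir.setCallback([] { std::cout << "secret update handled\n"; });
  dir.simulateInotifyEvent(); // now safe
  return 0;
}
```

A guard like `if (cb_) { cb_(); }` in the handler would suppress the crash, but per the analysis above it may only mask the missed setCallback().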
Hi,
Any update?
I assume some investigation is ongoing on your side, right?
Can you describe what you do to the files (k8s secrets I assume)? There's probably some race somewhere, which is exacerbated by the excessive watching mentioned in https://github.com/envoyproxy/envoy/issues/21518.
Hi,
I don't believe the scenario is due to excessive directory watching by SDS in Envoy.
As said in a comment above, the issue is reproduced systematically at deployment time.
What happens at deployment in my faulty scenario is that a few of the secrets to be monitored have not been created yet.
So basically Envoy added a directory watch for each of 8 k8s secrets, but for 2 of them the filesystem config update is rejected:
[2022-05-23 12:45:43.912][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/etc/envoy' file: 'tls-ingress-1-sip-certificate-sds-secret.yaml' fd: 1
[2022-05-23 12:45:43.912][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/etc/envoy' file: 'validation-ingress-1-sip-context-sds-secret.yaml' fd: 1
[2022-05-23 12:45:43.914][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/etc/envoy' file: 'tls-egress-1-ext-certificate-sds-secret.yaml' fd: 1
[2022-05-23 12:45:43.915][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/etc/envoy' file: 'validation-egress-1-ext-context-sds-secret.yaml' fd: 1
[2022-05-23 12:45:43.961][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/etc/envoy' file: 'tls-ingress-1-ext-certificate-sds-secret.yaml' fd: 1
[2022-05-23 12:45:43.962][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/etc/envoy' file: 'validation-ingress-1-ext-context-sds-secret.yaml' fd: 1
[2022-05-23 12:45:43.980][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/etc/envoy' file: 'tls-egress-1-sip-certificate-sds-secret.yaml' fd: 1
[2022-05-23 12:45:43.980][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/etc/envoy' file: 'validation-egress-1-sip-context-sds-secret.yaml' fd: 1
[2022-05-23 12:45:43.953][16][debug][config] [source/common/config/filesystem_subscription_impl.cc:61] Filesystem config refresh for /etc/envoy/tls-ingress-1-sip-certificate-sds-secret.yaml
[2022-05-23 12:45:43.954][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/certs/sipTLScerts-ingress-1' file: '' fd: 1
[2022-05-23 12:45:43.954][16][debug][misc] [source/common/filesystem/posix/filesystem_impl.cc:158] Unable to determine canonical path for /certs/sipTLScerts-ingress-1/tls.crt: No such file or directory
[2022-05-23 12:45:43.954][16][warning][config] [source/common/config/filesystem_subscription_impl.cc:43] Filesystem config update rejected: Invalid path: /certs/sipTLScerts-ingress-1/tls.cr
[2022-05-23 12:45:43.982][16][debug][config] [source/common/config/filesystem_subscription_impl.cc:61] Filesystem config refresh for /etc/envoy/tls-egress-1-sip-certificate-sds-secret.yaml
[2022-05-23 12:45:43.982][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:47] added watch for directory: '/certs/sipTLScerts-egress-1' file: '' fd: 1
[2022-05-23 12:45:43.982][16][debug][misc] [source/common/filesystem/posix/filesystem_impl.cc:158] Unable to determine canonical path for /certs/sipTLScerts-egress-1/tls.crt: No such file or directory
[2022-05-23 12:45:43.982][16][warning][config] [source/common/config/filesystem_subscription_impl.cc:43] Filesystem config update rejected: Invalid path: /certs/sipTLScerts-egress-1/tls.crt
Then, after a few seconds, I believe we see an update notification because the 2 missing secrets are created, and this somehow causes the crash:
[2022-05-23 12:47:50.597][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:72] notification: fd: 1 mask: 80 file: ..data
[2022-05-23 12:47:50.597][16][debug][file] [source/common/filesystem/inotify/watcher_impl.cc:88] matched callback: directory: ..data
[2022-05-23 12:47:50.597][16][critical][main] [source/exe/terminate_handler.cc:12] std::terminate called! (possible uncaught exception, see trace)
[2022-05-23 12:47:50.597][16][critical][backtrace] [./source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2022-05-23 12:47:50.597][16][critical][backtrace] [./source/server/backtrace.h:92] Envoy version: af50070ee60866874b0a9383daf9364e884ded22/1.21.1/Clean/RELEASE/BoringSSL
[2022-05-23 12:47:50.612][16][critical][backtrace] [./source/server/backtrace.h:96] #0: Envoy::TerminateHandler::logOnTerminate()::$_0::operator()() [0x560b668ec48e]
[2022-05-23 12:47:50.621][16][critical][backtrace] [./source/server/backtrace.h:98] #1: [0x560b668ec359]
[2022-05-23 12:47:50.630][16][critical][backtrace] [./source/server/backtrace.h:96] #2: std::__terminate() [0x560b66d9c9b3]
[2022-05-23 12:47:50.638][16][critical][backtrace] [./source/server/backtrace.h:98] #3: [0x560b666970fd]
[2022-05-23 12:47:50.653][16][critical][backtrace] [./source/server/backtrace.h:96] #4: Envoy::Filesystem::WatcherImpl::onInotifyEvent() [0x560b6667f1ac]
[2022-05-23 12:47:50.662][16][critical][backtrace] [./source/server/backtrace.h:96] #5: std::__1::__function::__func<>::operator()() [0x560b66679ef1]
[2022-05-23 12:47:50.671][16][critical][backtrace] [./source/server/backtrace.h:96] #6: Envoy::Event::FileEventImpl::assignEvents()::$_1::__invoke() [0x560b6667b3fc]
[2022-05-23 12:47:50.680][16][critical][backtrace] [./source/server/backtrace.h:96] #7: event_process_active_single_queue [0x560b6678ca90]
[2022-05-23 12:47:50.688][16][critical][backtrace] [./source/server/backtrace.h:96] #8: event_base_loop [0x560b6678b4a1]
[2022-05-23 12:47:50.697][16][critical][backtrace] [./source/server/backtrace.h:96] #9: Envoy::Server::InstanceImpl::run() [0x560b65ff879c]
[2022-05-23 12:47:50.755][16][critical][backtrace] [./source/server/backtrace.h:96] #10: Envoy::MainCommonBase::run() [0x560b64bd2ea4]
[2022-05-23 12:47:50.764][16][critical][backtrace] [./source/server/backtrace.h:96] #11: Envoy::MainCommon::main() [0x560b64bd3716]
[2022-05-23 12:47:50.773][16][critical][backtrace] [./source/server/backtrace.h:96] #12: main [0x560b64bcf8dc]
[2022-05-23 12:47:50.774][16][critical][backtrace] [./source/server/backtrace.h:96] #13: __libc_start_main [0x7f5dd31842bd]
[2022-05-23 12:47:50.774][16][critical][backtrace] [./source/server/backtrace.h:104] Caught Aborted, suspect faulting address 0x38aba00000010
[2022-05-23 12:47:50.774][16][critical][backtrace] [./source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2022-05-23 12:47:50.774][16][critical][backtrace] [./source/server/backtrace.h:92] Envoy version: af50070ee60866874b0a9383daf9364e884ded22/1.21.1/Clean/RELEASE/BoringSSL
[2022-05-23 12:47:50.774][16][critical][backtrace] [./source/server/backtrace.h:96] #0: __restore_rt [0x7f5dd355a8c0]
[2022-05-23 12:47:50.782][16][critical][backtrace] [./source/server/backtrace.h:98] #1: [0x560b668ec359]
[2022-05-23 12:47:50.782][16][critical][backtrace] [./source/server/backtrace.h:96] #2: std::__terminate() [0x560b66d9c9b3]
[2022-05-23 12:47:50.791][16][critical][backtrace] [./source/server/backtrace.h:98] #3: [0x560b666970fd]
[2022-05-23 12:47:50.791][16][critical][backtrace] [./source/server/backtrace.h:96] #4: Envoy::Filesystem::WatcherImpl::onInotifyEvent() [0x560b6667f1ac]
[2022-05-23 12:47:50.791][16][critical][backtrace] [./source/server/backtrace.h:96] #5: std::__1::__function::__func<>::operator()() [0x560b66679ef1]
[2022-05-23 12:47:50.791][16][critical][backtrace] [./source/server/backtrace.h:96] #6: Envoy::Event::FileEventImpl::assignEvents()::$_1::__invoke() [0x560b6667b3fc]
[2022-05-23 12:47:50.791][16][critical][backtrace] [./source/server/backtrace.h:96] #7: event_process_active_single_queue [0x560b6678ca90]
[2022-05-23 12:47:50.791][16][critical][backtrace] [./source/server/backtrace.h:96] #8: event_base_loop [0x560b6678b4a1]
[2022-05-23 12:47:50.791][16][critical][backtrace] [./source/server/backtrace.h:96] #9: Envoy::Server::InstanceImpl::run() [0x560b65ff879c]
[2022-05-23 12:47:50.791][16][critical][backtrace] [./source/server/backtrace.h:96] #10: Envoy::MainCommonBase::run() [0x560b64bd2ea4]
[2022-05-23 12:47:50.791][16][critical][backtrace] [./source/server/backtrace.h:96] #11: Envoy::MainCommon::main() [0x560b64bd3716]
[2022-05-23 12:47:50.791][16][critical][backtrace] [./source/server/backtrace.h:96] #12: main [0x560b64bcf8dc]
[2022-05-23 12:47:50.791][16][critical][backtrace] [./source/server/backtrace.h:96] #13: __libc_start_main [0x7f5dd31842bd]
After the initial deployment, if I restart the pod many times, everything starts correctly. I believe this is because the race condition only arises at deployment time, when 2 of the 8 secrets for which SDS is defined are not yet available.
The full log at deployment time is attached here.
@eoregua thanks for the additional details. What exact version of Envoy are you using?
envoy 1.21.1
I see the issue. The loadFiles() is throwing an exception after setSecret() but before the following setCallback(). In the case of a watched directory, I think the callback for future updates should be set before loadFiles(), immediately after setSecret(). This will allow files to be missing initially but then updated later on. I think those are the semantics we want here. That said, I think this needs a careful analysis, since things like secret_data_ won't be updated on an exception and we might enter some problem state between "fully accepted new SDS config" and "rejected".
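To make the proposed reordering concrete, here is a minimal standalone sketch; setSecret()/setCallback()/loadFiles() below are hypothetical stand-ins, not the real sds_api.cc code. Arming the callback before loadFiles() means an initial "Invalid path" rejection leaves a live callback that retries when the missing secret volume shows up:

```cpp
#include <functional>
#include <iostream>
#include <stdexcept>

// Stand-in for the WatchedDirectory update-callback wiring.
struct WatchedDirectorySketch {
  std::function<void()> cb_;
  void setCallback(std::function<void()> cb) { cb_ = std::move(cb); }
  void fireEvent() {
    if (cb_) {
      cb_(); // guarded, per the earlier discussion
    }
  }
};

bool files_present = false; // simulates the k8s secret volume appearing late

void loadFiles() {
  // Mirrors "Filesystem config update rejected: Invalid path: .../tls.crt".
  if (!files_present) {
    throw std::runtime_error("Invalid path: tls.crt does not exist yet");
  }
  std::cout << "files loaded\n";
}

void onConfigUpdate(WatchedDirectorySketch& dir) {
  // setSecret() would run here and construct the WatchedDirectory.
  // Proposed order: arm the callback immediately after setSecret(), *before*
  // loadFiles(), so a throw below no longer leaves the watch callback-less.
  dir.setCallback([] {
    try {
      loadFiles(); // retry when the watched directory changes
    } catch (const std::exception& e) {
      std::cout << "retry failed: " << e.what() << "\n";
    }
  });
  loadFiles(); // with the old ordering, a throw here skipped setCallback()
}

int main() {
  WatchedDirectorySketch dir;
  try {
    onConfigUpdate(dir);
  } catch (const std::exception& e) {
    std::cout << "initial update rejected: " << e.what() << "\n";
  }
  files_present = true; // the missing secret is created a few seconds later
  dir.fireEvent();      // the armed callback reloads instead of crashing
  return 0;
}
```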
@htuch thanks for the detailed info. So I'm assuming a careful analysis of the possible fix is now ongoing, as you said. When you have it, you could also share a BB with me if you want preliminary feedback on the fix.
I don't have cycles to work on a fix here, but the above analysis should be a good start for whoever takes this up.
Hi,
Randomly, but quite frequently, the envoy process gets a segmentation fault just at startup of the pod:
Is this a known issue? Any ideas?