envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.95k stars 4.8k forks source link

Segmentation faults in Lua filter (LuaThreadLocal) #10241

Closed jmuia closed 3 years ago

jmuia commented 4 years ago

Title: Segmentation faults in Lua filter (LuaThreadLocal)

Description: Envoy occasionally crashes, which is unexpected.

Repro steps: Unsure how to reproduce at the moment, it seems spurious (<5 times per day in a QA fleet since collecting core dumps).

It has been observed in:

We haven't collected core dumps from other versions.

Config: We have various Lua filters configured.

Call Stack: From the 1.12.2 build described above:

    #0  Envoy::ThreadLocal::Slot::getTyped<Envoy::Extensions::Filters::Common::Lua::ThreadLocalState::LuaThreadLocal> (this=0xe) at bazel-out/k8-opt/bin/external/envoy/include/envoy/thread_local/_virtual_includes/thread_local_interface/envoy/thread_local/thread_local.h:51
    #1  0x000055e16e7980ec in Envoy::Extensions::Filters::Common::Lua::ThreadLocalState::registerType<Envoy::Extensions::Filters::Common::Lua::BufferWrapper>()::{lambda()#1}::operator()() const (__closure=<optimized out>) at bazel-out/k8-opt/bin/external/envoy/source/extensions/filters/common/lua/_virtual_includes/lua_lib/extensions/filters/common/lua/lua.h:390
    #2  std::_Function_handler<void (), Envoy::Extensions::Filters::Common::Lua::ThreadLocalState::registerType<Envoy::Extensions::Filters::Common::Lua::BufferWrapper>()::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/7/bits/std_function.h:316
    #3  0x000055e16ec9b053 in std::function<void ()>::operator()() const (this=0x7f7fb763b850) at /usr/include/c++/7/bits/std_function.h:706
    #4  Envoy::Event::DispatcherImpl::runPostCallbacks (this=this@entry=0x55e17203cdc0) at external/envoy/source/common/event/dispatcher_impl.cc:222
    #5  0x000055e16ec9b14c in Envoy::Event::DispatcherImpl::run (this=0x55e17203cdc0, type=Envoy::Event::Dispatcher::RunType::Block) at external/envoy/source/common/event/dispatcher_impl.cc:193
    #6  0x000055e16ec951fa in Envoy::Server::WorkerImpl::threadRoutine (this=0x55e172222850, guard_dog=...) at external/envoy/source/server/worker_impl.cc:110
    #7  0x000055e16eff01d5 in std::function<void ()>::operator()() const (this=<optimized out>) at /usr/include/c++/7/bits/std_function.h:706
    #8  Envoy::Thread::ThreadImplPosix::<lambda(void*)>::operator() (__closure=0x0, arg=<optimized out>) at external/envoy/source/common/common/posix/thread_impl.cc:33
    #9  Envoy::Thread::ThreadImplPosix::<lambda(void*)>::_FUN(void *) () at external/envoy/source/common/common/posix/thread_impl.cc:35
    #10 0x00007f7fbba2a6ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
    #11 0x00007f7fbb76041d in clone () from /lib/x86_64-linux-gnu/libc.so.6
mattklein123 commented 4 years ago

Is there any more detail you can share about your use of Lua or any of the times that it crashes? Do you use LDS to deliver Lua config? Nothing immediately comes to mind from the core dump. cc @dio

dio commented 4 years ago

ACK, more info on this will be very helpful.

surki commented 4 years ago

@mattklein123 @dio I see a similar crash due to SDS problem (when certs are missing), may be related to this? https://github.com/envoyproxy/envoy/issues/9765#issuecomment-593965601

@jmuia do you have listeners with tls context configured?

mattklein123 commented 4 years ago

Yes I think it's possible that the crash is due to listener teardown issues. We just fixed something similar in a different filter (cc @yuval-k), but it would be nice to confirm that is what is happening.

jmuia commented 4 years ago

Is there any more detail you can share about your use of Lua or any of the times that it crashes?

Uses of Lua include:

Timing of crashes: I need to collect more data here, but the couple of crashes that I've spot checked so far have happened upon a cold start.

...
[2020-03-04 01:38:32.457][2678][info][config] [external/envoy/source/server/listener_manager_impl.cc:707] all dependencies initialized. starting workers
[2020-03-04 01:38:32.479][3604][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:83] Caught Segmentation fault, suspect faulting address 0x0

Do you use LDS to deliver Lua config?

Yep, we use LDS for this.

Nothing immediately comes to mind from the core dump.

If it would be helpful I can provide the backtrace for all threads from the core dump.

do you have listeners with tls context configured?

Yep, the listener that is configured with a Lua filter also configures tls_context.

--

I've attached below the (sanitized) log output prior to a crash:

[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:251] initializing epoch 0 (hot restart version=11.104)
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:253] statically linked extensions:
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.transport_sockets.downstream: envoy.transport_sockets.alts, envoy.transport_sockets.raw_buffer, envoy.transport_sockets.tap, envoy.transport_sockets.tls, raw_buffer, tls
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.filters.http: envoy.buffer, envoy.cors, envoy.csrf, envoy.ext_authz, envoy.fault, envoy.filters.http.adaptive_concurrency, envoy.filters.http.dynamic_forward_proxy, envoy.filters.http.grpc_http1_reverse_bridge, envoy.filters.http.grpc_stats, envoy.filters.http.header_to_metadata, envoy.filters.http.jwt_authn, envoy.filters.http.on_demand, envoy.filters.http.original_src, envoy.filters.http.rbac, envoy.filters.http.tap, envoy.grpc_http1_bridge, envoy.grpc_json_transcoder, envoy.grpc_web, envoy.gzip, envoy.health_check, envoy.http_dynamo_filter, envoy.ip_tagging, envoy.lua, envoy.rate_limit, envoy.router, envoy.squash
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.thrift_proxy.filters: envoy.filters.thrift.rate_limit, envoy.filters.thrift.router
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.dubbo_proxy.protocols: dubbo
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.filters.network: envoy.client_ssl_auth, envoy.echo, envoy.ext_authz, envoy.filters.network.dubbo_proxy, envoy.filters.network.kafka_broker, envoy.filters.network.local_ratelimit, envoy.filters.network.mysql_proxy, envoy.filters.network.rbac, envoy.filters.network.sni_cluster, envoy.filters.network.thrift_proxy, envoy.filters.network.zookeeper_proxy, envoy.http_connection_manager, envoy.mongo_proxy, envoy.ratelimit, envoy.redis_proxy, envoy.tcp_proxy
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.health_checkers: envoy.health_checkers.redis
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.thrift_proxy.protocols: auto, binary, binary/non-strict, compact, twitter
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.access_loggers: envoy.file_access_log, envoy.http_grpc_access_log, envoy.tcp_grpc_access_log
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.retry_priorities: envoy.retry_priorities.previous_priorities
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.resolvers: envoy.ip
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.dubbo_proxy.route_matchers: default
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.filters.udp_listener: envoy.filters.udp_listener.udp_proxy
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.resource_monitors: envoy.resource_monitors.fixed_heap, envoy.resource_monitors.injected_resource
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.udp_listeners: raw_udp_listener
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.filters.listener: envoy.listener.http_inspector, envoy.listener.original_dst, envoy.listener.original_src, envoy.listener.proxy_protocol, envoy.listener.tls_inspector
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.dubbo_proxy.filters: envoy.filters.dubbo.router
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.retry_host_predicates: envoy.retry_host_predicates.omit_canary_hosts, envoy.retry_host_predicates.previous_hosts
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.dubbo_proxy.serializers: dubbo.hessian2
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.thrift_proxy.transports: auto, framed, header, unframed
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.stats_sinks: envoy.dog_statsd, envoy.metrics_service, envoy.stat_sinks.hystrix, envoy.statsd
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.tracers: envoy.dynamic.ot, envoy.lightstep, envoy.tracers.datadog, envoy.tracers.opencensus, envoy.tracers.xray, envoy.zipkin
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.clusters: envoy.cluster.eds, envoy.cluster.logical_dns, envoy.cluster.original_dst, envoy.cluster.static, envoy.cluster.strict_dns, envoy.clusters.aggregate, envoy.clusters.dynamic_forward_proxy, envoy.clusters.redis
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.transport_sockets.upstream: envoy.transport_sockets.alts, envoy.transport_sockets.raw_buffer, envoy.transport_sockets.tap, envoy.transport_sockets.tls, raw_buffer, tls
[2020-03-04 01:38:25.967][2678][info][main] [external/envoy/source/server/server.cc:255]   envoy.grpc_credentials: envoy.grpc_credentials.aws_iam, envoy.grpc_credentials.default, envoy.grpc_credentials.file_based_metadata
[2020-03-04 01:38:26.069][2678][info][main] [external/envoy/source/server/server.cc:336] admin address: IP:PORT
[2020-03-04 01:38:26.078][2678][info][main] [external/envoy/source/server/server.cc:455] runtime: layers:
  - name: static_layer
    static_layer:
      envoy.deprecated_features:envoy.api.v2.Cluster.tls_context: true
[2020-03-04 01:38:26.082][2678][info][config] [external/envoy/source/server/configuration_impl.cc:62] loading 0 static secret(s)
[2020-03-04 01:38:26.082][2678][info][config] [external/envoy/source/server/configuration_impl.cc:68] loading 5 cluster(s)
[2020-03-04 01:38:26.130][2678][info][config] [external/envoy/source/server/configuration_impl.cc:72] loading 0 listener(s)
[2020-03-04 01:38:26.133][2678][info][config] [external/envoy/source/server/configuration_impl.cc:97] loading tracing configuration
[2020-03-04 01:38:26.133][2678][info][config] [external/envoy/source/server/configuration_impl.cc:116] loading stats sink configuration
[2020-03-04 01:38:26.135][2678][info][main] [external/envoy/source/server/server.cc:550] starting main dispatch loop
[2020-03-04 01:38:26.189][2678][info][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:167] cm init: initializing cds
[2020-03-04 01:38:26.369][2678][info][upstream] [external/envoy/source/common/upstream/cds_api_impl.cc:74] cds: add 72 cluster(s), remove 5 cluster(s)
[2020-03-04 01:38:26.418][2678][info][upstream] [external/envoy/source/common/upstream/cds_api_impl.cc:90] cds: add/update cluster 'a-cluster'
<truncated for brevity cds: add/update cluster ....>
[2020-03-04 01:38:26.418][2678][info][upstream] [external/envoy/source/common/upstream/cds_api_impl.cc:90] cds: add/update cluster 'some-cluster'
[2020-03-04 01:38:28.352][2678][info][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:171] cm init: all clusters initialized
[2020-03-04 01:38:28.365][2678][info][main] [external/envoy/source/server/server.cc:529] all clusters initialized. initializing init manager
[2020-03-04 01:38:30.777][2678][info][upstream] [external/envoy/source/server/lds_api.cc:73] lds: add/update listener 'a-listener'
[2020-03-04 01:38:30.779][2678][info][upstream] [external/envoy/source/server/lds_api.cc:73] lds: add/update listener 'other-listener'
[2020-03-04 01:38:31.192][2678][info][lua] [external/envoy/source/extensions/filters/http/lua/lua_filter.cc:510] envoy_on_response() function not found. Lua filter will not hook responses.
[2020-03-04 01:38:31.194][2678][info][lua] [external/envoy/source/extensions/filters/http/lua/lua_filter.cc:510] envoy_on_response() function not found. Lua filter will not hook responses.
[2020-03-04 01:38:31.258][2678][info][lua] [external/envoy/source/extensions/filters/http/lua/lua_filter.cc:510] envoy_on_response() function not found. Lua filter will not hook responses.
[2020-03-04 01:38:31.259][2678][info][lua] [external/envoy/source/extensions/filters/http/lua/lua_filter.cc:510] envoy_on_response() function not found. Lua filter will not hook responses.
[2020-03-04 01:38:31.272][2678][info][upstream] [external/envoy/source/server/lds_api.cc:73] lds: add/update listener 'listener-with-lua-and-tls-context'
[2020-03-04 01:38:31.288][2678][info][upstream] [external/envoy/source/server/lds_api.cc:73] lds: add/update listener 'another-listener'
[2020-03-04 01:38:31.293][2678][info][lua] [external/envoy/source/extensions/filters/http/lua/lua_filter.cc:510] envoy_on_response() function not found. Lua filter will not hook responses.
[2020-03-04 01:38:31.294][2678][info][upstream] [external/envoy/source/server/lds_api.cc:73] lds: add/update listener 'another-listener-2'
[2020-03-04 01:38:31.297][2678][info][lua] [external/envoy/source/extensions/filters/http/lua/lua_filter.cc:510] envoy_on_response() function not found. Lua filter will not hook responses.
[2020-03-04 01:38:31.298][2678][info][upstream] [external/envoy/source/server/lds_api.cc:73] lds: add/update listener 'another-listener-3'
[2020-03-04 01:38:31.299][2678][info][upstream] [external/envoy/source/server/lds_api.cc:73] lds: add/update listener 'another-listener-4'
[2020-03-04 01:38:31.300][2678][info][upstream] [external/envoy/source/server/lds_api.cc:73] lds: add/update listener 'another-listener-5'
[2020-03-04 01:38:32.069][2678][info][lua] [external/envoy/source/extensions/filters/http/lua/lua_filter.cc:510] envoy_on_response() function not found. Lua filter will not hook responses.
[2020-03-04 01:38:32.070][2678][info][lua] [external/envoy/source/extensions/filters/http/lua/lua_filter.cc:510] envoy_on_response() function not found. Lua filter will not hook responses.
[2020-03-04 01:38:32.120][2678][info][lua] [external/envoy/source/extensions/filters/http/lua/lua_filter.cc:510] envoy_on_response() function not found. Lua filter will not hook responses.
[2020-03-04 01:38:32.120][2678][info][lua] [external/envoy/source/extensions/filters/http/lua/lua_filter.cc:510] envoy_on_response() function not found. Lua filter will not hook responses.
[2020-03-04 01:38:32.137][2678][info][upstream] [external/envoy/source/server/lds_api.cc:73] lds: add/update listener 'listener-with-lua-and-tls-context'
[2020-03-04 01:38:32.457][2678][info][config] [external/envoy/source/server/listener_manager_impl.cc:707] all dependencies initialized. starting workers
[2020-03-04 01:38:32.479][3604][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:83] Caught Segmentation fault, suspect faulting address 0x0
mattklein123 commented 4 years ago

Possibly related: https://github.com/envoyproxy/envoy/issues/9940

yuval-k commented 4 years ago

looking at the code and stack trace i imagine that the issue is capturing this in this line: https://github.com/envoyproxy/envoy/blob/eb894d9d0e5373469286c9fa1c91045139176a1e/source/extensions/filters/common/lua/lua.h#L390

an quick solution would be to pass tls_slot_ by copy instead of this do note that note above is just from reviewing code, I didn't test this.

wbpcode commented 4 years ago

The reason is that when the worker executes registerType, the ThreadLocalState object captured by the callback function has been destructed. I encountered this problem when trying to implement routing granularity Lua scripts #11235. This problem is triggered when you create a ThreadLocalState object during initialization and immediately destruct it. Your Listener listener-with-lua-and-tls-context was updated (but why?) during the initialization phase, which means that the old listener configuration was destructed, so this problem was triggered.

[2020-03-04 01:38:31.272][2678][info][upstream] [external/envoy/source/server/lds_api.cc:73] lds: add/update listener 'listener-with-lua-and-tls-context'
.....................
[2020-03-04 01:38:32.137][2678][info][upstream] [external/envoy/source/server/lds_api.cc:73] lds: add/update listener 'listener-with-lua-and-tls-context'
wbpcode commented 4 years ago

@jmuia You can check to see if this problem has been fixed by #11944 . If it has been fixed, we can close this issue. If this issue still exists, I hope you can provide more information, such as how to reproduce it.

jmuia commented 3 years ago

Thanks @wbpcode.

The issue was spurious and we never found a repro. I'm not sure yet what conditions changed in our environment, but we stopped triggering this segfault without deploying https://github.com/envoyproxy/envoy/pull/11944 internally.

I'll close this issue for now since you've merged https://github.com/envoyproxy/envoy/pull/11944 and will re-open if it re-surfaces.