envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.95k stars 4.8k forks

Envoy process stuck at initialization stage #16267

Closed: charlesxu2000 closed this issue 3 years ago

charlesxu2000 commented 3 years ago

Title: Envoy process stuck at initialization stage

Description:

This happens very randomly. When the Envoy process starts, it gets stuck after "loading listeners" and before "loading tracing configuration". As a result, "starting main dispatch loop" never happens and the connection manager port is not opened.

After attaching strace to the Envoy process, it looks like it is looping on getting entropy. Here is the strace output:

[pid  9507] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  9507] futex(0x204fc48, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  9507] futex(0x204fc98, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=999751057} <unfinished ...>
[pid  9503] <... nanosleep resumed> NULL) = 0
[pid  9503] ioctl(45, RNDGETENTCNT, 0x7fff4b708d6c) = 0
[pid  9503] nanosleep({tv_sec=0, tv_nsec=250000000}, NULL) = 0
[pid  9503] ioctl(45, RNDGETENTCNT, 0x7fff4b708d6c) = 0
[pid  9503] nanosleep({tv_sec=0, tv_nsec=250000000}, NULL) = 0
[pid  9503] ioctl(45, RNDGETENTCNT, 0x7fff4b708d6c) = 0
[pid  9503] nanosleep({tv_sec=0, tv_nsec=250000000}, NULL) = 0
[pid  9503] ioctl(45, RNDGETENTCNT, 0x7fff4b708d6c) = 0
[pid  9503] nanosleep({tv_sec=0, tv_nsec=250000000},  <unfinished ...>
[pid  9507] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  9507] futex(0x204fc48, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  9507] futex(0x204fc98, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=999764365} <unfinished ...>
[pid  9503] <... nanosleep resumed> NULL) = 0
[pid  9503] ioctl(45, RNDGETENTCNT, 0x7fff4b708d6c) = 0

With Envoy trace logging on, it seems to be stuck at https://github.com/envoyproxy/envoy/blob/release/v1.13/source/server/configuration_impl.cc#L75

Envoy version: b67c14052c49890a7e3afe614d50979c346c024b/1.13.1/Clean/RELEASE/BoringSSL-FIPS

I also tried 1.17.1 and got the same result.

Linux version is: Linux 3.10.62-ltsi #1 SMP PREEMPT Sat May 1 03:03:01 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Repro steps:

This happens very intermittently.

Logs:

Trace log:

[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:251] initializing epoch 0 (hot restart version=11.104)
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:253] statically linked extensions:
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.thrift_proxy.filters: envoy.filters.thrift.rate_limit, envoy.filters.thrift.router
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.transport_sockets.upstream: envoy.transport_sockets.alts, envoy.transport_sockets.raw_buffer, envoy.transport_sockets.tap, envoy.transport_sockets.tls, raw_buffer, tls
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.grpc_credentials: envoy.grpc_credentials.aws_iam, envoy.grpc_credentials.default, envoy.grpc_credentials.file_based_metadata
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.clusters: envoy.cluster.eds, envoy.cluster.logical_dns, envoy.cluster.original_dst, envoy.cluster.static, envoy.cluster.strict_dns, envoy.clusters.aggregate, envoy.clusters.dynamic_forward_proxy, envoy.clusters.redis
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.dubbo_proxy.serializers: dubbo.hessian2
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.resource_monitors: envoy.resource_monitors.fixed_heap, envoy.resource_monitors.injected_resource
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.stats_sinks: envoy.dog_statsd, envoy.metrics_service, envoy.stat_sinks.hystrix, envoy.statsd
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.resolvers: envoy.ip
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.transport_sockets.downstream: envoy.transport_sockets.alts, envoy.transport_sockets.raw_buffer, envoy.transport_sockets.tap, envoy.transport_sockets.tls, raw_buffer, tls
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.health_checkers: envoy.health_checkers.redis
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.dubbo_proxy.route_matchers: default
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.retry_host_predicates: envoy.retry_host_predicates.omit_canary_hosts, envoy.retry_host_predicates.previous_hosts
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.filters.listener: envoy.listener.http_inspector, envoy.listener.original_dst, envoy.listener.original_src, envoy.listener.proxy_protocol, envoy.listener.tls_inspector
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.thrift_proxy.transports: auto, framed, header, unframed
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.tracers: envoy.dynamic.ot, envoy.lightstep, envoy.tracers.datadog, envoy.tracers.opencensus, envoy.tracers.xray, envoy.zipkin
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.filters.udp_listener: envoy.filters.udp_listener.udp_proxy
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.access_loggers: envoy.file_access_log, envoy.http_grpc_access_log, envoy.tcp_grpc_access_log
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.retry_priorities: envoy.retry_priorities.previous_priorities
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.dubbo_proxy.protocols: dubbo
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.udp_listeners: raw_udp_listener
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.dubbo_proxy.filters: envoy.filters.dubbo.router
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.filters.http: envoy.buffer, envoy.cors, envoy.csrf, envoy.ext_authz, envoy.fault, envoy.filters.http.adaptive_concurrency, envoy.filters.http.dynamic_forward_proxy, envoy.filters.http.grpc_http1_reverse_bridge, envoy.filters.http.grpc_stats, envoy.filters.http.header_to_metadata, envoy.filters.http.jwt_authn, envoy.filters.http.on_demand, envoy.filters.http.original_src, envoy.filters.http.rbac, envoy.filters.http.tap, envoy.grpc_http1_bridge, envoy.grpc_json_transcoder, envoy.grpc_web, envoy.gzip, envoy.health_check, envoy.http_dynamo_filter, envoy.ip_tagging, envoy.lua, envoy.rate_limit, envoy.router, envoy.squash
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.thrift_proxy.protocols: auto, binary, binary/non-strict, compact, twitter
[2021-05-02 03:07:30.458][13820][info][main] [source/server/server.cc:255] envoy.filters.network: envoy.client_ssl_auth, envoy.echo, envoy.ext_authz, envoy.filters.network.dubbo_proxy, envoy.filters.network.kafka_broker, envoy.filters.network.local_ratelimit, envoy.filters.network.mysql_proxy, envoy.filters.network.rbac, envoy.filters.network.sni_cluster, envoy.filters.network.thrift_proxy, envoy.filters.network.zookeeper_proxy, envoy.http_connection_manager, envoy.mongo_proxy, envoy.ratelimit, envoy.redis_proxy, envoy.tcp_proxy
[2021-05-02 03:07:30.464][13820][warning][misc] [source/common/protobuf/utility.cc:441] Using deprecated option 'envoy.api.v2.listener.FilterChain.tls_context' from file listener_components.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
[2021-05-02 03:07:30.464][13820][warning][misc] [source/common/protobuf/utility.cc:441] Using deprecated option 'envoy.api.v2.Cluster.tls_context' from file cluster.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
[2021-05-02 03:07:30.464][13820][warning][main] [source/server/server.cc:342] No admin address given, so no admin HTTP server started.
[2021-05-02 03:07:30.464][13820][debug][main] [source/server/overload_manager_impl.cc:184] No overload action is configured for envoy.overload_actions.shrink_heap.
[2021-05-02 03:07:30.464][13820][debug][main] [source/server/overload_manager_impl.cc:184] No overload action is configured for envoy.overload_actions.stop_accepting_connections.
[2021-05-02 03:07:30.464][13820][debug][main] [source/server/overload_manager_impl.cc:184] No overload action is configured for envoy.overload_actions.stop_accepting_connections.
[2021-05-02 03:07:30.464][13820][debug][main] [source/server/overload_manager_impl.cc:184] No overload action is configured for envoy.overload_actions.stop_accepting_connections.
[2021-05-02 03:07:30.464][13820][debug][main] [source/server/overload_manager_impl.cc:184] No overload action is configured for envoy.overload_actions.stop_accepting_connections.
[2021-05-02 03:07:30.464][13820][debug][main] [source/server/overload_manager_impl.cc:184] No overload action is configured for envoy.overload_actions.stop_accepting_connections.
[2021-05-02 03:07:30.464][13820][debug][main] [source/server/overload_manager_impl.cc:184] No overload action is configured for envoy.overload_actions.stop_accepting_connections.
[2021-05-02 03:07:30.464][13820][debug][main] [source/server/overload_manager_impl.cc:184] No overload action is configured for envoy.overload_actions.stop_accepting_connections.
[2021-05-02 03:07:30.464][13820][debug][main] [source/server/overload_manager_impl.cc:184] No overload action is configured for envoy.overload_actions.stop_accepting_connections.
[2021-05-02 03:07:30.464][13820][info][main] [source/server/server.cc:455] runtime: layers:

yanavlasov commented 3 years ago

It is hard to tell what the problem is from the information provided. The best approach is to collect a core dump and get a stack trace showing where Envoy has stopped.

antoniovicente commented 3 years ago

cc @ggreenway

It would be good to get more information, including what is trying to get entropy but not finding enough. Maybe the SSL libraries?

ggreenway commented 3 years ago

I'd never seen that ioctl before and didn't realize it was crypto related. It looks like you're in this FIPS-only BoringSSL code:

https://boringssl.googlesource.com/boringssl/+/1eb7769e1189f90e41b81118d903090dbbe98ea6/crypto/fipsmodule/rand/urandom.c#280

#if defined(BORINGSSL_FIPS) && !defined(URANDOM_BLOCKS_FOR_ENTROPY)
  // In FIPS mode on platforms where urandom doesn't block at startup, we ensure
  // that the kernel has sufficient entropy before continuing. This is
  // automatically handled by getrandom, which requires that the entropy pool
  // has been initialised, but for urandom we have to poll.
  for (;;) {
    int entropy_bits;
    if (ioctl(fd, RNDGETENTCNT, &entropy_bits)) {
      fprintf(stderr,
              "RNDGETENTCNT on /dev/urandom failed. We cannot continue in this "
              "case when in FIPS mode.\n");
      abort();
    }

    static const int kBitsNeeded = 256;
    if (entropy_bits >= kBitsNeeded) {
      break;
    }

    usleep(250000);
  }
#endif  // BORINGSSL_FIPS && !URANDOM_BLOCKS_FOR_ENTROPY
}
antoniovicente commented 3 years ago

Out of curiosity, who is your CPU vendor? Are you running on a VM or on bare metal? Does the machine have spinning disks or only SSDs?

charlesxu2000 commented 3 years ago

It is a FIPS-enabled Envoy build, and we run on ESXi. It is most likely due to the low amount of available entropy.

charlesxu2000 commented 3 years ago

The root cause of this issue is that the number of available entropy bits is too low. Envoy's BoringSSL requires at least 256 bits, and in our VM it was below that. After increasing the available entropy, Envoy comes up.