envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.85k stars 4.78k forks source link

Connection Terminated During Envoy Reload #7766

Closed naftulikay closed 5 years ago

naftulikay commented 5 years ago

Title: Connection Terminated During Envoy Reload

Description:

At my organization, we are preparing a large-scale Envoy deployment for serving all network traffic at the edge on ingress to our network. We are using static configuration files on disk and using the Python reloader script to execute hot reloads when our configuration on disk changes.

We have conducted load tests using an internal Python tool which simply uses the Python requests library and a keepalive session to execute GET requests in a loop.

While running this tool, when we execute an Envoy reload, many of our load-testing clients experience a TCP termination seemingly during a request/response flow. After much debugging, it appears that Envoy is not using layer 7 Connection: close methods to tell clients to terminate their keepalive connections, rather it is simply terminating at layer 4 with a TCP close.

Repro steps:

We are using Envoy 1.10.0 on Ubuntu 16.04 in AWS.

I have open-sourced our load-testing tool using the exact Python version, requests version, etc. to reliably reproduce the issue.

Our systemd unit for running Envoy:

envoy.service

[Unit]
Description=Envoy Proxy
Requires=network-online.target
After=network-online.target

[Service]
Type=simple
Environment="ENVOY_CONFIG_FILE=/etc/envoy/envoy.yaml"
Environment="ENVOY_START_OPTS=--use-libevent-buffers 0 --parent-shutdown-time-s 60"
ExecStart=/usr/local/bin/envoy-restarter.py /usr/local/bin/start-envoy.sh
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

We are using the exact Python restarter tool that is currently in master.

The script that we have the Python restarter tool run is this:

#!/bin/bash

exec /usr/local/bin/envoy -c $ENVOY_CONFIG_FILE $ENVOY_START_OPTS --restart-epoch $RESTART_EPOCH

The stacktrace received from Python is:

2019-07-30T19:34:29.147Z [INFO ] tardypoodle.serial: Starting serial execution of requests...
Traceback (most recent call last):
  File "/home/naftuli/devel/src/github.com/naftulikay/tardypoodle/bin/python", line 37, in <module>
    exec(compile(__file__f.read(), __file__, "exec"))
  File "/snap/pycharm-community/143/helpers/pydev/pydevd.py", line 2060, in <module>
    main()
  File "/snap/pycharm-community/143/helpers/pydev/pydevd.py", line 2054, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/snap/pycharm-community/143/helpers/pydev/pydevd.py", line 1405, in run
    return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
  File "/snap/pycharm-community/143/helpers/pydev/pydevd.py", line 1412, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/naftuli/devel/src/github.com/naftulikay/tardypoodle/bin/tardypoodle", line 18, in <module>
    sys.exit(tardypoodle.main())
  File "/home/naftuli/devel/src/github.com/naftulikay/tardypoodle/src/tardypoodle/__init__.py", line 36, in main
    tardypoodle.serial.execute(config)
  File "/home/naftuli/devel/src/github.com/naftulikay/tardypoodle/src/tardypoodle/serial.py", line 29, in execute
    send(logger=logger, counter=counter, requester=requester, url=config.url, headers=config.headers)
  File "/home/naftuli/devel/src/github.com/naftulikay/tardypoodle/src/tardypoodle/generic.py", line 12, in send
    r = requester.get(url, headers=headers)
  File "/home/naftuli/devel/src/github.com/naftulikay/tardypoodle/eggs/requests-2.9.1-py2.7.egg/requests/sessions.py", line 480, in get
    return self.request('GET', url, **kwargs)
  File "/home/naftuli/devel/src/github.com/naftulikay/tardypoodle/eggs/requests-2.9.1-py2.7.egg/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/naftuli/devel/src/github.com/naftulikay/tardypoodle/eggs/requests-2.9.1-py2.7.egg/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/home/naftuli/devel/src/github.com/naftulikay/tardypoodle/eggs/requests-2.9.1-py2.7.egg/requests/adapters.py", line 426, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",))

Envoy collect does not work for me, as it complains about not being able to parse the YAML file as JSON apparently.

Config:

---
static_resources:
  listeners:
  - name: listener_http
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 80

    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          stat_prefix: ingress_http
          route_config:
            name: route_config
            virtual_hosts:
            - name: abc.mycompany.com
              domains: ["abc.mycompany.com", "*.abc.mycompany.com"]
              routes:
                - match:
                    prefix: "/"
                  route:
                    cluster: abc

          http_filters:
          - name: envoy.router

  clusters:
  - name: abc
    connect_timeout: 1.0s
    type: LOGICAL_DNS
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    hosts:
    - socket_address:
        address: abc.internal.mycompany.com
        port_value: 80
    circuit_breakers:
      thresholds:
        priority: DEFAULT
        max_connections: 100000000
        max_pending_requests: 1000000000
        max_requests: 100000000
        max_retries: 1000000000

admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }

Logs:

Jul 30 19:31:30 hostname systemd[1]: Reloading Envoy Proxy.
Jul 30 19:31:30 hostname systemd[1]: Reloaded Envoy Proxy.
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.056][13609][info][main] [source/server/server.cc:205] initializing epoch 5 (hot restart version=10.200.16384.127.options=capacity=16384, num_slots=8209 hash=228984379728933363 size=2654312)
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.056][13609][info][main] [source/server/server.cc:207] statically linked extensions:
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.056][13609][info][main] [source/server/server.cc:209]   access_loggers: envoy.file_access_log,envoy.http_grpc_access_log
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.056][13609][info][main] [source/server/server.cc:212]   filters.http: envoy.buffer,envoy.cors,envoy.ext_authz,envoy.fault,envoy.filters.http.grpc_http1_reverse_bridge,envoy.filters.http.header_to_metadata,envoy.filters.http.jwt_authn,envoy.filters.http.rbac,envoy.filters.http.tap,envoy.grpc_http1_bridge,envoy.grpc_json_transcoder,envoy.grpc_web,envoy.gzip,envoy.health_check,envoy.http_dynamo_filter,envoy.ip_tagging,envoy.lua,envoy.rate_limit,envoy.router,envoy.squash
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.056][13609][info][main] [source/server/server.cc:215]   filters.listener: envoy.listener.original_dst,envoy.listener.original_src,envoy.listener.proxy_protocol,envoy.listener.tls_inspector
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.056][13609][info][main] [source/server/server.cc:218]   filters.network: envoy.client_ssl_auth,envoy.echo,envoy.ext_authz,envoy.filters.network.dubbo_proxy,envoy.filters.network.mysql_proxy,envoy.filters.network.rbac,envoy.filters.network.sni_cluster,envoy.filters.network.thrift_proxy,envoy.filters.network.zookeeper_proxy,envoy.http_connection_manager,envoy.mongo_proxy,envoy.ratelimit,envoy.redis_proxy,envoy.tcp_proxy
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.056][13609][info][main] [source/server/server.cc:220]   stat_sinks: envoy.dog_statsd,envoy.metrics_service,envoy.stat_sinks.hystrix,envoy.statsd
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.056][13609][info][main] [source/server/server.cc:222]   tracers: envoy.dynamic.ot,envoy.lightstep,envoy.tracers.datadog,envoy.zipkin
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.056][13609][info][main] [source/server/server.cc:225]   transport_sockets.downstream: envoy.transport_sockets.alts,envoy.transport_sockets.tap,raw_buffer,tls
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.056][13609][info][main] [source/server/server.cc:228]   transport_sockets.upstream: envoy.transport_sockets.alts,envoy.transport_sockets.tap,raw_buffer,tls
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.056][13609][info][main] [source/server/server.cc:234] buffer implementation: new
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.067][13609][warning][misc] [source/common/protobuf/utility.cc:173] Using deprecated option 'envoy.api.v2.Cluster.hosts' from file cds.proto. This configuration will be removed from Envoy soon. Please see https://github.com/envoyproxy/envoy/blob/master/DEPRECATED.md for details.
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.067][13609][warning][misc] [source/common/protobuf/utility.cc:173] Using deprecated option 'envoy.api.v2.Cluster.hosts' from file cds.proto. This configuration will be removed from Envoy soon. Please see https://github.com/envoyproxy/envoy/blob/master/DEPRECATED.md for details.
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.067][13609][warning][misc] [source/common/protobuf/utility.cc:173] Using deprecated option 'envoy.api.v2.Cluster.hosts' from file cds.proto. This configuration will be removed from Envoy soon. Please see https://github.com/envoyproxy/envoy/blob/master/DEPRECATED.md for details.
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.067][13499][warning][main] [source/server/server.cc:536] shutting down admin due to child startup
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.067][13499][warning][main] [source/server/server.cc:544] terminating parent process
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.067][30905][info][main] [source/server/hot_restart_impl.cc:415] shutting down due to child request
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.067][30905][warning][main] [source/server/server.cc:425] caught SIGTERM
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.067][30905][info][main] [source/server/server.cc:529] shutting down server instance
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.067][30905][info][main] [source/server/server.cc:483] main dispatch loop exited
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.067][13609][info][main] [source/server/server.cc:281] admin address: 0.0.0.0:9901
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.068][13609][info][config] [source/server/configuration_impl.cc:50] loading 0 static secret(s)
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.068][13609][info][config] [source/server/configuration_impl.cc:56] loading 4 cluster(s)
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.069][13609][info][config] [source/server/configuration_impl.cc:60] loading 2 listener(s)
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.072][30905][info][main] [source/server/server.cc:522] exiting
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.072][13609][info][config] [source/server/configuration_impl.cc:85] loading tracing configuration
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.072][13609][info][config] [source/server/configuration_impl.cc:105] loading stats sink configuration
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.072][13609][info][main] [source/server/server.cc:478] starting main dispatch loop
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.072][13609][info][upstream] [source/common/upstream/cluster_manager_impl.cc:137] cm init: all clusters initialized
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.072][13609][info][main] [source/server/server.cc:462] all clusters initialized. initializing init manager
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.072][13609][info][config] [source/server/listener_manager_impl.cc:1006] all dependencies initialized. starting workers
Jul 30 19:31:30 hostname envoy-restarter.py[29232]: [2019-07-30 19:31:30.080][13499][info][main] [source/server/server.cc:126] closing and draining listeners
mattklein123 commented 5 years ago

2 things:

1) Upgrade so you have https://github.com/envoyproxy/envoy/pull/6848. 2) Make sure --drain-time-s is less than --parent-shutdown-time-s

naftulikay commented 5 years ago

@mattklein123 Is there a way for me to tell when #6848 landed? Looked through the release notes, having a hard time determining if v1.10 had it.

naftulikay commented 5 years ago

After deploying v1.11.0, I am unable to reproduce the TCP close issue :tada: so I am closing this issue.