DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0
2.87k stars 1.2k forks source link

ECS Fargate: Agent shuts down after panicking, health check seems to still be passing #6570

Open xsc opened 4 years ago

xsc commented 4 years ago

Output of the info page (if this is a bug)

Sorry, access to this is limited right now; hopefully the below description still helps.

Describe what happened:

The datadog agent stopped working after a panic has been encountered:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   timestamp   |                                                                                                                                                 message                                                                                                                                                  |
|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1603746245218 | 2020-10-26 21:04:05 UTC | PROCESS | INFO | (collector.go:209 in func1) | Delivery queues: process[size=0, weight=0], pod[size=0, weight=0]                                                                                                                                                               |
| 1603746294953 | 2020-10-26 21:04:54 UTC | TRACE | INFO | (pkg/trace/info/stats.go:101 in LogStats) | No data received                                                                                                                                                                                                    |
| 1603746295226 | 2020-10-26 21:04:55 UTC | PROCESS | INFO | (collector.go:161 in runCheck) | Finish container check #4200 in 4.243073ms                                                                                                                                                                                   |
| 1603746305218 | 2020-10-26 21:05:05 UTC | PROCESS | INFO | (collector.go:209 in func1) | Delivery queues: process[size=0, weight=0], pod[size=0, weight=0]                                                                                                                                                               |
| 1603746308234 | 2020-10-26 21:05:08 UTC | CORE | INFO | (pkg/serializer/serializer.go:356 in sendMetadata) | Sent metadata payload, size (raw/compressed): 467/287 bytes.                                                                                                                                                |
| 1603746308250 | 2020-10-26 21:05:08 UTC | CORE | INFO | (pkg/serializer/serializer.go:376 in SendJSONToV1Intake) | Sent processes metadata payload, size: 392 bytes.                                                                                                                                                     |
| 1603746354953 | 2020-10-26 21:05:54 UTC | TRACE | INFO | (pkg/trace/info/stats.go:101 in LogStats) | No data received                                                                                                                                                                                                    |
| 1603746365218 | 2020-10-26 21:06:05 UTC | PROCESS | INFO | (collector.go:209 in func1) | Delivery queues: process[size=0, weight=0], pod[size=0, weight=0]                                                                                                                                                               |
| 1603746424952 | 2020-10-26 21:07:04 UTC | TRACE | INFO | (pkg/trace/info/stats.go:101 in LogStats) | No data received                                                                                                                                                                                                    |
| 1603746425218 | 2020-10-26 21:07:05 UTC | PROCESS | INFO | (collector.go:209 in func1) | Delivery queues: process[size=0, weight=0], pod[size=0, weight=0]                                                                                                                                                               |
| 1603746485218 | 2020-10-26 21:08:05 UTC | PROCESS | INFO | (collector.go:209 in func1) | Delivery queues: process[size=0, weight=0], pod[size=0, weight=0]                                                                                                                                                               |
| 1603746485758 | 2020-10-26 21:08:05 UTC | CORE | ERROR | (pkg/autodiscovery/listeners/ecs.go:117 in refreshServices) | failed to get task metadata, not refreshing services - Get "http://169.254.170.2/v2/metadata": dial tcp 169.254.170.2:80: connect: connection refused                                             |
| 1603746487758 | 2020-10-26 21:08:07 UTC | CORE | ERROR | (pkg/autodiscovery/config_poller.go:123 in collect) | Unable to collect configurations from provider ecs: Get "http://169.254.170.2/v2/metadata": dial tcp 169.254.170.2:80: connect: connection refused                                                        |
| 1603746487758 | 2020-10-26 21:08:07 UTC | CORE | ERROR | (pkg/autodiscovery/listeners/ecs.go:117 in refreshServices) | failed to get task metadata, not refreshing services - Get "http://169.254.170.2/v2/metadata": dial tcp 169.254.170.2:80: connect: connection refused                                             |
| 1603746488833 | 2020-10-26 21:08:08 UTC | TRACE | WARN | (pkg/tagger/tagger.go:211 in pull) | Get "http://169.254.170.2/v2/metadata": dial tcp 169.254.170.2:80: connect: connection refused                                                                                                                             |
| 1603746488856 | 2020-10-26 21:08:08 UTC | CORE | WARN | (pkg/tagger/tagger.go:211 in pull) | Get "http://169.254.170.2/v2/metadata": dial tcp 169.254.170.2:80: connect: connection refused                                                                                                                              |
| 1603746488875 | 2020-10-26 21:08:08 UTC | CORE | ERROR | (pkg/collector/python/datadog_agent.go:118 in LogMessage) | ecs_fargate:d884b5186b651429 | (ecs_fargate.py:90) | Error fetching Fargate http://169.254.170.2/v2/metadata endpoint                                                                               |
| 1603746488875 | Traceback (most recent call last):                                                                                                                                                                                                                                                                       |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 159, in _new_conn                                                                                                                                                                                           |
| 1603746488875 |     conn = connection.create_connection(                                                                                                                                                                                                                                                                 |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py", line 84, in create_connection                                                                                                                                                                               |
| 1603746488875 |     raise err                                                                                                                                                                                                                                                                                            |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py", line 74, in create_connection                                                                                                                                                                               |
| 1603746488875 |     sock.connect(sa)                                                                                                                                                                                                                                                                                     |
| 1603746488875 | ConnectionRefusedError: [Errno 111] Connection refused                                                                                                                                                                                                                                                   |
| 1603746488875 | During handling of the above exception, another exception occurred:                                                                                                                                                                                                                                      |
| 1603746488875 | Traceback (most recent call last):                                                                                                                                                                                                                                                                       |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen                                                                                                                                                                                         |
| 1603746488875 |     httplib_response = self._make_request(                                                                                                                                                                                                                                                               |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 392, in _make_request                                                                                                                                                                                   |
| 1603746488875 |     conn.request(method, url, **httplib_request_kw)                                                                                                                                                                                                                                                      |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1255, in request                                                                                                                                                                                                                 |
| 1603746488875 |     self._send_request(method, url, body, headers, encode_chunked)                                                                                                                                                                                                                                       |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1301, in _send_request                                                                                                                                                                                                           |
| 1603746488875 |     self.endheaders(body, encode_chunked=encode_chunked)                                                                                                                                                                                                                                                 |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1250, in endheaders                                                                                                                                                                                                              |
| 1603746488875 |     self._send_output(message_body, encode_chunked=encode_chunked)                                                                                                                                                                                                                                       |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1010, in _send_output                                                                                                                                                                                                            |
| 1603746488875 |     self.send(msg)                                                                                                                                                                                                                                                                                       |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 950, in send                                                                                                                                                                                                                     |
| 1603746488875 |     self.connect()                                                                                                                                                                                                                                                                                       |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 187, in connect                                                                                                                                                                                             |
| 1603746488875 |     conn = self._new_conn()                                                                                                                                                                                                                                                                              |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 171, in _new_conn                                                                                                                                                                                           |
| 1603746488875 |     raise NewConnectionError(                                                                                                                                                                                                                                                                            |
| 1603746488875 | urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f97799bac70>: Failed to establish a new connection: [Errno 111] Connection refused                                                                                                                                |
| 1603746488875 | During handling of the above exception, another exception occurred:                                                                                                                                                                                                                                      |
| 1603746488875 | Traceback (most recent call last):                                                                                                                                                                                                                                                                       |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 439, in send                                                                                                                                                                                                 |
| 1603746488875 |     resp = conn.urlopen(                                                                                                                                                                                                                                                                                 |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen                                                                                                                                                                                         |
| 1603746488875 |     retries = retries.increment(                                                                                                                                                                                                                                                                         |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/retry.py", line 446, in increment                                                                                                                                                                                           |
| 1603746488875 |     raise MaxRetryError(_pool, url, error or ResponseError(cause))                                                                                                                                                                                                                                       |
| 1603746488875 | urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='169.254.170.2', port=80): Max retries exceeded with url: /v2/metadata (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f97799bac70>: Failed to establish a new connection: [Errno 111] Connection refused'))    |
| 1603746488875 | During handling of the above exception, another exception occurred:                                                                                                                                                                                                                                      |
| 1603746488875 | Traceback (most recent call last):                                                                                                                                                                                                                                                                       |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/ecs_fargate/ecs_fargate.py", line 79, in check                                                                                                                                                                            |
| 1603746488875 |     request = self.http.get(metadata_endpoint)                                                                                                                                                                                                                                                           |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 283, in get                                                                                                                                                                                     |
| 1603746488875 |     return self._request('get', url, options)                                                                                                                                                                                                                                                            |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 332, in _request                                                                                                                                                                                |
| 1603746488875 |     return getattr(requests, method)(url, **new_options)                                                                                                                                                                                                                                                 |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 75, in get                                                                                                                                                                                                        |
| 1603746488875 |     return request('get', url, params=params, **kwargs)                                                                                                                                                                                                                                                  |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 60, in request                                                                                                                                                                                                    |
| 1603746488875 |     return session.request(method=method, url=url, **kwargs)                                                                                                                                                                                                                                             |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 533, in request                                                                                                                                                                                              |
| 1603746488875 |     resp = self.send(prep, **send_kwargs)                                                                                                                                                                                                                                                                |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 646, in send                                                                                                                                                                                                 |
| 1603746488875 |     r = adapter.send(request, **kwargs)                                                                                                                                                                                                                                                                  |
| 1603746488875 |   File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 516, in send                                                                                                                                                                                                 |
| 1603746488875 |     raise ConnectionError(e, request=request)                                                                                                                                                                                                                                                            |
| 1603746488875 | requests.exceptions.ConnectionError: HTTPConnectionPool(host='169.254.170.2', port=80): Max retries exceeded with url: /v2/metadata (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f97799bac70>: Failed to establish a new connection: [Errno 111] Connection refused')) |
| 1603746488940 | 2020-10-26 21:08:08 UTC | PROCESS | WARN | (pkg/tagger/tagger.go:211 in pull) | Get "http://169.254.170.2/v2/metadata": dial tcp 169.254.170.2:80: connect: connection refused                                                                                                                           |
| 1603746489761 | panic: runtime error: slice bounds out of range [:12] with length 0                                                                                                                                                                                                                                      |
| 1603746489761 | goroutine 220 [running]:                                                                                                                                                                                                                                                                                 |
| 1603746489761 | github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners.(*ECSListener).createService(0xc00079b2c0, 0xc000742250, 0x3, 0xc000836180, 0xc0008700a0, 0x47, 0x0, 0x0, 0x0, 0x0, ...)                                                                                                                    |
| 1603746489761 |  /.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners/ecs.go:207 +0x829                                                                                                                                                                                          |
| 1603746489761 | github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners.(*ECSListener).refreshServices(0xc00079b2c0, 0xc000eb5f00)                                                                                                                                                                                  |
| 1603746489761 |  /.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners/ecs.go:146 +0x871                                                                                                                                                                                          |
| 1603746489761 | github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners.(*ECSListener).Listen.func1(0xc00079b2c0)                                                                                                                                                                                                   |
| 1603746489761 |  /.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners/ecs.go:100 +0x114                                                                                                                                                                                          |
| 1603746489761 | created by github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners.(*ECSListener).Listen                                                                                                                                                                                                            |
| 1603746489761 |  /.omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners/ecs.go:91 +0x5c                                                                                                                                                                                            |
| 1603746489766 | AGENT EXITED WITH CODE 2, SIGNAL 0, KILLING CONTAINER                                                                                                                                                                                                                                                    |
| 1603746489773 | 2020-10-26 21:08:09 UTC | TRACE | INFO | (main.go:23 in handleSignal) | received signal 15 (terminated)                                                                                                                                                                                                  |
| 1603746489773 | 2020-10-26 21:08:09 UTC | TRACE | INFO | (pkg/trace/agent/agent.go:132 in loop) | Exiting...                                                                                                                                                                                                             |
| 1603746489773 | 2020-10-26 21:08:09 UTC | TRACE | INFO | (pkg/trace/stats/concentrator.go:105 in Run) | Exiting concentrator, computing remaining stats                                                                                                                                                                  |
| 1603746489773 | 2020-10-26 21:08:09 UTC | PROCESS | CRITICAL | (pkg/process/util/signal_nowindows.go:21 in HandleSignals) | Caught signal 'terminated'; terminating.                                                                                                                                                     |
| 1603746489773 | 2020-10-26 21:08:09 UTC | PROCESS | INFO | (pkg/forwarder/forwarder.go:280 in Stop) | stopping the Forwarder                                                                                                                                                                                             |
| 1603746489773 | 2020-10-26 21:08:09 UTC | PROCESS | INFO | (pkg/forwarder/domain_forwarder.go:233 in Stop) | domainForwarder stopped                                                                                                                                                                                     |
| 1603746489773 | 2020-10-26 21:08:09 UTC | PROCESS | INFO | (pkg/forwarder/forwarder.go:280 in Stop) | stopping the Forwarder                                                                                                                                                                                             |
| 1603746489773 | trace-agent exited with code 0, disabling                                                                                                                                                                                                                                                                |
| 1603746489775 | process-agent exited with code 0, disabling                                                                                                                                                                                                                                                              |
| 1603746489776 | [cont-finish.d] executing container finish scripts...                                                                                                                                                                                                                                                    |
| 1603746489777 | [cont-finish.d] done.                                                                                                                                                                                                                                                                                    |
| 1603746489777 | [s6-finish] waiting for services.                                                                                                                                                                                                                                                                        |
| 1603746489981 | [s6-finish] sending all processes the TERM signal.                                                                                                                                                                                                                                                       |
| 1603746492986 | [s6-finish] sending all processes the KILL signal and exiting.                                                                                                                                                                                                                                           |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

However, the defined health check still seems to be succeeding since the task is not restarted and the container is still marked as HEALTHY.

          ...
          HealthCheck:
            Command:
              - CMD-SHELL
              - /probe.sh
            Interval: 120
            Retries: 2
            Timeout: 5
          ...

From a short dive into the code, the problem seems to be here: https://github.com/DataDog/datadog-agent/blob/b1324f0c28d6095cfa0cfa072778009af1c36f4b/pkg/autodiscovery/listeners/ecs.go#L206-L208

There seem to be assumptions around c.DockerID that are not met.

Describe what you expected:

At the very least, the health check to fail.

Steps to reproduce the issue:

Unsure. This happens very sporadically, but the root cause seems to be a failed HTTP request to the metadata endpoint.

Additional environment details (Operating System, Cloud provider, etc):

Cloud Provider: AWS Orchestration: ECS Fargate Agent Image: datadog/agent:latest-jmx Environment Variables:

          Environment:
            - Name: ECS_FARGATE
              Value: "true"
            - Name: DD_TAGS
              Value: !Sub "accountid:${AWS::AccountId} env:${ServiceEnvironment} service_name:${ServiceName} service_version:${ImageTag}"
            - Name: DD_APM_ENABLED
              Value: "true"
            - Name: DD_JMXFETCH_ENABLED
              Value: "true"
            - Name: DD_HISTOGRAM_AGGREGATES
              Value: max median avg count sum
            # See: https://github.com/DataDog/datadog-agent/issues/3159
            - Name: DD_DOGSTATSD_TAG_CARDINALITY
              Value: orchestrator

Update (27.10.2020): Provided more comprehensive agent log.

Victormeriqui commented 3 years ago

+1

bitlejac commented 3 years ago

+1

alexmarqs commented 3 years ago

+1

kwuerl commented 3 years ago

+1

aweraw commented 3 years ago

+1

houqp commented 3 years ago

We are running into the same issue. Looks like ECS issue to me, should we add a graceful retry in datadog agent to workaround it?

peter-featherstone commented 3 years ago

Not sure if it's an ECS issue if the agent health check itself is still passing which is what I assume ECS would be using to determine if the container needs restarting? Having the same issue here.

houqp commented 3 years ago

yeah, we should also fail agent health check in case of this kind of unrecoverable error.

ravikashyap1 commented 3 years ago

Is there an update to this bug fix? Seeing the same in ECS Fargate.

lockwobr commented 3 years ago

I am seeing this same issue in AWS ECS Fargate, but seems like an s6-overlay issue. I have this agent as an essential sidecar to my application, (both containers are essential, which if either exits the task should exit) this error happens but it seems task does not exiting overall, which to me make me feel that maybe the s6-overlay is not allowing the container to exit when this error is happening, and thus my application does not exit either. Perhaps that is to be expected? To me that would be unexpected behavior.

Simwar commented 3 years ago

Hey all,

The issue has been fixed with this PR: https://github.com/DataDog/datadog-agent/pull/6938 and is in the version 7.25 of the Datadog Agent.

Thanks!