DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Datadog marathon task exits after 10 minutes #4020

Open neil-coles-m3 opened 5 years ago

neil-coles-m3 commented 5 years ago

After following instructions here: https://docs.datadoghq.com/integrations/mesos/#marathon

At first it failed to deploy because of the {"key": "name","value": "datadog-agent"} parameter.

After removing this key, it deployed, but the containers stayed unhealthy and exited after about 10 minutes.

Logs when exiting:

2019-08-15 10:42:41 UTC | PROCESS | CRITICAL | (pkg/process/util/signal_nowindows.go:21 in HandleSignals) | Caught signal 'terminated'; terminating.
2019-08-15 10:42:41 UTC | CORE | INFO | (cmd/agent/app/run.go:79 in func2) | Received signal 'terminated', shutting down...
2019-08-15 10:42:41 UTC | CORE | INFO | (pkg/collector/runner/runner.go:151 in Stop) | Runner is shutting down...
2019-08-15 10:42:41 UTC | CORE | INFO | (pkg/collector/python/subprocesses.go:48 in TerminateRunningProcesses) | Canceling all running python subprocesses
2019-08-15 10:42:41 UTC | CORE | INFO | (pkg/forwarder/domain_forwarder.go:185 in Stop) | domainForwarder stopped
2019-08-15 10:42:41 UTC | CORE | INFO | (pkg/logs/logs.go:88 in Stop) | Stopping logs-agent
2019-08-15 10:42:41 UTC | CORE | INFO | (pkg/logs/logs.go:101 in Stop) | logs-agent stopped
2019-08-15 10:42:41 UTC | CORE | INFO | (cmd/agent/app/start.go:314 in StopAgent) | See ya!
AGENT EXITED WITH CODE 0, SIGNAL 0, KILLING CONTAINER
process-agent exited with code 0, disabling

Describe what you expected: The task to become healthy and keep running.

Additional environment details (Operating System, Cloud provider, etc): Mesos marathon without DC/OS

Simwar commented 5 years ago

Hi @neil-coles-m3

From the logs, it looks like Mesos killed the container. Are there any logs on the Marathon side that show why it happened? What was the error with the key {"key": "name","value": "datadog-agent"}? It is there to set the name of the container.

neil-coles-m3 commented 5 years ago

Hi @Simwar

I was getting the attached error unless I removed the name:

Screenshot 2019-08-16 at 18 13 44

Simwar commented 5 years ago

Hi @neil-coles-m3

Apparently, Mesos removed the ability to specify the name because it introduced an issue. More info here: https://jira.apache.org/jira/browse/MESOS-8497

That does not explain the issue you are having, though. Is there any chance we could get more details from Mesos itself on why it kills the container? What limits are you setting? Do you see Docker events about OOMs in the Datadog UI? That could be another explanation.

neil-coles-m3 commented 5 years ago

Mesos is killing it because it fails healthcheck:

"message": "Task was killed since health check failed. Reason: StreamTcpException: Tcp command [Connect(172.22.3.147:9001,None,List(),Some(10 seconds),true)] failed because of Connection refused",
"state": "TASK_KILLED",

hkaj commented 5 years ago

Hi @neil-coles-m3

It seems our instructions were not updated correctly. The healthcheck on port 9001 was for Agent 5. Our doc was updated for Agent 6, which no longer runs supervisord in the container (that's the component that previously listened on port 9001), but the healthcheck part was not updated.

We'll update the docs. In the meantime, could you try replacing the health part of the Marathon task with something similar to what we do in k8s for health checking? https://github.com/DataDog/datadog-agent/blob/f4bd71fc62b3a5c74eedd0b127634758a8caa2dd/Dockerfiles/manifests/agent.yaml#L41-L49

You can also remove {"containerPort": 9001,"hostPort": 9001,"servicePort": 10001,"protocol": "tcp","labels": {}} from portMappings
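
As a rough sketch of what that replacement might look like, the fragment below translates an HTTP liveness probe on port 5555 into a Marathon `healthChecks` entry. It is an untested assumption, not the documented configuration: the `/health` path, the `MESOS_HTTP` protocol, the timing values, and enabling the endpoint through a `DD_HEALTH_PORT` entry in `env` are illustrative choices that should be checked against the linked manifest and the Marathon documentation. These keys would be merged into the existing task definition rather than used on their own.

```json
{
  "env": {
    "DD_HEALTH_PORT": "5555"
  },
  "healthChecks": [
    {
      "protocol": "MESOS_HTTP",
      "path": "/health",
      "port": 5555,
      "gracePeriodSeconds": 60,
      "intervalSeconds": 15,
      "timeoutSeconds": 5,
      "maxConsecutiveFailures": 3
    }
  ]
}
```

A generous `gracePeriodSeconds` gives the Agent time to start before failed checks count against the task.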

neil-coles-m3 commented 5 years ago

So would it be running on port 5555 now? I guess I need to add a port mapping for that instead.

hkaj commented 5 years ago

Yeah, the health port is 5555. Do you know whether the healthcheck in Mesos is executed from inside the task's network namespace or from the outside? I can't remember off the top of my head, but I can dig a bit if you don't recall either. If it's from the inside, you won't need to declare 5555 in the port mapping.
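
For the case where the check does come from outside the container, a port mapping for 5555 could replace the old 9001 entry. The fragment below is a hedged sketch only: it mirrors the shape of the 9001 mapping quoted earlier in the thread, and the `hostPort` and `servicePort` values are placeholders chosen for illustration. The `StreamTcpException` in the failure message suggests the original check was a Marathon-level TCP check, which connects over the network to the host and port, so a mapping like this would likely be needed if that check type is kept.

```json
{
  "portMappings": [
    {
      "containerPort": 5555,
      "hostPort": 5555,
      "servicePort": 10001,
      "protocol": "tcp",
      "labels": {}
    }
  ]
}
```

Where this array sits in the app definition depends on the Marathon version, so it should go wherever the existing 9001 mapping was declared.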