Open neil-coles-m3 opened 5 years ago
Hi @neil-coles-m3
From the logs, it looks like Mesos killed the container.
Are there any logs on the Marathon side to know why it happened?
What was the error with the key {"key": "name","value": "datadog-agent"}?
It is here to set the name of the container.
Hi @Simwar
I was getting the attached error unless I removed the name:
Hi @neil-coles-m3 Apparently, Mesos removed the possibility to specify the name because it introduced an issue. More info here: https://jira.apache.org/jira/browse/MESOS-8497
That does not explain the issue you are having, though. Any chance we could get more details from Mesos itself on why it kills the container? What limits are you setting? Do you see Docker events in the Datadog UI about OOMs? That might be another explanation.
Mesos is killing it because it fails healthcheck:
"message": "Task was killed since health check failed. Reason: StreamTcpException: Tcp command [Connect(172.22.3.147:9001,None,List(),Some(10 seconds),true)] failed because of Connection refused",
"state": "TASK_KILLED",
Hi @neil-coles-m3 It seems our instructions were not updated correctly. The healthcheck on port 9001 was for agent 5. Our doc was updated for agent 6, which no longer runs supervisord in the container (the component that previously listened on port 9001), but the healthcheck part was not updated.
We'll update the docs. In the meantime, could you try replacing the health part of the marathon task with something similar to what we do in k8s for health checking? https://github.com/DataDog/datadog-agent/blob/f4bd71fc62b3a5c74eedd0b127634758a8caa2dd/Dockerfiles/manifests/agent.yaml#L41-L49
You can also remove {"containerPort": 9001,"hostPort": 9001,"servicePort": 10001,"protocol": "tcp","labels": {}} from portMappings
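For anyone scripting this change, here is a minimal Python sketch of dropping the 9001 entry from an app definition that has been loaded as a dict. The function name `drop_port_mapping` is hypothetical, and the sketch assumes `portMappings` sits either directly under `container` or under `container.docker`, which may not match your Marathon version.

```python
import copy


def drop_port_mapping(app, container_port=9001):
    """Return a copy of a Marathon app definition (as a dict) without the
    port mapping whose containerPort equals `container_port`.

    Assumption: portMappings lives under `container` or `container.docker`.
    """
    cleaned = copy.deepcopy(app)  # leave the caller's dict untouched
    container = cleaned.get("container", {})
    for holder in (container, container.get("docker", {})):
        mappings = holder.get("portMappings")
        if mappings is not None:
            holder["portMappings"] = [
                m for m in mappings if m.get("containerPort") != container_port
            ]
    return cleaned
```

The function returns a new dict, so the original definition can be kept around for comparison before resubmitting the task.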
So would it be running on port 5555 now? I guess I need to add a port mapping for that instead.
Yeah, the health port is 5555. Do you know whether the healthcheck in mesos is executed from inside the task network namespace or from the outside? I can't remember off the top of my head, but I can dig a bit if you don't recall either. If it's from the inside, you won't need to declare 5555 in the port mapping.
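One way to find out is to repeat by hand the TCP connect that the failing health check performs, once from the host and once from inside the container. The Python sketch below does that. The helper name `tcp_check` is hypothetical, the 10 second default mirrors the timeout shown in the error message, and whether the agent accepts plain TCP connections on 5555 from outside the container is an assumption to verify, not something confirmed in this thread.

```python
import socket


def tcp_check(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`
    seconds, False on refusal, timeout, or any other socket error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `tcp_check("172.22.3.147", 9001)` should return False on an agent 6 container if nothing listens on 9001 any more, while the result of `tcp_check("172.22.3.147", 5555)` from the host indicates whether the health port is reachable without a port mapping.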
After following instructions here: https://docs.datadoghq.com/integrations/mesos/#marathon
To start with, it failed to deploy because of this key:
{"key": "name","value": "datadog-agent"},
After removing this key, it deployed, but the containers stayed unhealthy and exited after about 10 minutes.
Logs when exiting:
Describe what you expected: The agent to turn healthy and work
Additional environment details (Operating System, Cloud provider, etc): Mesos marathon without DC/OS