DataDog / dd-agent

Datadog Agent Version 5
https://docs.datadoghq.com/

Agent/datadogstatsd doesn't restart after being killed due to OOM #3471

Open petedmarsh opened 7 years ago

petedmarsh commented 7 years ago

I've pieced the following together as best I could, but I'm not particularly knowledgeable about system operation/management, so please forgive me if I've made a mistake :)

Last night the datadogstatsd and forwarder processes on one of my machines were terminated and did not restart. That machine hit ~100% memory usage overnight (we had a bunch of other problems due to that too).

Looking at the supervisor config for these processes I noticed that autorestart and exitcodes are not explicitly defined:

[program:dogstatsd]
command=/opt/datadog-agent/embedded/bin/python /opt/datadog-agent/agent/dogstatsd.py --use-local-forwarder
stdout_logfile=NONE
stderr_logfile=NONE
startsecs=5
startretries=3
priority=998
user=dd-agent

This means autorestart defaults to unexpected and exitcodes defaults to 0,2 (http://supervisord.org/configuration.html).
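To make that rule concrete, here is a minimal sketch of the restart decision as the supervisor documentation describes it. The function name `should_restart` is hypothetical and this is a simplified model, not supervisord's actual implementation; it assumes the process had already reached the RUNNING state and uses the 0,2 default mentioned above.

```python
def should_restart(autorestart, exit_code, exitcodes=(0, 2)):
    """Simplified model of the documented autorestart rule (hypothetical helper).

    autorestart: "true", "false", or "unexpected" (the default).
    exit_code:   the exit code the process terminated with.
    exitcodes:   the codes treated as "expected" exits.
    """
    if autorestart == "true":
        # Restart unconditionally, whatever the exit code was.
        return True
    if autorestart == "false":
        # Never restart automatically.
        return False
    if autorestart == "unexpected":
        # Restart only when the exit code is not in the expected list.
        return exit_code not in exitcodes
    raise ValueError("unknown autorestart value: %r" % (autorestart,))


if __name__ == "__main__":
    # With the defaults, a clean exit (code 0) is "expected": no restart.
    print(should_restart("unexpected", 0))
    # An exit code outside 0,2 counts as unexpected: restart.
    print(should_restart("unexpected", 143))
    # autorestart=true ignores the exit code.
    print(should_restart("true", 0))
```

Under this model, a process that exits with code 0 is only restarted when autorestart is set to true.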

Looking at the logs for the forwarder process I can see this:

2017-08-08 23:40:37 UTC | INFO | dd.forwarder | forwarder(ddagent.py:571) | caught sigterm. stopping
2017-08-08 23:40:37 UTC | INFO | dd.forwarder | forwarder(ddagent.py:553) | Stopped

And looking at the agent code SIGTERM is handled like so:

    # https://github.com/DataDog/dd-agent/blob/master/ddagent.py#L592
    def sigterm_handler(signum, frame):
        log.info("caught sigterm. stopping")
        app.stop()

    # which calls

    # https://github.com/DataDog/dd-agent/blob/master/ddagent.py#L577
    def stop(self):
        self.mloop.stop()

As I understand it, this causes the process to exit with code 0, since no other code is specified, rather than 128 + SIGTERM. Because the exit code is 0, supervisord doesn't consider it an unexpected shutdown and so does not restart the process.
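The exit-code reasoning above can be checked with a small experiment. This is a sketch assuming a POSIX system and Python 3; the three child scripts are hypothetical stand-ins that mimic the handler pattern, not the agent's actual code. One child catches SIGTERM and lets its loop finish, one installs no handler, and one exits explicitly with 128 + the signal number.

```python
import signal
import subprocess
import sys

# Child 1: catches SIGTERM, stops its loop, and falls off the end of the script.
HANDLED = """
import signal, time
running = True
def sigterm_handler(signum, frame):
    global running
    running = False
signal.signal(signal.SIGTERM, sigterm_handler)
print("ready", flush=True)
while running:
    time.sleep(0.05)
"""

# Child 2: no handler installed, so the default SIGTERM action applies.
UNHANDLED = """
import time
print("ready", flush=True)
while True:
    time.sleep(0.05)
"""

# Child 3: catches SIGTERM but exits with 128 + signal number.
EXPLICIT = """
import signal, sys, time
def sigterm_handler(signum, frame):
    sys.exit(128 + signum)
signal.signal(signal.SIGTERM, sigterm_handler)
print("ready", flush=True)
while True:
    time.sleep(0.05)
"""


def exit_status_after_sigterm(child_source):
    """Start a child, wait until it reports ready, send SIGTERM, return its status."""
    proc = subprocess.Popen(
        [sys.executable, "-c", child_source], stdout=subprocess.PIPE
    )
    proc.stdout.readline()  # block until the child has printed "ready"
    proc.send_signal(signal.SIGTERM)
    status = proc.wait(timeout=10)
    proc.stdout.close()
    return status


if __name__ == "__main__":
    print("handled:  ", exit_status_after_sigterm(HANDLED))
    print("unhandled:", exit_status_after_sigterm(UNHANDLED))
    print("explicit: ", exit_status_after_sigterm(EXPLICIT))
```

The handled child reports 0, the explicit child reports 128 + SIGTERM, and the unhandled child is reported by `subprocess` as a negative signal number because it was terminated by the signal rather than exiting on its own.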

As I said, I'm not super knowledgeable about these things. If the above is true, should the process exit with 128 + SIGTERM as the exit code? And if I'm wrong, would it be reasonable to add autorestart=true to the supervisor config for these processes? As far as I can tell, you always want your datadog processes to restart automatically unless you explicitly kill them.

gmmeyer commented 7 years ago

Hey @petedmarsh! Thanks for the bug report!

I'm not sure what we should do offhand. The problem with a sigterm is that it could indicate a reasonable exit scenario in which nothing wrong happened. For instance, if I run it on my command line and ctrl-c, it should exit without error. However, in this case, I can see it being problematic. We're working hard to make the agent more resilient, so this is definitely something to look at.

Can you open up a case with support so that you can send us some more details about your environment? I understand if you don't want to share them here.

You're right though, it should restart and it's definitely a bug if it doesn't! Thanks a lot for your bug report!

petedmarsh commented 7 years ago

What about setting autorestart=true in the supervisor config for each process? The processes would then always restart regardless of exit codes - if you wanted to disable a process then you could use supervisor to turn it off.

gmmeyer commented 7 years ago

We don't always want it to restart. Sometimes it should fail to start: for example, if you have a bad config file, supervisor shouldn't keep trying to restart it.

gmmeyer commented 7 years ago

We'll keep you updated. This is an important issue, and we'll work on getting it resolved. Resiliency is very important to us! 😄

abeluck commented 6 years ago

Is there any update on this topic for Agent v6?

rpdelaney commented 3 years ago

I'm curious about updates as well. We have a datadog container running in ECS Fargate that does not restart after being killed due to OOM errors. Has any progress been made on this?