Open petedmarsh opened 7 years ago
Hey @petedmarsh! Thanks for the bug report!
I'm not sure what we should do offhand. The problem with a SIGTERM is that it could indicate a reasonable exit scenario in which nothing wrong happened. For instance, if I run it on my command line and press Ctrl-C, it should exit without error. However, in this case, I can see it being problematic. We're working hard to make the agent more resilient, so this is definitely something to look at.
Can you open up a case with support so that you can send us some more details about your environment? I understand if you don't want to share them here.
You're right though, it should restart and it's definitely a bug if it doesn't! Thanks a lot for your bug report!
What about setting autorestart=true in the supervisor config for each process? The processes would then always restart regardless of exit codes - if you wanted to disable a process then you could use supervisor to turn it off.
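To make the trade-off concrete, here is a rough Python sketch of how supervisord's documented `autorestart` values (`false`, `unexpected`, `true`) decide whether an exited process comes back. The function name `would_restart` is made up for illustration, the default `exitcodes` of `0,2` is the one cited in the issue report, and this only approximates the documented behaviour for a process that exits while running. It is not supervisord's actual code.

```python
from typing import Iterable


def would_restart(autorestart: str, exit_code: int,
                  exitcodes: Iterable[int] = (0, 2)) -> bool:
    """Approximate supervisord's restart decision for an exited process.

    Assumption: the process was in the RUNNING state and was not stopped
    by the operator through supervisorctl.
    """
    if autorestart == "true":
        # Restart unconditionally, whatever the exit code was.
        return True
    if autorestart == "false":
        # Never restart automatically.
        return False
    if autorestart == "unexpected":
        # Restart only if the exit code is not one of the expected codes.
        return exit_code not in set(exitcodes)
    raise ValueError(f"unknown autorestart value: {autorestart!r}")
```

With the defaults, `would_restart("unexpected", 0)` is `False`, which matches the behaviour in the report, while `would_restart("true", 0)` is `True`.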
(Sent by email in reply to Greg Meyer's comment of 19 Aug 2017, 8:54 pm.)
We don't always want it to restart. Sometimes it should fail to start: for example, if you have a bad config file, it shouldn't keep trying to restart.
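One way to reconcile "don't restart on a bad config" with "do restart after being killed" is to give the two cases different exit codes, so a supervisor that restarts only on unexpected codes can tell them apart. The sketch below is only an illustration: the constants, the `exit_code_for` helper, and the choice of `2` for a bad config are all hypothetical and not taken from the agent's source.

```python
import signal

# Hypothetical exit codes, chosen for illustration only.
EXIT_OK = 0                           # clean, intentional shutdown
EXIT_BAD_CONFIG = 2                   # retrying cannot fix a bad config file
EXIT_SIGTERM = 128 + signal.SIGTERM   # shell convention for "killed by SIGTERM"

# Codes a supervisor would be told to treat as expected (no restart).
EXPECTED_EXIT_CODES = {EXIT_OK, EXIT_BAD_CONFIG}


def exit_code_for(config_valid: bool, terminated_by_signal: bool) -> int:
    """Pick an exit code that tells the supervisor why the process stopped."""
    if not config_valid:
        return EXIT_BAD_CONFIG
    if terminated_by_signal:
        return EXIT_SIGTERM
    return EXIT_OK
```

Here `exit_code_for(False, False)` falls inside `EXPECTED_EXIT_CODES`, so no restart would be attempted, while `exit_code_for(True, True)` falls outside it.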
We'll keep you updated. This is an important issue and we'll work on getting it resolved. Resiliency is very important to us! 😄
Is there any update on this topic for Agent v6?
I'm curious about updates as well. We have a datadog container running in ECS Fargate that does not restart after being killed due to OOM errors. Has any progress been made on this?
I've pieced the following together as best I could but I'm not particularly knowledgeable about system operation/management so please forgive me if I've made a mistake :)
Last night the datadogstatsd and forwarder processes on one of my machines were terminated and did not restart. That machine hit ~100% memory usage overnight (we had a bunch of other problems due to that too).
Looking at the supervisor config for these processes I noticed that `autorestart` and `exitcodes` are not explicitly defined:
This means `autorestart` will default to `unexpected`, with `exitcodes` defaulting to `0,2` (http://supervisord.org/configuration.html).
Looking at the logs for the forwarder process I can see this:
And looking at the agent code, `SIGTERM` is handled like so:
As I understand it, this will cause the process to quit with exit code 0, as no other code is specified, rather than `128 + SIGTERM`. As the exit code is 0, supervisord doesn't consider it an unexpected shutdown and so does not restart the process.
As I said, I'm not super knowledgeable about these things. If the above is true, should the process exit with `128 + SIGTERM` as the exit code? And if I'm wrong, would it be reasonable to add `autorestart=true` to the supervisor config for these processes? As far as I can tell you always want your datadog processes to restart automatically unless you explicitly kill them.
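The exit-code reasoning above can be checked with a small experiment. This is a minimal sketch, assuming a POSIX system and a handler that simply calls `sys.exit`. The child script and the `exit_code_after_sigterm` helper are hypothetical stand-ins, not the agent's real handler. It starts a child process, waits until its `SIGTERM` handler is installed, sends `SIGTERM`, and reports the exit code the parent sees.

```python
import signal
import subprocess
import sys
import textwrap

# Hypothetical child: installs a SIGTERM handler, then idles until signalled.
CHILD_TEMPLATE = textwrap.dedent("""
    import signal, sys, time

    def on_sigterm(signum, frame):
        sys.exit({exit_arg})

    signal.signal(signal.SIGTERM, on_sigterm)
    print("ready", flush=True)
    while True:
        time.sleep(0.1)
""")


def exit_code_after_sigterm(exit_arg: str) -> int:
    """Run the child, send it SIGTERM, and return its exit code.

    exit_arg is pasted into the handler's sys.exit(...) call, so "" means
    "exit without specifying a code".
    """
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_TEMPLATE.format(exit_arg=exit_arg)],
        stdout=subprocess.PIPE,
    )
    child.stdout.readline()  # block until the handler is installed
    child.send_signal(signal.SIGTERM)
    code = child.wait(timeout=10)
    child.stdout.close()
    return code
```

On Linux, where `SIGTERM` is 15, `exit_code_after_sigterm("")` should return 0 and `exit_code_after_sigterm("128 + signum")` should return 143. Only the second value falls outside the default `exitcodes`, so only that one would count as unexpected under `autorestart=unexpected`.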