DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Data loss on stop #1547

Open allenluce opened 6 years ago

allenluce commented 6 years ago

Up to the last 15 seconds of aggregated data is lost when shutting down the statsd server (with statsd.Stop()). Even with the aggregator's flush interval set quite low, the last few seconds of data never get pushed to the backend.

Is there a recommended way to flush data to prevent this from happening?

truthbk commented 6 years ago

@allenluce you're 100% right. It looks like statsd.Stop() doesn't flush whatever the aggregator contains at that point, so the process shuts down without emptying those packets.

I don't believe we have a way around this at the moment. The flushes happen periodically, as you already know, so depending on when during the flush interval you request the Stop(), you might lose anywhere from 1s to 15s of data. We'd have to add some flush logic to the shutdown code. That would make the process teardown a little slower, but it does seem like the right thing to do. There are still things that can go wrong at the forwarder level, so we'd have to make this a best-effort thing.

We'll look into it. Thank you for bringing this up.

visciang commented 6 years ago

This issue is very annoying if you run the agent in a "side container" alongside an AWS Fargate Task (a short-lived "docker run"). When the main task ends, the agent container is stopped and it doesn't flush metrics / events / APM / etc.

The "side car" pattern only works for AWS Fargate Services (long-lived tasks).

As a workaround, we currently deploy a set of agents as an AWS Fargate Service, which Tasks use to report Datadog metrics.

baxang commented 3 years ago

Seems like https://github.com/DataDog/datadog-agent/pull/4129 addressed this issue.

sgnn7 commented 3 years ago

@allenluce / @visciang Can you try out the new version of the agent to see if this issue is resolved now?

miketheman commented 2 years ago

Seems similar to #3940