Missing counter data - Githubissues

heyman commented 9 years ago

Hi!

I've been using this image in production for a couple of weeks now, gathering various metrics on the health of my system. I'm sending data to StatsD over UDP, and I have a Grafana dashboard where I view the data. It's been working great with no issues until yesterday, when I suddenly started to get "gaps" in my counter metrics.

Here are a couple of screenshots that show graphs of a few different metrics over the same timespan:

In the first two graphs, whose metrics are counters, there are two large gaps. However, the third and fourth graphs, whose metrics are a timer and a gauge, do not have any gaps.

Does anyone have any idea of what might be going on?

What I've checked so far:

Disk IO is low (An answer to http://serverfault.com/questions/533198/graphite-stops-collecting-data-randomly seems to suggest that one could see similar symptoms if an IOPS bottleneck is encountered)
Since it's still collecting timer and gauge data, it seems unlikely to be a network problem.
The following script reported no whisper file corruptions: https://gist.github.com/gonsfx/4111791

SamSaffron commented 9 years ago

I think gauges and timers fill in gaps, so this still looks very much like comms are going down somehow between the container and the reporter. Maybe try running a simulation script locally to see if it works during these outages?
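
A simulation script along those lines could look like the sketch below. It assumes StatsD is listening on 127.0.0.1 at the conventional UDP port 8125 and uses the plain StatsD line format (`name:value|type`). The metric names (`sim.requests`, `sim.latency`, `sim.queue_depth`) and the helper names are made up for illustration and are not part of the image.

```python
import socket
import time


def format_metric(name, value, metric_type):
    """Build one StatsD line: <name>:<value>|<type> (c, ms or g)."""
    return f"{name}:{value}|{metric_type}"


def build_batch(tick):
    """One counter, one timer and one gauge, so all three types can be compared."""
    return [
        format_metric("sim.requests", 1, "c"),              # hypothetical counter
        format_metric("sim.latency", 100 + tick, "ms"),     # hypothetical timer
        format_metric("sim.queue_depth", tick % 10, "g"),   # hypothetical gauge
    ]


def send_metrics(lines, host="127.0.0.1", port=8125):
    """Send each line as its own UDP datagram and return how many were sent."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()
    return len(lines)


def run_simulation(ticks=60, interval=1.0, host="127.0.0.1", port=8125):
    """Send one batch per interval; host and port are assumptions to adjust."""
    for tick in range(ticks):
        send_metrics(build_batch(tick), host, port)
        time.sleep(interval)
```

Calling `run_simulation()` while an outage is visible in Grafana, and then checking whether `sim.requests` shows the same gaps as the production counters, would help separate a reporter-side problem from one inside the container.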

heyman commented 9 years ago

Whenever I do a deploy I increment a counter, and I've set up Grafana to display annotations in the graphs according to this counter. These deploy annotations get properly reported in the graphs during the outages (see the blue vertical line in screenshots 3 & 4 below), so it seems unlikely to be a communications error between the reporter and the container.

However, I've noticed another weird thing, which is that during the outages, I can see data for the last 10 second period (it's only noticeable if I select a time range that makes the X-axis steps small enough). Here are two screenshots, taken ~2 minutes apart, that show this (the red vertical line is at the same point in time in both graphs).

And here are two screenshots that show the same thing for the stats.statsd.metrics_received graph.

So it seems that somehow the data for the last reported time period is retrievable, but the data isn't persisted once another report comes in?

heyman commented 9 years ago

Hi again!

I believe I might have found what was causing the issue. I found stray processes from an old container that hadn't been properly killed (due to the following docker issue: https://github.com/docker/docker/issues/12738).
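
For anyone wanting to check for the same thing, here is a rough sketch that scans `ps` output on the Docker host for processes whose command line mentions statsd, carbon or graphite. The keyword list and function names are assumptions for illustration. It cannot tell a stray process from one belonging to the current container, so the PIDs still need to be compared by hand against the running container.

```python
import subprocess

# Assumed keywords; adjust to whatever the image actually runs.
KEYWORDS = ("statsd", "carbon", "graphite")


def find_suspects(ps_output, keywords=KEYWORDS):
    """Return (pid, command line) pairs whose command line contains a keyword."""
    suspects = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        # Skip the header row and anything that is not "<pid> <args>".
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, args = int(parts[0]), parts[1]
        if any(keyword in args.lower() for keyword in keywords):
            suspects.append((pid, args))
    return suspects


def list_processes():
    """Capture the PID and full command line of every process on the host."""
    result = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Running `find_suspects(list_processes())` on the host prints every candidate, including the processes of the container that is supposed to be running.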

My current theory is that one of the old statsd/graphite/carbon processes was sometimes overwriting the last reported metrics with 0 values.
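
The theory above can be illustrated with a toy model. This is not Graphite or whisper code, only a sketch that assumes a store keeping one value per 10 second slot, where a later write to the same slot replaces the earlier one. The class name and the timestamps are invented for the example.

```python
def slot_for(timestamp, step=10):
    """Round a timestamp down to the start of its storage slot."""
    return timestamp - (timestamp % step)


class LastWriteWinsStore:
    """Toy time series store: one value per slot, the latest write wins."""

    def __init__(self, step=10):
        self.step = step
        self.slots = {}

    def write(self, timestamp, value):
        self.slots[slot_for(timestamp, self.step)] = value

    def read(self, timestamp):
        return self.slots.get(slot_for(timestamp, self.step))


store = LastWriteWinsStore()

# The live process flushes a real counter value for the slot starting at 1000.
store.write(1000, 42)
before = store.read(1005)

# A stray process that never received the increments flushes 0 into the same slot.
store.write(1003, 0)
after = store.read(1005)

print(before, after)
```

In this model the real value is readable right after the live flush and is gone once the second writer reports, which is at least consistent with the latest period being visible only briefly. Whether this is what actually happened here is not confirmed.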

I've just killed the stray processes, and so far it looks good. So, at the risk of counting my chickens before they hatch, I'm closing this issue. Sorry for wasting your time with an unrelated issue. Hopefully this can help someone else who finds this page through Google.

SamSaffron commented 9 years ago

no worries at all, glad you found the issue.

On Thu, May 7, 2015 at 2:24 AM, Jonatan Heyman notifications@github.com wrote:

Closed #4 https://github.com/SamSaffron/graphite_docker/issues/4.

