SamSaffron / graphite_docker

docker container for graphite

Missing counter data #4

Closed heyman closed 9 years ago

heyman commented 9 years ago

Hi!

I've been using this image in production for a couple of weeks now, gathering various metrics on the health of my system. I'm sending data to StatsD over UDP, and I have a Grafana dashboard where I view the data. It's been working great with no issues until yesterday, when I suddenly started to get "gaps" in my counter metrics.

Here are a couple of screenshots that show graphs of a few different metrics over the same timespan:

image

image

image

image

In the first two graphs, whose metrics are counters, there are two large gaps. However, the third and fourth graphs, whose metrics are a timer and a gauge, do not have any gaps.

Does anyone have any idea of what might be going on?

What I've checked so far:

SamSaffron commented 9 years ago

I think gauges and timers fill in gaps, so this very much looks like comms are going down somehow between the container and the reporter ... maybe try running a simulation script locally to see if it works during these outages?
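A simulation script along those lines could be as small as the sketch below. It assumes StatsD is listening on UDP at a host and port you pass in (8125 is the usual default), and the metric name used in the usage note is made up for illustration. The only protocol detail it relies on is the StatsD counter line format, `<name>:<value>|c`.

```python
import socket
import time


def format_counter(name: str, value: int = 1) -> bytes:
    """Encode a counter increment in the StatsD line format: <name>:<value>|c"""
    return f"{name}:{value}|c".encode("ascii")


def send_counters(host: str, port: int, name: str, count: int, interval: float) -> int:
    """Send `count` counter increments over UDP, one every `interval` seconds.

    Returns the number of datagrams handed to the OS. UDP is fire-and-forget,
    so a successful send does not prove that StatsD received the packet.
    """
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for _ in range(count):
            sock.sendto(format_counter(name), (host, port))
            sent += 1
            time.sleep(interval)
    return sent
```

For example, `send_counters("127.0.0.1", 8125, "simulation.test_counter", count=600, interval=1.0)` sends one increment per second for ten minutes. If the resulting graph for that counter stays continuous while the production counters show gaps, the problem is probably not in the transport.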

heyman commented 9 years ago

Whenever I do a deploy I increment a counter, and I've set up Grafana to display annotations in the graphs based on this counter. These deploy annotations get properly reported in the graphs during the outages (see the blue vertical line in screenshots 3 & 4 below), so it seems unlikely to be a communications error between the reporter and the container.

However, I've noticed another weird thing: during the outages, I can see data for the last 10-second period (it's only noticeable if I select a time range that makes the X-axis steps small enough). Here are two screenshots, taken ~2 minutes apart, that show this (the red vertical line is at the same point in time in both graphs).

image

image

And here are two screenshots that show the same thing for the stats.statsd.metrics_received graph.

image

image

So it seems that somehow the data for the last reported time period is retrievable, but the data isn't persisted once another report comes in?
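One way to confirm this without comparing screenshots is to fetch the same series twice from Graphite's render API, a couple of minutes apart, and list the timestamps whose values disappeared in between. The sketch below assumes Graphite-web is reachable at a base URL you supply and that the JSON output has the usual shape, a list of series each with `datapoints` as `[value, timestamp]` pairs. The helper names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request


def build_render_url(base_url: str, target: str, window: str = "-10min") -> str:
    """Build a render API URL that returns the target as JSON."""
    query = urllib.parse.urlencode({"target": target, "from": window, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


def fetch_datapoints(base_url: str, target: str, window: str = "-10min") -> list:
    """Return the [value, timestamp] pairs of the first matching series."""
    url = build_render_url(base_url, target, window)
    with urllib.request.urlopen(url, timeout=10) as resp:
        series = json.load(resp)
    return series[0]["datapoints"] if series else []


def vanished_points(earlier: list, later: list) -> list:
    """Timestamps with a non-zero value in `earlier` that are null or 0 in `later`."""
    later_by_ts = {ts: value for value, ts in later}
    return [
        ts
        for value, ts in earlier
        if value and ts in later_by_ts and not later_by_ts[ts]
    ]
```

Calling `fetch_datapoints(base_url, "stats.statsd.metrics_received")` twice with a pause in between and passing both results to `vanished_points` should return a non-empty list during an outage if values really are being replaced after they were first stored.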

heyman commented 9 years ago

Hi again!

I believe I might have found what was causing the issue. I found stray processes from an old container that hadn't been properly killed (due to the following docker issue: https://github.com/docker/docker/issues/12738).

My current theory is that one of the old statsd/graphite/carbon processes was sometimes overwriting the last reported metrics with 0 values.
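This theory can be illustrated with a toy model. It is not taken from the actual setup: it assumes the storage keeps a single value per time slot with the last write winning, and that a stale StatsD instance receiving no traffic still flushes 0 for counters it already knows about. All numbers and names below are invented.

```python
def replay_writes(writes):
    """Apply (timestamp, value, writer) writes in order; last write per slot wins."""
    slots = {}
    for timestamp, value, _writer in writes:
        slots[timestamp] = value
    return slots


# The live instance flushes a real count for each 10-second slot; the stale
# instance flushes 0 for the same slot shortly afterwards (assumed ordering).
writes = [
    (100, 7, "live"),
    (100, 0, "stale"),
    (110, 4, "live"),
    (110, 0, "stale"),
    (120, 9, "live"),  # newest slot: the stale flush has not arrived yet
]
stored = replay_writes(writes)
```

Under these assumptions only the newest slot still holds its real value, which would match the observation that data is visible for the last 10-second period and then disappears.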

I've just killed the stray processes, and so far it looks good. So at the risk of counting my chickens before they hatch, I'm closing this issue. Sorry for wasting your time with an unrelated issue. Hopefully this can help someone else who finds this page through Google.

SamSaffron commented 9 years ago

no worries at all, glad you found the issue.

On Thu, May 7, 2015 at 2:24 AM, Jonatan Heyman notifications@github.com wrote:

Closed #4 https://github.com/SamSaffron/graphite_docker/issues/4.
