criteo / biggraphite

Simple Scalable Time Series Database
Apache License 2.0

Missing Metrics #580

Open gkramer opened 2 years ago

gkramer commented 2 years ago

Hey guys,

Wondering if someone could assist with an issue I'm having with BigGraphite [BG]. It currently receives a large number of metrics, but appears to drop a noticeable proportion of them at random. This was highlighted when looking at metrics from Apache Spark, which show frequent one-minute gaps every hour.

Infrastructure Setup:

I can see traffic coming in on the interface (tcpdump/tcpflow), and can see entries in bg-carbon.log that reference 'cache query', but almost no datapoint logs for Spark metrics.
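One way to narrow down where points go missing is to send a numbered probe series and later check which sequence numbers were stored. The following is a minimal sketch, not taken from the BigGraphite docs: it assumes the carbon instance accepts the standard plaintext protocol (`<path> <value> <timestamp>`), and the host name, metric path, and one-minute interval are hypothetical placeholders.

```python
import socket
import time


def format_line(path: str, value: float, timestamp: int) -> str:
    """Build one line of carbon's plaintext protocol: '<path> <value> <timestamp>'."""
    return f"{path} {value} {timestamp}\n"


def send_probe(host: str, port: int, path: str, count: int = 10,
               interval: float = 60.0) -> list:
    """Send `count` numbered datapoints and return the timestamps that were used.

    `interval` defaults to 60 seconds on the assumption that the metric is
    stored at one-minute resolution; adjust it to the actual retention.
    """
    sent = []
    with socket.create_connection((host, port), timeout=5) as sock:
        for i in range(count):
            ts = int(time.time())
            # The value is the sequence number, so a missing point is easy to spot.
            sock.sendall(format_line(path, i, ts).encode("ascii"))
            sent.append(ts)
            time.sleep(interval)
    return sent
```

For example, `send_probe("carbon.example.internal", 2003, "test.probe.sequence")` would send ten points, one per minute, to a hypothetical host on carbon's default plaintext port. Any sequence number that never shows up when the series is read back was lost somewhere between the socket and the storage backend.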

Any assistance in troubleshooting would be greatly appreciated!

geobeau commented 2 years ago

If you look on the Cassandra side:

Inside your container, does carbon restart by itself?

gkramer commented 2 years ago

Apologies for the delay in coming back to you!

I've rebuilt the cache container to run only carbon cache. Previously, it was running statsd+carbon+etc, all under supervisord or similar. The container now runs carbon exclusively.

At first, and under low load, there were no metric drop-outs at all. We were shipping all metrics for Spark, and it was bulletproof. As soon as we started shipping more metrics from other services, we began to see drop-outs of 1-2 minutes across multiple metrics. Another interesting observation is that metrics appear to disappear at times. I'm not sure if they are being overwritten by null values? What I can tell you is that metrics are being fed into what is now a dedicated carbon ingress, and being inspected from another graphite endpoint, so whisper data is not a factor.
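Drop-outs like the ones described above can be quantified rather than eyeballed on a graph. This is a minimal sketch, assuming the series is read through graphite-web's render API with `format=json`, where each series carries `datapoints` as `[value, timestamp]` pairs and missing points come back as null. The helper names `fetch_datapoints` and `find_gaps`, the base URL, and the target are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def fetch_datapoints(base_url: str, target: str, since: str = "-1h") -> list:
    """Fetch [value, timestamp] pairs for the first series matching `target`."""
    query = urllib.parse.urlencode({"target": target, "from": since, "format": "json"})
    with urllib.request.urlopen(f"{base_url}/render?{query}", timeout=10) as resp:
        series = json.load(resp)
    return series[0]["datapoints"] if series else []


def find_gaps(datapoints: list) -> list:
    """Return (first_ts, last_ts, missing_count) for each run of null values."""
    gaps = []
    run_start = None
    run_len = 0
    last_ts = None
    for value, ts in datapoints:
        if value is None:
            if run_start is None:
                run_start = ts
            run_len += 1
            last_ts = ts
        elif run_start is not None:
            # A real value closes the current run of nulls.
            gaps.append((run_start, last_ts, run_len))
            run_start, run_len = None, 0
    if run_start is not None:
        # The series ended while still inside a gap.
        gaps.append((run_start, last_ts, run_len))
    return gaps
```

Running `find_gaps(fetch_datapoints("http://graphite.example.internal", "some.spark.metric"))` periodically and keeping the output would show whether a gap that exists now was also present earlier, which would help tell points that were never written apart from points that later disappeared. A null at the very end of the window may just be the current interval not having been flushed yet.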

I've made multiple tweaks to the configs, but I'm at a bit of a loss as to how to eradicate the intermittent data loss.

Any help would be GREATLY appreciated!

TIA!