Missing Metrics - Githubissues

gkramer commented 2 years ago

Hey guys,

Wondering if someone could assist with an issue I'm having with BigGraphite [BG]. It currently receives a large number of metrics, but appears to drop a noticeable proportion of them at random. This was highlighted when looking at metrics from Apache Spark, which show several one-minute gaps per hour.

Infrastructure Setup:

Within EKS (1.20)
internal AWS NLB
Traffic Flow: NLB -> Carbon Container -> {elasticsearch + cassandra}
Carbon: Running inside an upstream Alpine container
PS:
    1 root   0:00 {entrypoint} /bin/sh /entrypoint
   49 root   0:00 runsvdir -P /etc/service
   51 root   0:00 runsv bg-carbon
   52 root   0:03 runsv brubeck
   53 root   0:00 runsv carbon
   54 root   0:00 runsv carbon-aggregator
   55 root   0:03 runsv carbon-relay
   56 root   0:03 runsv collectd
   57 root   0:00 runsv cron
   58 root   0:00 runsv go-carbon
   59 root   0:00 runsv graphite
   60 root   0:00 runsv nginx
   61 root   0:03 runsv redis
   62 root   0:00 runsv statsd
   63 root   0:00 tee -a /var/log/carbon.log
   65 root   0:00 tee -a /var/log/carbon-relay.log
   68 root   0:00 tee -a /var/log/statsd.log
   69 root   0:01 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0
   70 root   0:09 {node} statsd /opt/statsd/config/tcp.js
   71 root   0:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
   76 root   0:00 /usr/sbin/crond -f
   79 nginx  0:00 nginx: worker process
   80 nginx  0:00 nginx: worker process
   81 nginx  0:00 nginx: worker process
   82 nginx  0:00 nginx: worker process
   85 root   0:35 tee -a /var/log/bg-carbon.log
   86 root  45:27 /opt/graphite/bin/python3 /opt/graphite/bin/bg-carbon-cache start --nodaemon --debug
   88 root   0:00 tee -a /var/log/carbon-aggregator.log
  156 root   0:41 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0
  157 root   0:49 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0
  158 root   0:46 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0
  159 root   0:47 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0

I can see traffic coming in to the interface (tcpdump/tcpflow), and can see logs to bg-carbon.log with references to 'cache query', but almost no datapoint logs for spark metrics.
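
One way to narrow down where the points go missing might be to push a hand-made probe metric through the same path and look for it afterwards. A minimal sketch, assuming the carbon listener accepts the standard Graphite plaintext protocol (one `<path> <value> <timestamp>` line per datapoint) over TCP. The host name, port 2003, and the metric path `debug.probe.heartbeat` are placeholders, not values taken from this setup:

```python
import socket
import time


def format_plaintext(path, value, timestamp=None):
    """Build one line of the Graphite plaintext protocol: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_probe(host, port, path, value, timestamp=None, timeout=5.0):
    """Send a single datapoint to a carbon plaintext listener over TCP."""
    line = format_plaintext(path, value, timestamp)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))
    return line


# Hypothetical usage (endpoint and metric name are assumptions):
#   send_probe("carbon.example.internal", 2003, "debug.probe.heartbeat", 1)
```

Sending one probe per minute with a known value makes it easier to tell whether a missing point never reached carbon or was lost after it was received.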

Any assistance in troubleshooting would be greatly appreciated!

geobeau commented 2 years ago

If you look on the Cassandra side:

do you have errors?
do you see a drop in write ops when you notice the drops?

Inside your container, does carbon restart by itself?
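
To check the write-ops question with numbers rather than by eye, the gap timestamps can be lined up against a write-rate series exported from whatever monitors Cassandra. A rough sketch, assuming both inputs are already available as plain Python data. The function name, the `(timestamp, writes_per_second)` tuple layout, and the 50% threshold are illustrative choices, not something BigGraphite or Cassandra provides:

```python
from statistics import median


def write_drops_at_gaps(write_ops, gap_timestamps, window=60, threshold=0.5):
    """Return the gap timestamps where the write rate fell below threshold * median.

    write_ops: list of (unix_ts, writes_per_second) samples (assumed layout).
    gap_timestamps: unix timestamps at which a Graphite series had no datapoint.
    window: seconds around each gap in which to look for write samples.
    """
    if not write_ops:
        return []
    baseline = median(rate for _, rate in write_ops)
    flagged = []
    for gap_ts in gap_timestamps:
        nearby = [rate for ts, rate in write_ops if abs(ts - gap_ts) <= window]
        if nearby and min(nearby) < threshold * baseline:
            flagged.append(gap_ts)
    return flagged
```

If most gaps come back flagged, the loss is probably on the write path to Cassandra; if none do, the points are more likely lost before carbon hands them off.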

gkramer commented 2 years ago

Apologies for the delay in coming back to you!

I've rebuilt the cache container to only run carbon cache. Previously, it was running statsd+carbon+etc, and this was all under supervisord, or similar. The container now runs carbon exclusively.

At first, and under low load, there were no metric drop-outs at all. We were shipping all metrics for spark, and it was bulletproof. As soon as we started shipping more metrics from other services, we began to see drop-outs of 1-2 minutes across multiple metrics. Another interesting observation is that metrics appear to disappear at times - I'm not sure if they are being overwritten by null values? What I can tell you is that metrics are being fed into what is now a dedicated carbon ingress, and being inspected from another graphite endpoint, so whisper data is not a thing.
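
To quantify the drop-outs rather than eyeball them, the JSON returned by graphite-web's render API (`format=json`, where each series carries `datapoints` as `[value, timestamp]` pairs and missing values are `null`) can be scanned for runs of nulls. A minimal sketch; the function name and the sample data are made up for illustration:

```python
def find_gaps(datapoints):
    """Return (start_ts, end_ts, missing_count) for each run of null values.

    datapoints: list of [value, timestamp] pairs as found in the "datapoints"
    field of a Graphite render response requested with format=json.
    """
    gaps = []
    run_start = None
    run_len = 0
    last_ts = None
    for value, ts in datapoints:
        if value is None:
            if run_start is None:
                run_start = ts
            run_len += 1
            last_ts = ts
        elif run_start is not None:
            gaps.append((run_start, last_ts, run_len))
            run_start, run_len = None, 0
    if run_start is not None:  # series ends inside a gap
        gaps.append((run_start, last_ts, run_len))
    return gaps


# Made-up series at one-minute resolution with a two-minute hole.
sample = [[1.0, 0], [2.0, 60], [None, 120], [None, 180], [3.0, 240]]
print(find_gaps(sample))  # [(120, 180, 2)]
```

Running this over the same time range at different moments would also show whether points that were present earlier later turn into nulls. A gap at the very end of the window may simply be data that has not been written yet.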

I've made multiple tweaks to the configs, but I'm at a bit of a loss as to how to eradicate the intermittent data loss.

Any help would be GREATLY appreciated!

TIA!

criteo / biggraphite

Missing Metrics #580