TCollector terminates collectors for "inactivity" even though they output metrics when run on the command line as the same user (HadoopHttp class should flush stdout after each emit so a few/single metrics aren't held in the buffer for subclassed programs) #398
Hit a maddening issue where collectors I wrote that used to work seemed to stop sending metrics to OpenTSDB, with TCollector complaining of no activity and killing them every 10 mins, even though testing them on the command line as the same user showed them outputting metrics.
I had subclassed HadoopHttp to get the G1GC duration young + old gen metrics I need for HBase cluster tuning feedback (a quick workaround to #393) and ended up with a maddening situation: the collectors worked initially for a few days over the weekend, then stopped working after a second rolling restart on Tuesday, then worked intermittently and only on a subset of hosts, even though all the MD5s and everything else lined up (they were deployed from Git via Ansible, so all hosts had identical deployments).
It turns out this was because the subclassed collectors emitted too few metrics, which stayed in the stdout buffer and never got flushed. The reason it worked initially but not after I started applying a couple of improvements to the HBase cluster via rolling restarts is that there were fewer GCs: the old gen GC stat came back null, so the collector's output shrank to young gen only, which was no longer enough to fill the buffer and make it spill.
The fix is to add
sys.stdout.flush()
after each emit. After I made this change to my collectors everything started working again.
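A minimal sketch of what this looks like in a subclassed collector (the helper name `emit_metric` and the metric names are mine for illustration; the line format is tcollector's standard `metric timestamp value tags` protocol):

```python
import sys
import time


def emit_metric(metric, value, tags=""):
    """Print one data point in tcollector's line format, then flush
    immediately so TCollector sees activity even when the collector
    emits too few lines to ever fill the stdout buffer on its own."""
    line = ("%s %d %s %s" % (metric, int(time.time()), value, tags)).rstrip()
    print(line)
    sys.stdout.flush()  # the fix: without this, sparse output sits in the buffer


# hypothetical usage in a G1GC collector
emit_metric("hbase.regionserver.gc.g1.young.duration", 12.5, "host=hbase01")
```

Without the flush, stdout is block-buffered when it is a pipe (as it is under TCollector), so a collector emitting one or two lines per interval can go the whole 10-minute window with nothing delivered.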
This would be best done in the HadoopHttp library itself, so it doesn't catch out any other subclassed programs.
Also, we should probably add a utility function emit() in collectors/lib/utils.py which implicitly flushes stdout, and encourage all collectors to use it, just in case any given collector doesn't emit enough metrics to make the buffer spill within the 10 mins before TCollector decides to kill and restart it.
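A possible shape for that shared helper (this signature is a suggestion, not an existing utils.py API):

```python
import sys
import time


def emit(metric, value, tags="", timestamp=None):
    """Write one metric line to stdout and flush it immediately.

    Flushing per line means even collectors that produce only a
    handful of metrics per interval are always seen as active by
    TCollector, instead of being killed for apparent inactivity
    while their output sits unflushed in the stdout buffer.
    """
    if timestamp is None:
        timestamp = int(time.time())
    line = ("%s %d %s %s" % (metric, timestamp, value, tags)).rstrip()
    sys.stdout.write(line + "\n")
    sys.stdout.flush()
```

The per-line flush costs a syscall per metric, but collector output volumes are small enough that correctness here is worth far more than the saved buffering.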