I do not have a lot of time to debug this, unfortunately, but I thought I should at least report our observations.
We accidentally passed an invalid value to `Metriks.gauge(key).set(value)`. Specifically, we passed a "humanized size" string (e.g. "10KB") instead of the numerical value.
We then saw all other metrics stop reporting to Librato as well. In other words, the invalid value was just one of many metrics (most of the others being `meter(key).mark` and `timer(key).time`), yet once it was passed to the gauge, no metrics at all were reported until the bug was fixed.
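For reference, the fix on our side amounts to a guard like the one below. This is only a minimal sketch: `numeric_gauge_value` is a hypothetical helper of ours, not part of Metriks, and it simply refuses anything that Ruby's `Float()` cannot parse.

```ruby
# Hypothetical helper (not part of Metriks): returns a Float for values that
# are clearly numeric, and nil for anything else, such as "10KB".
def numeric_gauge_value(value)
  return value.to_f if value.is_a?(Numeric)

  # Kernel#Float is strict: it raises on "10KB" instead of returning 10.0.
  Float(value)
rescue ArgumentError, TypeError
  nil
end

# Intended use at the call site (commented out so the sketch runs without the gem):
#   value = numeric_gauge_value(size)
#   Metriks.gauge(key).set(value) unless value.nil?
```

With this in place a humanized string is dropped before it reaches the gauge, rather than being handed to the reporter.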
Also, we do not set an on_error callback, so we could not inspect this in the logs.
I suspect that an invalid value passed to the reporter might somehow clog the queue, with the same exception being raised internally over and over?
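To make the suspicion concrete, here is a toy model of the failure mode I have in mind. It is emphatically not Metriks' actual code: `ToyReporter` and everything in it are made up for illustration, and it assumes that a flush serializes all metrics in one batch and that errors are swallowed when no `on_error` callback is given.

```ruby
# Toy reporter, NOT Metriks' implementation. It only illustrates how a single
# bad value could block every metric if a flush is all-or-nothing.
class ToyReporter
  attr_reader :sent, :errors

  def initialize(on_error: nil)
    @gauges = {}
    @sent = []
    @errors = 0
    # Assumption: without a callback, errors are silently ignored.
    @on_error = on_error || proc { |_ex| }
  end

  def set(key, value)
    @gauges[key] = value
  end

  # Serializes every gauge in one batch. One unparseable value raises
  # before anything is sent, so the valid metrics are lost as well.
  def flush
    payload = @gauges.map { |key, value| [key, Float(value)] }
    @sent.concat(payload)
  rescue ArgumentError, TypeError => ex
    @errors += 1
    @on_error.call(ex)
  end
end

reporter = ToyReporter.new
reporter.set("requests", 42)          # valid metric
reporter.set("upload.size", "10KB")   # the accidental humanized string
3.times { reporter.flush }            # same exception on every flush
```

In this model nothing is ever sent while the bad value stays in the registry, and without an `on_error` callback nothing shows up in the logs either, which matches what we observed. Whether the real reporter behaves this way is exactly what I have not had time to verify.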