fandrei / AppMetrics

Apache License 2.0

Identify whether our measurements are affected by the coordinated omission problem #136

Closed mrdavidlaing closed 11 years ago

mrdavidlaing commented 11 years ago

Full description of the issue at: http://labs.cityindex.com/labs-team/2013/03/13/the-coordinated-omission-problem/

This encompasses:

  1. Extending our reports to detect where our current measurements are affected
  2. Testing our measurements against a service which exhibits "pausing" behaviour, and ensuring that our reports reflect this pause
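
A rough sketch of what the test in point 2 could look like, in simulated time so nothing actually sleeps. Everything here is an assumption for illustration, not AppMetrics code: `service_latency` is a hypothetical service that stalls between t=300 s and t=480 s, `closed_loop_poll` models a poller that waits for each response before sending the next request, and `open_loop_poll` models one that sticks to its intended schedule.

```python
def service_latency(sent_at, pause_start=300.0, pause_end=480.0, normal=0.2):
    """Hypothetical pausing service: a request sent during the pause only
    completes once the pause is over."""
    if pause_start <= sent_at < pause_end:
        return (pause_end - sent_at) + normal
    return normal


def closed_loop_poll(duration, interval, latency_fn):
    """Poller that cannot send the next request until the previous one has
    returned, so a stalled request delays (omits) the following polls."""
    samples, t = [], 0.0
    while t < duration:
        latency = latency_fn(t)
        samples.append((t, latency))
        t += max(interval, latency)
    return samples


def open_loop_poll(duration, interval, latency_fn):
    """Poller that sends on its intended schedule regardless of how long
    earlier requests took."""
    samples, tick = [], 0
    while tick * interval < duration:
        sent_at = tick * interval
        samples.append((sent_at, latency_fn(sent_at)))
        tick += 1
    return samples


def slow(samples, threshold=1.0):
    """Samples whose latency exceeds `threshold` seconds."""
    return [s for s in samples if s[1] > threshold]


closed = closed_loop_poll(900.0, 60.0, service_latency)
scheduled = open_loop_poll(900.0, 60.0, service_latency)
print(len(closed), len(slow(closed)))        # 13 1
print(len(scheduled), len(slow(scheduled)))  # 15 3
```

With a 3-minute pause, the closed-loop poller records 13 of the 15 intended samples and sees the pause only once, while the scheduled poller sees it three times. A report could flag minutes where the sample count falls short of `duration / interval`, keeping in mind that a shortfall can have other causes too.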

sopel commented 11 years ago

Sounds reasonable, nicely spotted! Would it be possible to include the maximum in this chart as well, since it is the most obvious indicator according to Gil Tene’s explanation?

mrdavidlaing commented 11 years ago

Same graph, but including max (note that the line colors have changed).

[Image: Capture]

fandrei commented 11 years ago

> Observe how as median latency goes up, the number of measurements goes down. Evidence of co-ordinated omission?

Not necessarily. The number of measurements can also go down because of an increased exception count. (I'm afraid the exception count is currently not reported due to a bug.) It's a common pattern that the exception count rises along with latency when server problems occur.

Data is currently polled once a minute, so to affect our data significantly, the max latency of a single request would have to be of a similar length. Only a very small number of requests take longer than 10 seconds, and I've found only 9 values higher than 30 seconds in our live latency data to date. Thus, I don't think this effect currently distorts our data significantly.
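
To make that reasoning concrete, here is a minimal sketch of the usual expected-interval correction (the same idea as the corrected recording in Gil Tene's HdrHistogram, reimplemented here by hand rather than calling that library). The function names and sample values are made up for illustration. A latency only triggers back-filling once it reaches twice the polling interval, because only then has a later scheduled poll been skipped entirely.

```python
def backfill_omitted(latencies, expected_interval):
    """Return latencies plus synthetic samples for polls that were skipped.

    For each recorded value, add value - interval, value - 2 * interval, ...
    for as long as the synthetic value is still at least one interval long.
    """
    corrected = []
    for value in latencies:
        corrected.append(value)
        missing = value - expected_interval
        while missing >= expected_interval:
            corrected.append(missing)
            missing -= expected_interval
    return corrected


def count_affected(latencies, expected_interval):
    """Number of samples long enough to have hidden at least one poll."""
    return sum(1 for value in latencies if value >= 2 * expected_interval)


# Illustrative values in seconds, polled every 60 s.
typical = [0.4, 5.0, 12.0, 31.0]
print(backfill_omitted(typical, 60.0))   # [0.4, 5.0, 12.0, 31.0]
print(count_affected(typical, 60.0))     # 0

stalled = [0.2, 180.0, 0.3]
print(backfill_omitted(stalled, 60.0))   # [0.2, 180.0, 120.0, 60.0, 0.3]
```

With latencies in the 10 to 30 second range and a 60 second interval, the correction adds nothing, which is consistent with the conclusion above.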

However, I think we may have another problem. With a polling period of 1 minute, many short periods of latency degradation can simply slip through our net.
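
A small sketch of that sampling gap, using hypothetical degradation windows (start and end in seconds) and polls sent at exact multiples of the interval. As a rule of thumb, if a degradation of length d shorter than the interval starts at a random moment, a fixed-period poller lands inside it with probability d / interval, so a 10 second blip is seen by a 60 second poller only about one time in six.

```python
def detected_windows(windows, interval, duration):
    """Return the degradation windows that contain at least one poll instant.

    `windows` is a list of (start, end) pairs; polls are assumed to be sent
    at 0, interval, 2 * interval, ... up to (but not including) `duration`.
    """
    poll_times = [i * interval for i in range(int(duration // interval))]
    return [
        (start, end)
        for start, end in windows
        if any(start <= t < end for t in poll_times)
    ]


# Three short degradations within five minutes, polled once a minute.
windows = [(70, 90), (130, 150), (175, 185)]
seen = detected_windows(windows, interval=60, duration=300)
print(seen)  # [(175, 185)]
```

Only the window that happens to straddle the poll at t=180 s shows up; the other two leave no trace in the data.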

mrdavidlaing commented 11 years ago

  1. You are right; the fact that our polling period of 1 min is much larger than our longest measured request (10 sec) means our data hasn't fallen foul of co-ordinated omission.
  2. Practically, we're not going to be allowed to poll more frequently than once per minute, so this is just data we can't gather.