mitchell-es opened 7 years ago
I should add that this is not the same issue as handling the decreasing counters we see sometimes from the Junipers. This is an issue I found while investigating other 'over line rate' messages we are seeing logged.
Here's a second example:
```
(esxsnmp-legacy)[mitchell@snmp-staging1 /ssd/esxsnmp/data]$ ~/esxsnmp-legacy/util/debug_over_line_rate.py snll-mr2/FastPollHC/ifHCInOctets/xe-0_2_0.919 1502772063
/ssd/esxsnmp/data
Tue Aug 15 04:39:03 2017 1502771943 Counter64: [1502771948/0x1: 468678771371] 468678771371 Aggregate: [1502771940/0x1: average=21542.3 delta=646270 min=nan max=nan]
Tue Aug 15 04:39:33 2017 1502771973 Counter64: [1502771978/0x1: 468679465728] 694357 Aggregate: [1502771970/0x1: average=19061.6 delta=571847 min=nan max=nan]
Tue Aug 15 04:40:03 2017 1502772003 Counter64: [1502772008/0x1: 468679993027] 527299 Aggregate: [1502772000/0x1: average=22387.4 delta=671623 min=nan max=nan]
Tue Aug 15 04:40:33 2017 1502772033 Counter64: [1502772038/0x1: 468680717131] 724104 Aggregate: [1502772030/0x1: average=18944.3 delta=568329 min=nan max=nan]
Tue Aug 15 04:41:03 2017 1502772063 Counter64: [1502772070/0x1: 468681262927] 545796 Aggregate: [1502772060/0x1: average=5.12215e+09 delta=1.53665e+11 min=nan max=nan]
Tue Aug 15 04:41:33 2017 1502772093 Counter64: [1502772093/0x0: 0] -468681262927 Aggregate: [1502772090/0x1: average=7.68322e+09 delta=2.30497e+11 min=nan max=nan]
Tue Aug 15 04:42:03 2017 1502772123 Counter64: [1502772131/0x1: 468676347867] 468676347867 Aggregate: [1502772120/0x1: average=2.8172e+09 delta=8.45159e+10 min=nan max=nan]
Tue Aug 15 04:42:33 2017 1502772153 Counter64: [1502772161/0x1: 468677057661] 709794 Aggregate: [1502772150/0x1: average=21018.5 delta=630554 min=nan max=nan]
Tue Aug 15 04:43:03 2017 1502772183 Counter64: [1502772191/0x1: 468677642339] 584678 Aggregate: [1502772180/0x1: average=20338.2 delta=610145 min=nan max=nan]
Tue Aug 15 04:43:33 2017 1502772213 Counter64: [1502772222/0x1: 468678288059] 645720 Aggregate: [1502772210/0x1: average=20808.5 delta=624255 min=nan max=nan]
Tue Aug 15 04:44:03 2017 1502772243 Counter64: [1502772253/0x1: 468678932685] 644626 Aggregate: [1502772240/0x1: average=20025.9 delta=600777 min=nan max=nan]
Tue Aug 15 04:44:33 2017 1502772273 Counter64: [1502772283/0x1: 468679515832] 583147 Aggregate: [1502772270/0x1: average=20022.8 delta=600684 min=nan max=nan]
```
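For anyone scripting against this output: each row's Counter64 field appears to follow the layout `[<poll_timestamp>/0x<flags>: <raw_value>]`, where 0x1 seems to mark a valid sample and 0x0 the invalid zero sample in the middle of the run. A rough parser sketch (the field meanings are my reading of the output, not confirmed against the code):

```python
import re

# Assumed layout of the Counter64 field in debug_over_line_rate.py output:
#   Counter64: [<poll_timestamp>/0x<flags>: <raw_value>]
COUNTER_RE = re.compile(r"Counter64: \[(\d+)/0x([0-9a-fA-F]+): (\d+)\]")

def parse_counter(line):
    """Return (poll_timestamp, flags, raw_value) or None if the line
    has no Counter64 field."""
    m = COUNTER_RE.search(line)
    if not m:
        return None
    ts, flags, value = m.groups()
    return int(ts), int(flags, 16), int(value)
```

With that, the invalid sample above parses to flags 0, which is what makes it easy to spot programmatically.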
So interestingly both of these are actually decreasing counters as well. There is a gap with a missing data point, but the counters themselves go down between samples:
```
[1501635210/0x1: 300290825047]
[1501635268/0x0: 0]
[1501635270/0x1: 300290814799]
300290814799 - 300290825047 = -10248
(timestamps 1501635270 - 1501635210 = 60)
```

```
[1502772070/0x1: 468681262927]
[1502772093/0x0: 0]
[1502772131/0x1: 468676347867]
468676347867 - 468681262927 = -4915060
(timestamps 1502772131 - 1502772070 = 61)
```
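Worth noting: neither of these negative deltas looks like a 64-bit counter wrap. For a real wrap, `cur + 2**64 - prev` would come out small; here it comes out astronomically large, so these really are decreasing counters. A sketch of how the two cases could be told apart (the `2**32` threshold is an arbitrary assumption, not anything from the code):

```python
MAX_COUNTER64 = 2 ** 64

def classify_delta(prev, cur, wrap_threshold=2 ** 32):
    """Classify a Counter64 transition as ('increase'|'wrap'|'decrease', delta).

    A genuine wrap makes the modular delta (cur + 2**64 - prev) small;
    a decreasing counter makes it implausibly huge. wrap_threshold is an
    assumed cutoff between the two.
    """
    if cur >= prev:
        return "increase", cur - prev
    wrapped = cur + MAX_COUNTER64 - prev
    if wrapped < wrap_threshold:
        return "wrap", wrapped
    return "decrease", cur - prev
```

Running the two pairs above through this gives "decrease" both times, which matches the arithmetic.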
So I'm going to implement something to deal with the case where the delta is negative.
I'm not sure how I missed that the counter was decreasing. Too bad I already closed the decreasing-counter case with Juniper. I can revisit it, though. Since the behavior seems a bit new, with the missing sample and all, I will probably try to get a new packet trace of it happening.
This is one of many examples of spikes we are seeing in the data. A glance at the code makes me think it is supposed to handle this gracefully, but the results say it is not. Specifically, note that the fourth sample is invalid/zero, yet TSDB is generating an aggregate value using a delta based on that zero anyway.
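Something like this is what I have in mind for the fix, sketched in Python against a hypothetical (timestamp, flags, value) sample tuple rather than the actual TSDB internals: skip flag-0x0 samples entirely when choosing delta endpoints, and drop negative deltas instead of folding them into the aggregate:

```python
def rate_deltas(samples):
    """Compute per-sample deltas from (timestamp, flags, value) tuples.

    Sketch of the intended handling, not the real TSDB code:
    - samples with the validity bit (0x1) clear are skipped entirely,
      so an invalid zero never becomes a delta endpoint;
    - negative deltas (decreasing counter) are dropped rather than
      turned into a huge bogus aggregate.
    """
    deltas = []
    prev_value = None
    for ts, flags, value in samples:
        if not flags & 0x1:
            continue  # invalid sample: ignore, don't reset state
        if prev_value is not None:
            delta = value - prev_value
            if delta >= 0:
                deltas.append((ts, delta))
            # else: decreasing counter; skip instead of spiking
        prev_value = value
    return deltas
```

Fed the problem run from the log above (valid 468681262927, invalid 0, valid 468676347867, valid 468677057661), this skips the zero, drops the -4915060 delta, and keeps only the sane 709794 delta for the last sample.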