mitchell-es opened 7 years ago
I should add that this is not the same issue as handling the decreasing counters we see sometimes from the Junipers. This is an issue I found while investigating other 'over line rate' messages we are seeing logged.
Here's a second example:
```
(esxsnmp-legacy)[mitchell@snmp-staging1 /ssd/esxsnmp/data]$ ~/esxsnmp-legacy/util/debug_over_line_rate.py snll-mr2/FastPollHC/ifHCInOctets/xe-0_2_0.919 1502772063
/ssd/esxsnmp/data
Tue Aug 15 04:39:03 2017 1502771943 Counter64: [1502771948/0x1: 468678771371] 468678771371 Aggregate: [1502771940/0x1: average=21542.3 delta=646270 min=nan max=nan]
Tue Aug 15 04:39:33 2017 1502771973 Counter64: [1502771978/0x1: 468679465728] 694357 Aggregate: [1502771970/0x1: average=19061.6 delta=571847 min=nan max=nan]
Tue Aug 15 04:40:03 2017 1502772003 Counter64: [1502772008/0x1: 468679993027] 527299 Aggregate: [1502772000/0x1: average=22387.4 delta=671623 min=nan max=nan]
Tue Aug 15 04:40:33 2017 1502772033 Counter64: [1502772038/0x1: 468680717131] 724104 Aggregate: [1502772030/0x1: average=18944.3 delta=568329 min=nan max=nan]
Tue Aug 15 04:41:03 2017 1502772063 Counter64: [1502772070/0x1: 468681262927] 545796 Aggregate: [1502772060/0x1: average=5.12215e+09 delta=1.53665e+11 min=nan max=nan]
Tue Aug 15 04:41:33 2017 1502772093 Counter64: [1502772093/0x0: 0] -468681262927 Aggregate: [1502772090/0x1: average=7.68322e+09 delta=2.30497e+11 min=nan max=nan]
Tue Aug 15 04:42:03 2017 1502772123 Counter64: [1502772131/0x1: 468676347867] 468676347867 Aggregate: [1502772120/0x1: average=2.8172e+09 delta=8.45159e+10 min=nan max=nan]
Tue Aug 15 04:42:33 2017 1502772153 Counter64: [1502772161/0x1: 468677057661] 709794 Aggregate: [1502772150/0x1: average=21018.5 delta=630554 min=nan max=nan]
Tue Aug 15 04:43:03 2017 1502772183 Counter64: [1502772191/0x1: 468677642339] 584678 Aggregate: [1502772180/0x1: average=20338.2 delta=610145 min=nan max=nan]
Tue Aug 15 04:43:33 2017 1502772213 Counter64: [1502772222/0x1: 468678288059] 645720 Aggregate: [1502772210/0x1: average=20808.5 delta=624255 min=nan max=nan]
Tue Aug 15 04:44:03 2017 1502772243 Counter64: [1502772253/0x1: 468678932685] 644626 Aggregate: [1502772240/0x1: average=20025.9 delta=600777 min=nan max=nan]
Tue Aug 15 04:44:33 2017 1502772273 Counter64: [1502772283/0x1: 468679515832] 583147 Aggregate: [1502772270/0x1: average=20022.8 delta=600684 min=nan max=nan]
```
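For anyone scripting against this output: each row's Counter64 field appears to follow the layout `[<poll_timestamp>/0x<flags>: <raw_value>]`, where 0x1 seems to mark a valid sample and 0x0 the invalid zero sample in the middle of the run. A rough parser sketch (the field meanings are my reading of the output, not confirmed against the code):

```python
import re

# Assumed layout of the Counter64 field in debug_over_line_rate.py output:
#   Counter64: [<poll_timestamp>/0x<flags>: <raw_value>]
COUNTER_RE = re.compile(r"Counter64: \[(\d+)/0x([0-9a-fA-F]+): (\d+)\]")

def parse_counter(line):
    """Return (poll_timestamp, flags, raw_value) or None if the line
    has no Counter64 field."""
    m = COUNTER_RE.search(line)
    if not m:
        return None
    ts, flags, value = m.groups()
    return int(ts), int(flags, 16), int(value)
```

With that, the invalid sample above parses to flags 0, which is what makes it easy to spot programmatically.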
So interestingly both of these are actually decreasing counters as well. There is a gap with a missing data point, but the counters themselves go down between samples:
```
[1501635210/0x1: 300290825047]
[1501635268/0x0: 0]
[1501635270/0x1: 300290814799]
300290814799 - 300290825047 = -10248
(timestamps 1501635270 - 1501635210 = 60)
```

```
[1502772070/0x1: 468681262927]
[1502772093/0x0: 0]
[1502772131/0x1: 468676347867]
468676347867 - 468681262927 = -4915060
(timestamps 1502772131 - 1502772070 = 61)
```
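Worth noting: neither of these negative deltas looks like a 64-bit counter wrap. For a real wrap, `cur + 2**64 - prev` would come out small; here it comes out astronomically large, so these really are decreasing counters. A sketch of how the two cases could be told apart (the `2**32` threshold is an arbitrary assumption, not anything from the code):

```python
MAX_COUNTER64 = 2 ** 64

def classify_delta(prev, cur, wrap_threshold=2 ** 32):
    """Classify a Counter64 transition as ('increase'|'wrap'|'decrease', delta).

    A genuine wrap makes the modular delta (cur + 2**64 - prev) small;
    a decreasing counter makes it implausibly huge. wrap_threshold is an
    assumed cutoff between the two.
    """
    if cur >= prev:
        return "increase", cur - prev
    wrapped = cur + MAX_COUNTER64 - prev
    if wrapped < wrap_threshold:
        return "wrap", wrapped
    return "decrease", cur - prev
```

Running the two pairs above through this gives "decrease" both times, which matches the arithmetic.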
So I'm going to implement something to deal with the case where the delta is negative.
I'm not sure how I missed that the counter was decreasing. Too bad I already closed the decreasing-counter case with Juniper. I can revisit it, though. Since the behavior seems a bit new, with the missing sample and all, I will probably try to get a new packet trace of it happening.
This is one of many examples of spikes we are seeing in the data. A glance at the code makes me think it is supposed to handle this gracefully, but the results say it is not. Specifically, note that the fourth sample is invalid/zero, yet TSDB is generating an aggregate value using a delta based on that zero anyway.
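Something like this is what I have in mind for the fix, sketched in Python against a hypothetical (timestamp, flags, value) sample tuple rather than the actual TSDB internals: skip flag-0x0 samples entirely when choosing delta endpoints, and drop negative deltas instead of folding them into the aggregate:

```python
def rate_deltas(samples):
    """Compute per-sample deltas from (timestamp, flags, value) tuples.

    Sketch of the intended handling, not the real TSDB code:
    - samples with the validity bit (0x1) clear are skipped entirely,
      so an invalid zero never becomes a delta endpoint;
    - negative deltas (decreasing counter) are dropped rather than
      turned into a huge bogus aggregate.
    """
    deltas = []
    prev_value = None
    for ts, flags, value in samples:
        if not flags & 0x1:
            continue  # invalid sample: ignore, don't reset state
        if prev_value is not None:
            delta = value - prev_value
            if delta >= 0:
                deltas.append((ts, delta))
            # else: decreasing counter; skip instead of spiking
        prev_value = value
    return deltas
```

Fed the problem run from the log above (valid 468681262927, invalid 0, valid 468676347867, valid 468677057661), this skips the zero, drops the -4915060 delta, and keeps only the sane 709794 delta for the last sample.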