Uninett / nav

Network Administration Visualized
GNU General Public License v3.0

Gaps in graphs for aggregated interfaces #2147

Closed: oh01 closed this issue 4 years ago

oh01 commented 4 years ago

Hello, recently I've migrated from 4.9.8 to 5.0.5 (both were originally pre-packaged OVF images). I dumped the whole database and imported it into the new server along with the whisper files. Everything seems to be working except for the graphs on aggregated interfaces, which always have huge gaps (about 10 minutes of plotted data followed by hours of gaps). Graphs on aggregated interfaces look like this: [screenshot]. On a "single" interface everything plots correctly, both port activity and port metrics.

I've tried to troubleshoot it according to https://nav.uninett.no/doc/latest/faq/graph_gaps.html, but with no luck. Do you think this could be a performance issue (even though everything worked correctly with NAV 4 on the same server)? How else could I help identify the issue?

So far I've extended the UDP cache to 16 MB, there are no errors from ipdevpoll, and the collection interval looks like this:

2020-04-03 07:58:19,566 [INFO schedule.netboxjobscheduler] [5minstats ComwareSwitch] 5minstats for ComwareSwitch completed in 0:00:33.535622. next run in 0:04:26.464419.
2020-04-03 08:03:26,715 [INFO schedule.netboxjobscheduler] [5minstats ComwareSwitch] 5minstats for ComwareSwitch completed in 0:00:40.668890. next run in 0:04:19.331135.
2020-04-03 08:08:20,931 [INFO schedule.netboxjobscheduler] [5minstats ComwareSwitch] 5minstats for ComwareSwitch completed in 0:00:34.870257. next run in 0:04:25.129785.
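
For reference, if the "UDP cache" above means the kernel's UDP receive buffer (which is what the graph_gaps FAQ suggests tuning), a minimal sketch of that change, using the 16 MB figure mentioned, would be:

# Assumption: raise the kernel UDP receive buffer limits to 16 MB (16777216 bytes)
# so carbon's UDP listener can absorb bursts of metrics without drops.
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.rmem_default=16777216
# To persist across reboots, put the same values in /etc/sysctl.conf.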

Carbon's cache looks like this: [screenshot]

Thanks for any ideas on how to troubleshoot it further.

lunkwill42 commented 4 years ago

I'm assuming the timestamps in the log excerpt match the blank part at the right end of the graph. I can't see anything obviously wrong here.

oh01 commented 4 years ago

Hello, yes, that log matches the end of the graph.

  1. It's always Bridge-AggregationXX where XX is a number (as shown in the first screenshot)

  2. I've found only var/log/carbon/tagdb.log.1, and there is just one error from two days ago, from a different switch (which has no aggregated links and is plotting correctly), which seems unrelated (see the side note below):

18/04/2020 13:05:59 :: Error tagging nav.devices.s-another_switch.ipdevpoll.1minstats.runtime: Error requesting http://127.0.0.1:8000/tags/tagMultiSeries: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

  3. graphite-carbon: Installed: 1.1.5-0.1

Thanks for any further help.
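
Side note on the tagging error above: if Graphite metric tags aren't actually being used, carbon can be told not to contact the tag database at all, which silences this class of error (a sketch of an optional cleanup, not a fix for the graph gaps; it assumes the ENABLE_TAGS setting available in carbon 1.1.x):

# carbon.conf, [cache] section -- only relevant if Graphite tags are unused;
# disables the calls to http://127.0.0.1:8000/tags/tagMultiSeries seen above
ENABLE_TAGS = False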

lunkwill42 commented 4 years ago

I can't think of a reason why you would have a problem with only your aggregated ports, unless:

  1. There is an actual device-specific problem with collecting the data from just the aggregated ports
  2. There is something about the naming of the aggregated ports that causes invalid Graphite metric names to be generated.

I would suggest enabling debug logging for the subsystem that pushes data from ipdevpoll to a carbon backend, just to see what it actually transmits. Debug logging this part may cause your logs to grow very large, so I suggest you only do this while running ipdevpoll against a single affected device. An example:

  1. Make a temporary copy of NAV's logging config: cp /etc/nav/logging.conf /tmp/logging.conf
  2. Edit /tmp/logging.conf and, in the [levels] section, add nav.metrics.carbon = DEBUG.
  3. Then:
export NAV_LOGGING_CONF=/tmp/logging.conf
ipdevpolld -J 5minstats -n AFFECTED-DEVICE

This process should now debug log every carbon metric sent to your configured backend. Look for mentions of your aggregated ports in the submitted stats. Maybe then we can figure out whether the problem is in NAV or in Graphite.
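
For reference, the relevant part of the temporary /tmp/logging.conf from step 2 might end up looking something like this (a minimal sketch; any entries already present in your copied logging.conf stay as they are):

# /tmp/logging.conf (temporary copy) -- only the [levels] section shown
[levels]
# this line enables debug logging for the code that pushes metrics to carbon
nav.metrics.carbon = DEBUG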

oh01 commented 4 years ago

I did just that and the log looks fine. An excerpt looks like this:

('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifInOctets', (1587532480.8052769, 41283407449781)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifOutOctets', (1587532480.8052769, 9838116181787)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifInBroadcastPkts', (1587532480.8052769, 136485405)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifOutBroadcastPkts', (1587532480.8052769, 757396566)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifInMulticastPkts', (1587532480.8052769, 20124141)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifOutMulticastPkts', (1587532480.8052769, 167399694)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifInErrors', (1587532480.8052769, 0)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifOutErrors', (1587532480.8052769, 0)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifInUcastPkts', (1587532480.8052769, 1906169055)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifOutUcastPkts', (1587532480.8052769, 617627065)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifOutDiscards', (1587532480.8052769, 0))

The whole log is here: https://pastebin.com/dY9zh1r8

I've changed the name of the switch in the log, but it follows the same naming convention (no special characters, only alphanumerics, dashes and underscores).

lunkwill42 commented 4 years ago

I can't see anything wrong with that output, so that leaves interval or schema problems.

I'm assuming you used `whisper-info` to confirm the interval schema of the whisper files corresponding to the aggregate interfaces?
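
For reference, that check might look something like this (the exact script name and whisper storage path depend on how Graphite is packaged; this sketch assumes the Debian graphite-carbon layout):

# inspect the archive/retention schema of one aggregated-port metric
whisper-info /var/lib/graphite/whisper/nav/devices/s-Comware_switch/ports/Bridge-Aggregation1/ifInOctets.wsp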

oh01 commented 4 years ago

Even though I used the provided OVF image and didn't touch the preconfigured interval, I've just checked, and it looks correct. For Bridge-Aggregation1/ifInOctets.wsp it looks like this:

aggregationMethod: last
maxRetention: 51840000
xFilesFactor: 0.5
fileSize: 45568

Archive 0
offset: 64
secondsPerPoint: 300
points: 2016
retention: 604800
size: 24192

Archive 1
offset: 24256
secondsPerPoint: 1800
points: 576
retention: 1036800
size: 6912

Archive 2
offset: 31168
secondsPerPoint: 7200
points: 600
retention: 4320000
size: 7200

Archive 3
offset: 38368
secondsPerPoint: 86400
points: 600
retention: 51840000
size: 7200
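
For reference, those four archives (300s x 2016 points, 1800s x 576, 7200s x 600, 86400s x 600) correspond to a storage-schemas.conf retention spec along these lines (the section name and pattern here are illustrative, not copied from the actual config):

[nav-ports]
pattern = ^nav\.devices\..*\.ports\.
retentions = 5m:7d,30m:12d,2h:50d,1d:600d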

lunkwill42 commented 4 years ago

@oh01 apologies for losing track of this thread. The schema looks fine. Unfortunately, your pastebin has expired, but if you could use the same method and track what timestamps and data values ipdevpoll is actually posting for, say, nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifInOctets over time (for at least an hour), you could then try to correlate that with missing or present traffic data in the graph.

This would at least show whether ipdevpoll isn't producing the data or whether carbon isn't receiving or storing it.
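
To check the carbon/whisper side of that comparison, one option (again a sketch, assuming the Debian graphite-carbon script names and storage path) is to dump the raw datapoints whisper has stored for the same metric and period, and compare the timestamps against the values seen in ipdevpoll's debug log:

# list what whisper actually stored for this metric over the last hour
whisper-fetch --from=$(date -d '1 hour ago' +%s) --pretty \
    /var/lib/graphite/whisper/nav/devices/s-Comware_switch/ports/Bridge-Aggregation1/ifInOctets.wsp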

And another question: So the octet counters for the aggregated ports have gaps. How about packet counters and other counters for the same port, are they also gappy?

oh01 commented 4 years ago

@lunkwill42 Thanks for getting back to me, I'll try it in the near future and let you know the result.

The issue is with all counters on aggregated links and the gaps occur at the same time across all (aggregated) interfaces.

lunkwill42 commented 4 years ago

I still have no idea what's going on here, unfortunately. Closing the issue due to lack of feedback, but please feel free to add more comments/reopen if you get to do more debugging.