Closed oh01 closed 4 years ago
I'm assuming the timestamps on the log excerpt match the blank part at the right end of the graph. Can't see anything specifically obvious here.
apt-cache policy graphite-carbon
Hello, yes, that log matches the end of the graph.
It's always Bridge-AggregationXX where XX is a number (as shown in the first screenshot)
I've found only /var/log/carbon/tagdb.log.1,
and there is just one error from two days ago, from a different switch (which has no aggregated links and is plotting correctly), which seems unrelated:
18/04/2020 13:05:59 :: Error tagging nav.devices.s-another_switch.ipdevpoll.1minstats.runtime: Error requesting http://127.0.0.1:8000/tags/tagMultiSeries: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
graphite-carbon: Installed: 1.1.5-0.1
Thanks for any further help.
I can't think of a reason why you would have a problem with only your aggregated ports, unless
I would suggest enabling debug logging for the subsystem that pushes data from ipdevpoll to a carbon backend, just to see what it actually transmits. Debug logging this part may cause your logs to grow really large, so I suggest you only do this while running ipdevpoll against a single affected device. An example:
cp /etc/nav/logging.conf /tmp/logging.conf
Edit /tmp/logging.conf and, in the [levels] section, add nav.metrics.carbon = DEBUG. Then:
export NAV_LOGGING_CONF=/tmp/logging.conf
ipdevpolld -J 5minstats -n AFFECTED-DEVICE
This process should now debug log every carbon metric sent to your configured backend. Look for mentions of your aggregated ports in the submitted stats. Maybe then we can figure out whether the problem is in NAV or in Graphite.
I did just that and the log looks fine. Excerpt looks like this:
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifInOctets', (1587532480.8052769, 41283407449781)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifOutOctets', (1587532480.8052769, 9838116181787)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifInBroadcastPkts', (1587532480.8052769, 136485405)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifOutBroadcastPkts', (1587532480.8052769, 757396566)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifInMulticastPkts', (1587532480.8052769, 20124141)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifOutMulticastPkts', (1587532480.8052769, 167399694)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifInErrors', (1587532480.8052769, 0)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifOutErrors', (1587532480.8052769, 0)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifInUcastPkts', (1587532480.8052769, 1906169055)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifOutUcastPkts', (1587532480.8052769, 617627065)),
('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifOutDiscards', (1587532480.8052769, 0))
Whole log is here: https://pastebin.com/dY9zh1r8
I've changed the name of the switch in the log, but it follows the same naming convention (no special characters; only alphanumerics, dashes and underscores).
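For what it's worth, timestamps in a batch like that can be checked mechanically against the 300-second archive step. A minimal sketch (the datapoints are copied from the debug excerpt above; whisper quantizes timestamps down to the start of their interval):

```python
# Sketch: check which 300 s whisper slot each carbon datapoint lands in.
# Datapoints copied from the ipdevpoll debug excerpt above.
datapoints = [
    ('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifInOctets',
     (1587532480.8052769, 41283407449781)),
    ('nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifOutOctets',
     (1587532480.8052769, 9838116181787)),
]

STEP = 300  # secondsPerPoint of the highest-resolution archive


def slot(ts, step=STEP):
    # Whisper floors a timestamp to the start of its archive interval
    return int(ts) - (int(ts) % step)


for name, (ts, value) in datapoints:
    print(name.rsplit('.', 1)[-1], slot(ts), value)
```

Two batches whose timestamps quantize to the same slot overwrite each other rather than producing two points, so it can be worth confirming that consecutive 5-minute runs actually land in distinct slots.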
Can't see anything wrong with that output, so that leaves interval or schema problems.
I'm assuming you did use `whisper-info` to confirm the interval schema of the whisper files corresponding to the aggregate interfaces?
Even though I used the provided OVF image and didn't touch the preconfigured interval, I've just checked and it looks correct. For Bridge-Aggregation1/ifInOctets.wsp it looks like this:
aggregationMethod: last
maxRetention: 51840000
xFilesFactor: 0.5
fileSize: 45568
Archive 0
offset: 64
secondsPerPoint: 300
points: 2016
retention: 604800
size: 24192
Archive 1
offset: 24256
secondsPerPoint: 1800
points: 576
retention: 1036800
size: 6912
Archive 2
offset: 31168
secondsPerPoint: 7200
points: 600
retention: 4320000
size: 7200
Archive 3
offset: 38368
secondsPerPoint: 86400
points: 600
retention: 51840000
size: 7200
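As a quick cross-check, those whisper-info numbers are internally consistent. A small sketch, assuming whisper's standard on-disk layout (12 bytes per point, a 16-byte metadata header plus a 12-byte header per archive):

```python
# Sketch: sanity-check the whisper-info output above.
# Each archive must satisfy: retention == secondsPerPoint * points
# and size == points * 12 bytes.
archives = [
    (300, 2016, 604800, 24192),    # Archive 0
    (1800, 576, 1036800, 6912),    # Archive 1
    (7200, 600, 4320000, 7200),    # Archive 2
    (86400, 600, 51840000, 7200),  # Archive 3
]
POINT_SIZE = 12                    # bytes per (timestamp, value) point
HEADER = 16 + 12 * len(archives)   # metadata header + one header per archive

for spp, points, retention, size in archives:
    assert spp * points == retention
    assert points * POINT_SIZE == size

file_size = HEADER + sum(size for *_unused, size in archives)
print(file_size)  # matches the reported fileSize
```

The header size also matches Archive 0's reported offset of 64, so the file layout itself looks sound.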
@oh01 apologies for losing this thread. The schema looks fine. Unfortunately, your pastebin has expired, but it would help if you could use the same method to track what timestamps and data values ipdevpoll is actually posting for, say, nav.devices.s-Comware_switch.ports.Bridge-Aggregation1.ifInOctets
over time (for at least an hour), and try to correlate that with missing or present traffic data in the graph.
This would at least show whether ipdevpoll isn't producing the data, or carbon isn't receiving it/storing it.
And another question: So the octet counters for the aggregated ports have gaps. How about packet counters and other counters for the same port, are they also gappy?
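When correlating, it may also help to dump the raw stored points (e.g. with `whisper-fetch.py` against the ifInOctets.wsp file) and list which slots are empty. A sketch assuming whitespace-separated `timestamp value` output lines; the sample values here are invented for illustration:

```python
# Sketch: given "timestamp value" lines (e.g. from whisper-fetch.py),
# list the 300 s slots whose value is None, i.e. the gaps in the graph.
sample = """\
1587532200 41283407449781
1587532500 None
1587532800 None
1587533100 41283409000000
"""

gaps = []
for line in sample.splitlines():
    ts, value = line.split()
    if value == 'None':
        gaps.append(int(ts))

print(gaps)  # -> [1587532500, 1587532800]
```

Runs of consecutive gap slots would point at carbon not storing anything during those windows, which is exactly the distinction worth establishing here.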
@lunkwill42 Thanks for getting back to me, I'll try it in the near future and let you know the result.
The issue is with all counters on aggregated links and the gaps occur at the same time across all (aggregated) interfaces.
I still have no idea what's going on here, unfortunately. Closing the issue due to lack of feedback, but please feel free to add more comments/reopen if you get to do more debugging.
Hello, recently I've migrated from 4.9.8 to 5.0.5 (both were originally pre-packaged OVF images). I've dumped the whole database and imported it into the new server along with the whisper files. Everything seems to be working apart from graphs on aggregated interfaces: there are always huge gaps (it plots for about 10 minutes and then there are hours of gaps). Graphs on aggregated interfaces look like this: If it's a "single" interface, it plots correctly, both port activity and port metrics.
I've tried to troubleshoot it according to https://nav.uninett.no/doc/latest/faq/graph_gaps.html, but with no luck. Do you think this could be a performance issue (even though everything worked correctly with NAV 4 on the same server), or how could I help identify the issue?
So far I've extended the UDP cache to 16 MB; there are no errors from ipdevpoll, and the collection interval looks like this:
2020-04-03 07:58:19,566 [INFO schedule.netboxjobscheduler] [5minstats ComwareSwitch] 5minstats for ComwareSwitch completed in 0:00:33.535622. next run in 0:04:26.464419.
2020-04-03 08:03:26,715 [INFO schedule.netboxjobscheduler] [5minstats ComwareSwitch] 5minstats for ComwareSwitch completed in 0:00:40.668890. next run in 0:04:19.331135.
2020-04-03 08:08:20,931 [INFO schedule.netboxjobscheduler] [5minstats ComwareSwitch] 5minstats for ComwareSwitch completed in 0:00:34.870257. next run in 0:04:25.129785.
Carbon’s cache looks like this:
Thanks for any ideas on how to troubleshoot it further.
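For reference, the 16 MB UDP cache mentioned above typically corresponds to the kernel receive-buffer sysctls. A config sketch (the value matches the figure given above, not a tuned recommendation):

```shell
# Example only: raise the kernel UDP receive buffer cap and default to 16 MB
# so carbon's UDP listener can absorb bursts of incoming metrics.
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.rmem_default=16777216
```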