p-v-a opened this issue 1 year ago
+1, came here to say the same thing. Scraping devices that are not the HA primary causes false reports of devices being down:
```
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/fortimanager/status?vdom=*": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/ha-statistics": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/interface/select?vdom=*&include_vlan=true&include_aggregate=true": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/link-monitor?vdom=*": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/resource/usage?interval=1-min&scope=global": context canceled
2023/07/04 15:48:34 Warning: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/sensor-info?vdom=root": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/status": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/resource/usage?interval=1-min&vdom=*": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/ha-checksums?scope=global": context canceled
2023/07/04 15:48:34 Probe of "https://fortigatehostname.my.tld" failed, took 30.000 seconds
```
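As a stopgap until the slow probe is excluded, giving the scrape more headroom may help. A minimal sketch of the Prometheus side, assuming the usual multi-target exporter layout (job name, target, and exporter address are placeholders; Prometheus advertises its deadline to exporters via the `X-Prometheus-Scrape-Timeout-Seconds` header, but whether this exporter derives the 30 s cutoff from that header or from its own default is something I have not verified):

```yaml
scrape_configs:
  - job_name: fortigate              # placeholder job name
    metrics_path: /probe
    scrape_interval: 120s
    scrape_timeout: 60s              # must not exceed scrape_interval
    static_configs:
      - targets: ["https://fortigatehostname.my.tld"]
    relabel_configs:
      # standard multi-target pattern: pass the device URL as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9710  # placeholder exporter address
```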
Yes, I ended up separating the scrape jobs: one that scrapes each FortiGate host individually and includes only the metrics that make sense for an individual box, like this:
```yaml
probes:
  include:
    - System/SensorInfo
    - System/Status
    - System/Time/Clock
    - System/Resource/Usage
    - License/Status
    - WebUI/State
```
and then a second one that scrapes the cluster VIP and excludes the metrics above:
```yaml
probes:
  exclude:
    - System/SensorInfo
    - System/Status
    - System/Time/Clock
    - System/Resource/Usage
    - License/Status
    - WebUI/State
```
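Putting the two halves together, this is roughly how the split looks in the exporter's per-target configuration file. A sketch assuming the fortigate-key.yaml format from the exporter's README; hostnames and tokens are placeholders:

```yaml
# Per-member targets keep only box-local probes (repeat for each HA member).
"https://fortigate-a.my.tld":
  token: "<member-a-api-token>"
  probes:
    include:
      - System/SensorInfo
      - System/Status
      - System/Time/Clock
      - System/Resource/Usage
      - License/Status
      - WebUI/State

# The cluster VIP carries everything else, scraped once instead of per member.
"https://fortigate-vip.my.tld":
  token: "<vip-api-token>"
  probes:
    exclude:
      - System/SensorInfo
      - System/Status
      - System/Time/Clock
      - System/Resource/Usage
      - License/Status
      - WebUI/State
```

With this split, per-box health (sensors, licenses, status) keeps its own instance label per member, while cluster-wide metrics are collected once via the VIP rather than once per member.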
I observed this while troubleshooting fortigate_exporter timeouts that happen sporadically across our FortiGate fleet. Normally, scraping one device takes 2-3 seconds; for a secondary unit in an HA cluster, however, the same scrape takes close to 25 s, sporadically spilling over the 30 s default timeout.
The root cause looks like the API call to /api/v2/monitor/system/fortimanager/status?vdom=*: on a secondary it takes 10-20 s, and occasionally even longer.
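To confirm this outside the exporter, the endpoint can be timed directly against each unit. A sketch, assuming a REST API token (recent FortiOS releases accept it as a Bearer header; older ones take an access_token query parameter instead) and a placeholder hostname:

```sh
# Time the suspect call against the primary and the secondary; -k skips TLS
# verification, so drop it if the management certificate is properly trusted.
time curl -sk -H "Authorization: Bearer <api-token>" \
  "https://fortigatehostname.my.tld/api/v2/monitor/system/fortimanager/status?vdom=*"
```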
I'm not sure whether this is something you want to deal with or more of a FortiGate issue, but I'm filing this issue to document the behaviour.