Timeout when querying secondary unit in HA mode

p-v-a commented 1 year ago

I have observed this while troubleshooting fortigate_exporter timeouts, that happens sporadically across our fortigate fleet. normally scraping of one device would take 2-3 seconds. however in case of secondary unit in HA cluster, same scrape takes close to 25s. sporadically spilling over 30s default timeout.

It looks like the root cause it an API call to /api/v2/monitor/system/fortimanager/status?vdom=*. on a secondary that takes 10-20sec, and occasionally even longer.

I'm not sure if this is something you want to deal with, or it's more of a fortigate issue, however I'm creating issue here to document this behaviour.

aaronnad commented 1 year ago

+1, came here to say the same thing. Scraping devices that are not primary is causing issues with false reports of devices being down.

2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/fortimanager/status?vdom=*": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/ha-statistics": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/interface/select?vdom=*&include_vlan=true&include_aggregate=true": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/link-monitor?vdom=*": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/resource/usage?interval=1-min&scope=global": context canceled
2023/07/04 15:48:34 Warning: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/sensor-info?vdom=root": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/status": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/resource/usage?interval=1-min&vdom=*": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/ha-checksums?scope=global": context canceled
2023/07/04 15:48:34 Probe of "https://fortigatehostname.my.tld" failed, took 30.000 seconds

p-v-a commented 1 year ago

Yes, I ended up separating scrape jobs. one that scrape each fortigate host and includes only metrics that makes sense for individual box, like that:

      probes:
        include:
          - System/SensorInfo
          - System/Status
          - System/Time/Clock
          - System/Resource/Usage
          - License/Status
          - WebUI/State

and then second one that scrapes cluster VIP and excludes metrics above:

      probes:
        exclude:
          - System/SensorInfo
          - System/Status
          - System/Time/Clock
          - System/Resource/Usage
          - License/Status
          - WebUI/State

bluecmd / fortigate_exporter

Timeout when querying secondary unit in HA mode #208