bluecmd / fortigate_exporter

Prometheus exporter for Fortigate firewalls
GNU General Public License v3.0
241 stars 79 forks source link

Timeout when querying secondary unit in HA mode #208

Open p-v-a opened 1 year ago

p-v-a commented 1 year ago

I have observed this while troubleshooting fortigate_exporter timeouts, that happens sporadically across our fortigate fleet. normally scraping of one device would take 2-3 seconds. however in case of secondary unit in HA cluster, same scrape takes close to 25s. sporadically spilling over 30s default timeout.

It looks like the root cause it an API call to /api/v2/monitor/system/fortimanager/status?vdom=*. on a secondary that takes 10-20sec, and occasionally even longer.

I'm not sure if this is something you want to deal with, or it's more of a fortigate issue, however I'm creating issue here to document this behaviour.

aaronnad commented 1 year ago

+1, came here to say the same thing. Scraping devices that are not primary is causing issues with false reports of devices being down.

2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/fortimanager/status?vdom=*": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/ha-statistics": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/interface/select?vdom=*&include_vlan=true&include_aggregate=true": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/link-monitor?vdom=*": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/resource/usage?interval=1-min&scope=global": context canceled
2023/07/04 15:48:34 Warning: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/sensor-info?vdom=root": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/status": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/resource/usage?interval=1-min&vdom=*": context canceled
2023/07/04 15:48:34 Error: Get "https://fortigatehostname.my.tld/api/v2/monitor/system/ha-checksums?scope=global": context canceled
2023/07/04 15:48:34 Probe of "https://fortigatehostname.my.tld" failed, took 30.000 seconds
p-v-a commented 1 year ago

Yes, I ended up separating scrape jobs. one that scrape each fortigate host and includes only metrics that makes sense for individual box, like that:

      probes:
        include:
          - System/SensorInfo
          - System/Status
          - System/Time/Clock
          - System/Resource/Usage
          - License/Status
          - WebUI/State

and then second one that scrapes cluster VIP and excludes metrics above:

      probes:
        exclude:
          - System/SensorInfo
          - System/Status
          - System/Time/Clock
          - System/Resource/Usage
          - License/Status
          - WebUI/State