NetApp / harvest

Open-metrics endpoint for ONTAP and StorageGRID
https://netapp.github.io/harvest/latest
Apache License 2.0

Pollers die with unknown: os: process already finished #173

Closed florianmulatz closed 3 years ago

florianmulatz commented 3 years ago

Describe the bug
Pollers die with "unknown: os: process already finished" and do not produce any data anymore.

Environment
Harvest version: 21.05.1-1 (commit 2211c00) (build date 2021-05-21T01:28:12+0530) linux/amd64
Command line arguments used: bin/harvest status --config /opt/git-repos/config/harvest/harvest.yml
OS: Red Hat Enterprise Linux release 8.3 (Ootpa)
Install method: rpm-package
ONTAP version: NetApp Release 9.5 through 9.8

To Reproduce
Just let the pollers run for a while.

Expected behavior
Pollers should not die.

Actual behavior
Pollers die with no log entry and show "unknown: os: process already finished".

Possible solution, workaround, fix
None found.

Additional context
harvest status output:

[root@vihlv0230 harvest]# bin/harvest status --config /opt/git-repos/config/harvest/harvest.yml
Datacenter            Poller                PID        PromPort        Status
+++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
...
VIH                   vihsdnclu51           2042                       unknown: os: process already finished
KLU                   klusdnclu51           2832                       unknown: os: process already finished
...
+++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
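A quick way to tell whether those processes are really gone, or whether only the status probe is failing, is to check the reported PIDs directly. This is a generic OS check, not a Harvest command; PIDs 2042 and 2832 are taken from the status output above.

# kill -0 sends no signal, it only tests whether the PID exists and can be signalled
kill -0 2042 && echo "2042 still running" || echo "2042 gone"
kill -0 2832 && echo "2832 still running" || echo "2832 gone"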

Last log entries from klusdnclu51 (filtered for errors and warnings):

[root@vihlv0230 harvest]# tail -n 100 /var/log/harvest/poller_klusdnclu51.log  | egrep "(error|warning)" | grep -v "not found in cache"
2021/06/15 00:14:40  (error  ) (collector) (ZapiPerf:FCVI): instance request: api request rejected => Counter collection is disabled
2021/06/15 00:19:40  (error  ) (collector) (ZapiPerf:FCVI): instance request: api request rejected => Counter collection is disabled
2021/06/15 00:23:36  (error  ) (collector) (Zapi:Subsystem): context deadline exceeded (Client.Timeout or context cancellation while reading body)
2021/06/15 00:23:36  (warning) (collector) (Zapi:Status): target unreachable, entering standby mode (retry to connect in 4 s)
2021/06/15 00:23:36  (warning) (collector) (Zapi:SnapMirror): target unreachable, entering standby mode (retry to connect in 4 s)
2021/06/15 00:23:40  (warning) (collector) (ZapiPerf:LIF): target unreachable, entering standby mode (retry to connect in 4 s)
2021/06/15 00:23:40  (warning) (collector) (ZapiPerf:HeadroomAggr): target unreachable, entering standby mode (retry to connect in 4 s)
2021/06/15 00:23:40  (warning) (collector) (ZapiPerf:ISCSI): target unreachable, entering standby mode (retry to connect in 4 s)
2021/06/15 00:23:40  (warning) (collector) (ZapiPerf:WAFL): target unreachable, entering standby mode (retry to connect in 4 s)
2021/06/15 00:23:40  (warning) (collector) (ZapiPerf:FcpLif): target unreachable, entering standby mode (retry to connect in 4 s)
2021/06/15 00:23:40  (warning) (collector) (ZapiPerf:TokenManager): target unreachable, entering standby mode (retry to connect in 4 s)
2021/06/15 00:23:44  (warning) (collector) (ZapiPerf:TokenManager): lagging behind schedule 138.436µs
2021/06/15 00:23:44  (warning) (collector) (ZapiPerf:WAFL): lagging behind schedule 471.246µs
2021/06/15 00:23:44  (warning) (collector) (ZapiPerf:FcpLif): lagging behind schedule 186.054µs
2021/06/15 00:23:44  (warning) (collector) (ZapiPerf:ISCSI): lagging behind schedule 84.046µs
2021/06/15 00:23:44  (warning) (collector) (ZapiPerf:HeadroomAggr): lagging behind schedule 581.109µs
2021/06/15 00:23:44  (warning) (collector) (ZapiPerf:LIF): lagging behind schedule 318.103µs
2021/06/15 00:23:45  (warning) (collector) (Zapi:Status): lagging behind schedule 149.633µs
2021/06/15 00:24:40  (error  ) (collector) (ZapiPerf:FCVI): instance request: api request rejected => Counter collection is disabled

It looks like the last entry says that the API request was rejected and it is not retried. There are some log entries at the same time in the audit log of the filer, but they are not triggered by Harvest; they come from one of our OCUM servers:

Jun 15 00:24:03 klusdnclu51-04.klu.infineon.com klusdnclu51-04: klusdnclu51-04: 0000001b.0026032d 02dd560a Tue Jun 15 2021 00:24:01 +02:00 [kern_audit:info:2036] 8503e80000747556 :: klusdnclu51:ontapi :: 10.136.104.160:61974 :: klusdnclu51-02:admin :: error-unknown-api :: Error: Unable to find API: ldap-check-get-iter on node vserver klusdnclu51-02
Jun 15 00:24:03 klusdnclu51-04.klu.infineon.com klusdnclu51-04: klusdnclu51-04: 0000001b.0026032f 02dd560a Tue Jun 15 2021 00:24:01 +02:00 [kern_audit:info:2036] 8503e80000747558 :: klusdnclu51:ontapi :: 10.136.104.160:61976 :: klusdnclu51-03:admin :: error-unknown-api :: Error: Unable to find API: ldap-check-get-iter on node vserver klusdnclu51-03
Jun 15 00:24:03 klusdnclu51-03.klu.infineon.com klusdnclu51-03: klusdnclu51-03: 0000001b.002a81d5 02dd22a3 Tue Jun 15 2021 00:24:01 +02:00 [kern_audit:info:2036] 8503e80000747551 :: klusdnclu51:ontapi :: 10.136.104.160:61969 :: klusdnclu51-01:admin :: error-unknown-api :: Error: Unable to find API: ldap-check-get-iter on node vserver klusdnclu51-01
Jun 15 00:24:03 klusdnclu51-02.klu.infineon.com klusdnclu51-02: klusdnclu51-02: 00000020.004b1a3f 159cb1e1 Tue Jun 15 2021 00:24:01 +02:00 [kern_audit:info:2334] 8503e8000074754d :: klusdnclu51:ontapi :: 10.136.104.160:61965 :: klusdnclu51-04:admin :: error-unknown-api :: Error: Unable to find API: ldap-check-get-iter on node vserver klusdnclu51-04

Unfortunately it is not "restartable" anymore in this state.

It just does nothing when I try to do

bin/harvest restart klusdnclu51 --config /opt/git-repos/config/harvest/harvest.yml

and gives an error

can't verify status of [klusdnclu51]: kill poller and try again

A manual stop followed by a start produces the same error message.
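One thing that can be tried before starting again is to clear out any stale poller process by hand. These are generic commands, not a documented Harvest recovery procedure; the PID is the one reported by harvest status, and it is not confirmed here that this actually recovers the poller.

kill -0 2832 2>/dev/null && kill 2832   # terminate the old poller process if it is still around
bin/harvest start klusdnclu51 --config /opt/git-repos/config/harvest/harvest.yml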

Cheers

rahulguptajss commented 3 years ago

@florianmulatz we released 21.05.2 a few hours ago. This version has enhanced logging and better handling of poller state. Could you try this version and see if we get more information in the logs?
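For reference, on an rpm-based install the upgrade is a plain package update; a sketch assuming the 21.05.2 RPM has already been downloaded from the release page (the exact artifact filename may differ):

rpm -Uvh harvest-21.05.2-1.x86_64.rpm   # upgrade in place; the config under /opt/git-repos is not touched by the package
bin/harvest restart klusdnclu51 --config /opt/git-repos/config/harvest/harvest.yml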

florianmulatz commented 3 years ago

> @florianmulatz we released 21.05.2 a few hours ago. This version has enhanced logging and better handling of poller state. Could you try this version and see if we get more information in the logs?

Hey,

I've installed the new version now. I'll report back tomorrow if there are any new insights. Cheers

florianmulatz commented 3 years ago

Can confirm now that my pollers do not die anymore with the new version.