NetApp / harvest

Open-metrics endpoint for ONTAP and StorageGRID

https://netapp.github.io/harvest/latest

Apache License 2.0

svm_nfs_ops is reporting Billion IOPs for an SVM with 10 nodes. #1206

Closed. jmg011 closed this issue 2 years ago

jmg011 commented 2 years ago

I noticed the svm_nfs_ops metric reporting a billion IOPs for an SVM with 10 nodes. Sometimes it also shows half a billion negative IOPs.

I am running the latest major release of Harvest:

bin/harvest version
harvest version 22.05.0-1 (commit 2bc2942) (build date 2022-05-11T07:57:16-0400) linux/amd64

[Graph: 1-day time series of the svm_nfs_ops metric for a single SVM with 10 nodes. The spikes are a billion IOPs.]

Can you help check whether this is a Harvest bug? OCUM shows 1 million IOPs for the SVM over the same period, while Prometheus shows 1 billion IOPs.

rahulguptajss commented 2 years ago

@jmg011 There is a similar issue about negative counters related to SVM NFS v3: #762. Could you confirm whether the numbers reported in System Manager are in the millions or billions?

We have handled negative counters in #1205 by changing them to 0 for our upcoming release.
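To make "changing negative counters to 0" concrete, here is a minimal Python sketch. It is only an illustration and not Harvest's actual code; the function name `clamp_negative` and the rate values are hypothetical.

```python
def clamp_negative(rates):
    """Replace any negative rate with 0.0 and leave the others unchanged."""
    return [max(r, 0.0) for r in rates]


# Hypothetical per-second rates, one of which came out negative.
observed = [1000.0, -11000.0, 13000.0]
print(clamp_negative(observed))
```

Note that clamping in this way only removes the negative value; any unusually large positive value that follows it is passed through unchanged.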

cgrinds commented 2 years ago

hi @jmg011 can you also share the ONTAP version and whether these are NFS v3, v4, or v4.1 shares?

jmg011 commented 2 years ago

@rahulguptajss When you say System Manager, do you mean OCUM? OCUM reports in the millions. Prometheus scraped billions from the Harvest exporter.

@cgrinds Version: NetApp Release 9.8P9 & NFS v3

cgrinds commented 2 years ago

Thanks for the ONTAP and NFS version information. Our suspicion is that this is an ONTAP counter bug, since we have a few customer reports of negative counters with NFS. The Harvest logic for handling NFS counters is the same as for the other performance counters.
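As a rough illustration of why a misbehaving raw counter can show up as both negative values and large spikes, the sketch below derives per-second rates from cumulative counter samples. It assumes a simple delta-over-time calculation; the function name `rates` and the sample values are hypothetical and are not taken from Harvest or from the logs in this issue.

```python
def rates(samples):
    """Convert (timestamp_seconds, cumulative_ops) samples into per-second rates."""
    out = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        out.append((c1 - c0) / (t1 - t0))
    return out


# Hypothetical readings at 60 s intervals. The third reading is lower than
# the second, as if the raw counter had briefly gone backwards.
samples = [(0, 1_000_000), (60, 1_060_000), (120, 400_000), (180, 1_180_000)]
print(rates(samples))  # [1000.0, -11000.0, 13000.0]
```

In this made-up data the steady rate is 1000 ops/s, yet one bad reading yields a negative rate followed by an inflated positive one, because the next delta is measured against the bad baseline.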

System Manager is the UI that can be used to manage a cluster.

cgrinds commented 2 years ago

hi @jmg011

Have you made any changes to the conf/zapiperf/cdot/9.8.0/nfsv3.yaml template that captures this metric?
Would it be possible to monitor this cluster from a separate poller that captures trace logs? Something like this:

bin/poller --promPort 19002 --poller $poller-name --collectors ZapiPerf --objects NFSv3 --loglevel 0 2>&1 | tee nfs.txt

Let that run for 30 minutes or so and then email the nfs.txt file to ng-harvest-files@netapp.com

jmg011 commented 2 years ago

@cgrinds No changes to the template. I will start the poller in my dev environment to capture logs and will send them to ng-harvest-files@netapp.com today.

cgrinds commented 2 years ago

Thanks again for the log files @jmg011; they were very helpful. We're working on some improvements in this area and will ping you once they have made it through CI and integration tests.

rahulguptajss commented 2 years ago

This issue is now fixed in the main branch. The solution is to skip any negative counters, as well as the spikes generated by this kind of data.
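One way to picture the skipping approach is the Python sketch below. It is an assumption about the general idea, not the code that was merged: the function name `rates_skipping_bad`, the choice to also drop the interval right after a negative delta, and the sample values are all hypothetical.

```python
def rates_skipping_bad(samples):
    """Per-second rates from (timestamp_seconds, cumulative_ops) samples.

    An interval with a negative delta is reported as None (skipped), and so
    is the interval immediately after it, since that delta is measured
    against the suspect reading and would otherwise appear as a spike.
    """
    out = []
    skip_next = False
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        delta = c1 - c0
        if delta < 0:
            out.append(None)   # negative counter delta: skip it
            skip_next = True   # the following interval is suspect too
            continue
        if skip_next:
            out.append(None)   # likely spike caused by the bad baseline
            skip_next = False
            continue
        out.append(delta / (t1 - t0))
    return out


# Hypothetical readings at 60 s intervals with one bad value at t=120.
samples = [(0, 1_000_000), (60, 1_060_000), (120, 400_000),
           (180, 1_180_000), (240, 1_240_000)]
print(rates_skipping_bad(samples))  # [1000.0, None, None, 1000.0]
```

Reporting a gap (None) instead of 0 keeps the skipped intervals from being mistaken for real idle periods; whether to emit a gap or a zero is a design choice this sketch does not settle.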

cgrinds commented 2 years ago

hi @jmg011 when you get a chance, could you grab the nightly build and see if our latest fix addresses your billions problem? Thanks!

rahulguptajss commented 1 year ago

Verified the negative counter logic in 22.11.