@faguayot Could you please resend the log files? We were only able to extract one gz file named poller_sces1p1_03-2023-05-24T19-10-00.664.log, and it appears that the rest of the files are corrupted.
@rahulguptajss I've tried to attach the files again. Let me know if they are OK or if you are still seeing the corruption problem.
@faguayot Thanks for sharing the log files again. We have received them and did an initial investigation. We found a few errors:
1. Duplicate instance key error: @rahulguptajss has created an issue for this and we will look into it.
{"level":"error","Poller":"sces1p3_03","plugin":"ZapiPerf:Volume","object":"Volume","error":"duplicate instance key => .","key":".","caller":"volume/volume.go:88","time":"2023-05-24T10:28:02+02:00","message":"Failed to create new instance"}
2. System busy errors: These logs exist in both pollers and started to occur for poller xxxx_03_03 around 12:34 (as you mentioned). These logs come directly from ONTAP and Harvest shows them as-is.
{"level":"info","Poller":"sces1p3_03","collector":"ZapiPerf:LIF","error":"API request rejected => System busy: 17 requests on table \"perf_object_get_instances\" have been pending for 2 seconds. The last completed call took 94 seconds.","task":"data","caller":"collector/collector.go:365","time":"2023-05-24T12:34:02+02:00","message":"API rejected, entering standby mode"}
{"level":"info","Poller":"sces1p3_03","collector":"ZapiPerf:NicCommon","error":"API request rejected => System busy: 17 requests on table \"perf_object_get_instances\" have been pending for 2 seconds. The last completed call took 94 seconds.","task":"data","caller":"collector/collector.go:365","time":"2023-05-24T12:34:02+02:00","message":"API rejected, entering standby mode"}
...
While searching for the System busy errors on the ONTAP side, we found burt 1387773, which has the same error stack.
Sharing the description details and the Public Review (PR) mentioned in the burt:
DESCRIPTION:
When counter manager appears to be hung which usually manifests as a problem in
terms of gathering perf data or polling perf objects, following kind of messages
from mgwd would help in ascertaining that the issue is indeed in counter manager
daemon:
>>Attempt to get cm object resource_headroom_cpu failed:System busy: 7 requests on
table "perf_object_get_instances" have been pending for 13510621 seconds. The
last completed call took 0 seconds.
PR
------------------------
%%% TITLE: Blocked object and transport threads hamper counter manager's functionality
%%% DESCRIPTION:
Performance NetApp Manageability SDK (NMSDK) requests appear to get blocked after
hitting the storage controller with multiple performance NMSDKs. This blockage occurs
because the transport threads indefinitely waits on a mutex which, on the other
hand, is held by a dead thread.
%%% WORKAROUND:
To temporarily resolve the issue, restart of the counter manager daemon(CMD) process on
each node.
%%% NOTES:
None
--------------
3. NFSv4-related issues: These logs exist in both pollers. We need to check whether this is related to the Counter Manager issue mentioned above or whether the NFSv4 object simply does not exist on the cluster, which can be a valid scenario.
{"level":"info","Poller":"sces1p3_03","collector":"ZapiPerf:NFSv42Node","error":"API request rejected => Object \"nfsv4_2:node\" was not found.","task":"counter","caller":"collector/collector.go:365","time":"2023-05-24T11:08:09+02:00","message":"API rejected, entering standby mode"}
{"level":"info","Poller":"sces1p3_03","collector":"ZapiPerf:NFSv42","error":"API request rejected => Object \"nfsv4_2\" was not found.","task":"counter","caller":"collector/collector.go:365","time":"2023-05-24T11:08:09+02:00","message":"API rejected, entering standby mode"}
@faguayot For issue 1, the duplicate instance key error: could you please share the ZAPI responses by invoking these commands and sending the output to our email address?
./bin/harvest zapi -p sces1p3_03 show data --object volume
./bin/harvest zapi -p sces1p1_03 show data --object volume
Sure @Hardikl, I will share the information by email.
Thanks.
I caught a new occurrence of the issue happening on one cluster.
hi @faguayot if you have the logs for that poller, please email them to us so we can take a look.
Hardik did a nice job analyzing the last set of logs you sent and found a logging issue. Let me explain what the problem is and why it might be related to the gaps shown above.
The Harvest perf collector works like this:
1. Poll ONTAP for the list of instances of the object being collected (for example, the volumes on the cluster).
2. Poll ONTAP for the performance counters of each of those instances.
The problem Hardik found in your logs occurs between steps 1 and 2. An instance we collected at step 1 was deleted by the time we got to step 2. That means at step 2, Harvest asks ONTAP for the counters of an instance that no longer exists. ONTAP returns an empty record when that happens, and Harvest was noisier about logging this than it should have been. Hardik is working on a fix for that.
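For concreteness, here is a minimal sketch of that two-step flow. It is not Harvest's actual code; the helper names (pollInstances, pollData) and the volume keys are made up purely to illustrate how an instance deleted between the two steps comes back as an empty record.

```go
// Minimal sketch of the two-step perf flow described above.
// Step 1 fetches the instance list, step 2 fetches counters per instance.
// An instance deleted between the two steps yields an empty record.
package main

import "fmt"

type record struct {
	key      string
	counters map[string]float64
}

// pollInstances stands in for step 1: ask ONTAP for the current instance keys.
func pollInstances() []string {
	return []string{"vol1", "vol2", "vol3"}
}

// pollData stands in for step 2: ask ONTAP for the counters of each known instance.
func pollData(keys []string) []record {
	out := make([]record, 0, len(keys))
	for _, k := range keys {
		if k == "vol2" { // pretend vol2 was deleted between step 1 and step 2
			out = append(out, record{key: k}) // ONTAP returns an empty record
			continue
		}
		out = append(out, record{key: k, counters: map[string]float64{"read_ops": 42}})
	}
	return out
}

func main() {
	known := pollInstances()
	for _, r := range pollData(known) {
		if len(r.counters) == 0 {
			// the fix skips these quietly instead of logging them as errors
			fmt.Printf("skip %s: instance no longer exists\n", r.key)
			continue
		}
		fmt.Printf("export %s: %v\n", r.key, r.counters)
	}
}
```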
Is it possible that some of the gaps you're seeing on this cluster are because instances are being created/deleted? Depending on the dashboard, and on whether the panel(s) use topk, I think instance deletions could show up as a gap: on the left side of the gap the instance was included in the top 5, then it was deleted, and on the right side of the gap a new instance was picked for the top 5.
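To illustrate the topk point, here is a small sketch. The dashboards actually use Prometheus' topk; this only mimics the selection with made-up volume names and values, showing how a deletion changes which series a top-N panel draws, which renders as a gap.

```go
// Illustrative sketch of why deletions can look like gaps on topk panels.
package main

import (
	"fmt"
	"sort"
)

// top2 returns the two highest-valued keys of a snapshot.
func top2(snapshot map[string]float64) []string {
	keys := make([]string, 0, len(snapshot))
	for k := range snapshot {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return snapshot[keys[i]] > snapshot[keys[j]] })
	if len(keys) > 2 {
		keys = keys[:2]
	}
	return keys
}

func main() {
	before := map[string]float64{"vol_a": 900, "vol_b": 800, "vol_c": 100}
	after := map[string]float64{"vol_b": 820, "vol_c": 120, "vol_d": 950} // vol_a deleted, vol_d created

	fmt.Println("left of the gap :", top2(before)) // [vol_a vol_b]
	fmt.Println("right of the gap:", top2(after))  // [vol_d vol_b]
	// vol_a's series simply stops and vol_d's starts, so a panel pinned to the
	// previous top-N members shows a gap across the deletion.
}
```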
Hello @cgrinds,
It is great that you found the problem given how much information you had to review. I understood your explanation, Chris, thanks for always making it really clear. If I am right, this issue happens with every object type (Aggregate, Volume, Node, Lun, etc.) for which performance metrics/counters are gathered. Is that correct?
I will wait until you have the fix and then check whether the behaviour happens again. Thanks.
Hi @faguayot
While @Hardikl works on a potential fix for this issue, can you email the most recent logs for one of your pollers with gaps?
Any idea if that cluster may have deletions during the time of the gaps? And yes, to your question, the instance removal logic is the same for all object types.
Hello @cgrinds, I don't remember which cluster showed this behaviour, and I don't know during which period of time we saw these cuts. Apologies, but I didn't take note of this info. If I see the same behaviour again, I will share the logs and that info right away. Also, I don't understand what you are referring to about the deletions. Where?
Thanks.
Hi @faguayot
Sounds good! I can't tell which dashboard panel you shared in comment https://github.com/NetApp/harvest/issues/2113#issuecomment-1589911821 but for the sake of discussion, let's assume that was on the volume dashboard.
If it was, my question was, do you know if volumes were deleted during the time that you see the gaps in the panel?
Sometimes folks are running automation or workloads where deletions are expected, other times they know deletions are not expected, and most folks don't really know since they aren't sure how the cluster is being used at any given time.
I was trying to ask you in the earlier comment if you knew whether the cluster in question might have object deletions going on during the time you see gaps. If you aren't sure, that's perfectly fine.
Hi @cgrinds ,
I think I now understand your question about the deleted objects: Harvest is continuously gathering information from the cluster, and you want to know whether anybody could have deleted those objects during the cuts. My answer is no. We manage the creation of the volumes, LUNs, and all the other objects in an ONTAP cluster, so we know when we are deleting them; but at the upper storage levels we don't know what people do, i.e. whether or not they delete data inside those objects.
I hope my clarification gives you the information you asked for.
Thanks.
Hi @faguayot Hardik's changes are in nightly. If you want to try these out before the next release, you can grab the latest nightly build.
Feel free to email the most recent logs for any of your pollers with gaps.
Hello @cgrinds
Yes, I will try to test it tomorrow, but I don't know when I will run into the problem again, if I do.
Thanks for the heads-up.
@faguayot any updates on this issue?
Hello @cgrinds ,
I haven't seen the cuts during this time, but I don't know whether the problem disappeared with a release or whether I simply didn't notice it. If you want, we can close the issue, and if it appears again we will review it.
Thanks.
Thanks for the follow-up, @faguayot. That's a good idea; we'll close and reopen if you see this again.
Hello,
Recently we have been experiencing periods in which there are cuts in the collection of data. These cuts happened last night.
But I have caught other days on which this happens:
I don't know whether this could be related to the same issue dealt with here or whether it is something different.
@faguayot What is the Harvest version? Could you share the logs with ng-harvest-files@netapp.com?
https://netapp.github.io/harvest/24.02/help/log-collection/
The Harvest version is the latest available:
harvest version 24.02.0-1 (commit 8f9201cc) (build date 2024-02-21T10:01:19-0500) linux/amd64
Regarding the logs, I have sent only today's logs because they are heavy.
@faguayot Thanks. Logs received. Could you also share which metrics have been cut? Please share the metric names. Are particular objects affected, such as volumes or SVMs, or do all perf metrics have cuts? Does this happen on all pollers, or only specific ones? Please feel free to email this information as well.
@rahulguptajss I am focusing on the last time I found these cuts, which was 19 April from around 4:28 am to 4:36 am approximately. I have checked the rest of the pollers and at that time we don't see any cuts in the collection, so it seems only one poller has the problem. I will send you that poller's information by email. The metrics where we have seen those cuts are:
In addition to the node metrics, I think there are other metrics missing in the Power dashboard.
Let me know if you need something else.
Thank you, @faguayot, for the detailed information. I have reviewed the logs for the relevant poller and noticed that the PollInstance task (ZapiPerf) for the system node object, which runs every 10 minutes, has failed. It only recovered on its next scheduled run, 10 minutes later, which explains the gap you are observing in the metrics. The error mentioned below is likely due to ONTAP being unable to process the request. This occurrence is only noted once in the logs, which is why you observe a single interruption.
{"level":"error","Poller":"XXXXXX","collector":"ZapiPerf:SystemNode","error":"connection error => Post \"https://servlets/netapp.servlets.admin.XMLrequest_filer\": EOF","request":"perf-object-instance-list-info-iter","batchTag":"","caller":"zapiperf/zapiperf.go:1610","time":"2024-04-19T04:24:17+02:00","message":"InvokeBatchRequest failed"}
There are also errors related to the disk and shelf requests, as shown below, where timeouts occurred. These are due to the ZAPI storage-shelf-info-get-iter and the REST API api/private/cli/storage/shelf being slow, resulting in client timeouts. This also accounts for the gap in the Power dashboard, as the power metrics are generated by the code in the disk.go plugin. Could you please run the following ONTAP CLI command on the cluster monitored by the relevant poller to see whether it is fast or slow?
storage shelf show
{"level":"error","Poller":"XXXX","collector":"ZapiPerf:Disk","error":"context deadline exceeded (Client.Timeout or context cancellation while reading body)","plugin":"Disk","caller":"collector/collector.go:444","time":"2024-04-19T03:59:03+02:00"}
{"level":"error","Poller":"XXXX","plugin":"Rest:Health","object":"Health","error":"error making request connection error Get \"/api/private/cli/storage/shelf?fields=error_type%2Cerror_severity%2Cerror_text&return_records=true&start.shelf=xxxxx&start.shelf_uuid=xxxxx\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","href":"api/private/cli/storage/shelf?return_records=true&fields=error_type,error_severity,error_text","caller":"collectors/commonutils.go:54","time":"2024-04-19T10:52:09+02:00","message":"Failed to fetch data"}
Below is a summary of the API response times for this poller. The NFSLock API takes over 38 seconds to fetch approximately 19,000 records, and storage-shelf-info-get-iter takes about 18 seconds. These and other APIs are taking a considerable amount of time. Could you check whether this cluster is also being monitored by other tools, which might be contributing to the load and the timeouts on these APIs?
Collector |TotalPollTimeMs |AvgApiMs |MaxApiMs |AvgCalcMs |AvgParseMs |AvgPluginMs |AvgInstances |AvgMetrics |Rows
Zapi:NFSLock |38850 |38090 |169528 |0 |760 |0 |19710 |229747 |183
Zapi:QosWorkload |6202 |6172 |76861 |0 |30 |0 |2141 |6423 |208
Zapi:QosPolicyFixed |1150 |1087 |14185 |0 |61 |2 |2126 |17008 |209
ZapiPerf:Workload |10434 |10041 |103668 |6 |387 |0 |2082 |56214 |206
ZapiPerf:Disk |47041 |1818 |48091 |1 |129 |45093 |528 |10032 |604
Zapi:Sensor |15587 |15106 |60037 |0 |35 |446 |1065 |9513 |206
Zapi:Disk |2870 |2837 |16103 |0 |33 |0 |336 |6384 |208
ZapiPerf:LIF |574 |565 |12179 |0 |9 |0 |87 |522 |625
ZapiPerf:Qtree |825 |818 |12012 |0 |7 |0 |71 |284 |625
ZapiPerf:NicCommon |568 |558 |13730 |0 |10 |0 |66 |726 |625
ZapiPerf:Volume |452 |433 |15653 |0 |18 |1 |65 |1755 |625
ZapiPerf:HostAdapter |323 |321 |10269 |0 |2 |0 |48 |96 |615
Zapi:SecurityCert |1410 |820 |10513 |0 |7 |583 |131 |786 |208
Rest:DIMM |1303 |1303 |12514 |0 |0 |0 |96 |480 |208
Zapi:AggregateSpare |1156 |1155 |11028 |0 |1 |0 |94 |376 |208
Zapi:LIF |2693 |2688 |14628 |0 |5 |0 |87 |1479 |208
Zapi:SecurityAccount |690 |688 |11747 |0 |2 |0 |85 |555 |208
Zapi:Qtree |2518 |1579 |12702 |0 |2 |937 |71 |438 |208
Zapi:Volume |5545 |2533 |14128 |0 |13 |2999 |65 |3167 |209
ZapiPerf:WorkloadVolume |662 |652 |12179 |0 |10 |0 |58 |1566 |209
ZapiPerf:VolumeSvm |397 |396 |9687 |0 |1 |0 |10 |100 |625
ZapiPerf:HeadroomAggr |358 |357 |10016 |0 |1 |0 |7 |315 |625
Rest:NetRoute |652 |652 |11772 |0 |0 |0 |19 |138 |208
ZapiPerf:HeadroomCPU |462 |461 |11721 |0 |1 |0 |6 |270 |626
ZapiPerf:NFSv4Node |553 |549 |10554 |0 |4 |0 |6 |486 |625
ZapiPerf:NFSv3Node |594 |591 |8424 |0 |3 |0 |6 |312 |625
ZapiPerf:ExtCacheObj |456 |455 |12143 |0 |1 |0 |6 |102 |625
ZapiPerf:NFSv41Node |563 |558 |12038 |0 |5 |0 |6 |654 |625
ZapiPerf:Rwctx |475 |474 |13165 |0 |1 |0 |6 |780 |625
ZapiPerf:WAFL |309 |308 |7876 |0 |1 |0 |6 |528 |625
ZapiPerf:VolumeNode |701 |699 |13794 |0 |2 |0 |6 |216 |625
ZapiPerf:SystemNode |543 |541 |15255 |0 |2 |0 |6 |294 |615
Zapi:Shelf |18411 |18411 |48420 |0 |0 |0 |15 |135 |209
Zapi:QosPolicyAdaptive |783 |782 |11760 |0 |1 |0 |15 |135 |208
Zapi:Subsystem |2166 |2166 |13348 |0 |0 |0 |13 |52 |208
Zapi:Aggregate |10623 |8877 |25303 |0 |2 |1744 |12 |552 |209
ZapiPerf:CopyManager |666 |666 |12348 |0 |0 |0 |4 |20 |625
ZapiPerf:NFSv3 |522 |520 |10457 |0 |2 |0 |4 |688 |625
ZapiPerf:NFSv41 |687 |683 |14887 |0 |4 |0 |4 |916 |625
Zapi:SVM |1127 |695 |11724 |0 |0 |432 |12 |56 |208
ZapiPerf:NFSv4 |374 |371 |14616 |0 |3 |0 |4 |804 |616
Zapi:SnapshotPolicy |661 |661 |11740 |0 |0 |0 |11 |36 |208
Zapi:Node |833 |833 |10912 |0 |0 |0 |6 |114 |208
Zapi:AggregateEfficiency |871 |871 |11994 |0 |0 |0 |6 |48 |208
Zapi:EmsDestination |689 |689 |11724 |0 |0 |0 |4 |11 |208
Zapi:NtpServer |622 |622 |11706 |0 |0 |0 |3 |6 |208
Rest:SupportAutoUpdate |654 |654 |11696 |0 |0 |0 |1 |1 |208
Zapi:ClusterPeer |669 |669 |11731 |0 |0 |0 |1 |3 |208
Zapi:Status |799 |799 |10529 |0 |0 |0 |1 |1 |208
Rest:SecuritySsh |691 |691 |11688 |0 |0 |0 |1 |3 |208
Zapi:Security |1297 |644 |11696 |0 |0 |653 |1 |1 |208
Zapi:Support |728 |728 |9874 |0 |0 |0 |1 |4 |208
Rest:Health |91368 |784 |12099 |0 |0 |90584 |1 |2 |203
Ems:Ems |1225 |1225 |24625 |0 |0 |0 |0 |0 |208
Hi @rahulguptajss, thanks for the detailed analysis. We are having performance issues with this poller, so what you said about load and response times makes sense.
The command you shared took some time to show its result:
storage shelf show
I have stopped Harvest for this poller in the other installations we have, to avoid that overload, since Harvest is the only tool accessing the cluster apart from OCUM/Active IQ.
@faguayot Sure. I believe that once the root cause of the storage shelf show slowness is resolved, we can consider enabling Harvest again for this poller. Harvest supports an option called jitter, which distributes API requests to ONTAP over a scheduled range; this helps balance the load by avoiding all of Harvest's requests being sent at the same time. If you think this might be beneficial, please let us know and we can provide the necessary steps to enable it.
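For intuition only, here is a conceptual sketch of the jitter idea. It is not the actual Harvest configuration (the discussion linked later in the thread covers the real setup steps), and the window size and collector names are made up:

```go
// Sketch: jitter delays each collector's poll by a random offset within a
// window so requests don't all hit ONTAP at the same instant.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	const jitterWindow = 30 * time.Second // assumed window for illustration
	collectors := []string{"ZapiPerf:Volume", "ZapiPerf:Disk", "Zapi:Shelf"}

	for _, c := range collectors {
		// without jitter, every collector would fire at t=0 of each schedule;
		// with jitter, each one starts at a random point inside the window.
		delay := time.Duration(rand.Int63n(int64(jitterWindow)))
		fmt.Printf("%-16s first poll delayed by %v\n", c, delay.Round(time.Second))
	}
}
```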
I didn't know about this option and it sounds good. Could you share how I can use it? Is it an option for all pollers, or is it configurable only per poller? I was thinking it would be beneficial for all pollers in general.
Sure. Can you share which installation method you are using for Harvest: RPM, DEB, native, container, or Nabox?
Yes, we are installing harvest by RPM.
Thanks. Added here. https://github.com/NetApp/harvest/discussions/2856
Problem
Hello, we have been experiencing cuts in the collection of metrics from some clusters for some time. I don't think it always happens on the same clusters, and I haven't found the pattern yet. It occurs during a period of time for one cluster or a few clusters in particular.
I've put the cluster where I detected the problem in the Poller box below. As I said, I previously saw the same behavior on other clusters, but back then I thought the missing data could be related to a restart of the Harvest service, since the time in the graph was shorter.
Metrics are missing for the blue line at the beginning (poller name: sces1p3_03) and later for the yellow line (poller name: sces1p2_03).
The last point for the blue line is 24 May at 12:38, and for the yellow line 14:50 the same day. The blue line started drawing points again on 25 May at 12:35 and the yellow line on 26 May at 15:01.
Configuration
Poller
sces1p2_03,sces1p3_03
Version
harvest version 23.05.0-1 (commit 6f74c7a5) (build date 2023-05-03T08:08:46-0400) linux/amd64
Poller logs
OS and platform
Red Hat Enterprise Linux release 8.4 (Ootpa)
ONTAP or StorageGRID version
NetApp Release 9.10.1P8
Additional Context
No response
References
No response