NetApp / harvest

Open-metrics endpoint for ONTAP and StorageGRID
https://netapp.github.io/harvest/latest
Apache License 2.0

Cuts in the collection of metrics of some cluster #2113

Closed: faguayot closed this issue 10 months ago

faguayot commented 1 year ago


Problem

Hello, we have been experiencing cuts in the collection of metrics from some clusters for some time. I don't think it always happens on the same clusters, and I haven't found a pattern yet. It occurs for a period of time on one cluster or a few clusters in particular.

In the Poller field below I've put the cluster where I detected the problem this time, but as I said, I saw the same behavior on other clusters previously; back then I thought the missing data might be related to a restart of the Harvest service, since the time in the graph was shorter.

Missing metrics for the blue line at the beginning (poller name: sces1p3_03) and for the yellow line (poller name: sces1p2_03) later. (screenshot)

The last point for the blue line is on 24 May at 12:38 and for the yellow line at 14:50 the same day. The blue line started drawing points again on 25 May at 12:35 and the yellow one on 26 May at 15:01.

Configuration

Tools:
    grafana_api_token: -REDACTED-
Exporters:
    prometheus:
        exporter: Prometheus
        addr: -REDACTED-
        master: True
        port_range: 12989-13010
    influx:
        exporter: InfluxDB
        url: '-REDACTED-'
        addr: '-REDACTED-'
        bucket: ''
        org: ''
        token: '-REDACTED-'
Defaults:
    client_timeout: 60
    auth_style: basic_auth
    username: -REDACTED-
    password: -REDACTED-
    use_insecure_tls: true
    log_max_bytes: 10000000
    log_max_files: 10
    prefer_zapi: yes
    collectors:
        - Zapi
        - ZapiPerf
        - Ems
        - Rest
        - RestPerf
    exporters:
        - prometheus
        - influx
Pollers:
    unix:
        datacenter: local
        addr: -REDACTED-
        collectors:
            - Unix
        exporters:
            - prometheus
    SCES3T000:
        datacenter: technical
        addr: -REDACTED-
        exporters:
            - prometheus
    sces1t0_01:
        datacenter: technical
        addr: -REDACTED-
        exporters:
            - prometheus
            - influx
    sces1p1_03:
        datacenter: production
        addr: -REDACTED-
        exporters:
            - prometheus
            - influx
    sces1p2_03:
        datacenter: production
        addr: -REDACTED-
        exporters:
            - prometheus
            - influx
    sces1p3_03:
        datacenter: production
        addr: -REDACTED-
        exporters:
            - prometheus
            - influx
    SCES1P000:
        datacenter: production
        addr: -REDACTED-
        exporters:
            - prometheus
            - influx
    sces1p1_02:
        datacenter: production
        addr: -REDACTED-
        exporters:
            - prometheus
            - influx
    sces1p2_02:
        datacenter: production
        addr: -REDACTED-
        exporters:
            - prometheus
            - influx
    sces1p3_02:
        datacenter: production
        addr: -REDACTED-
        exporters:
            - prometheus
            - influx
    SCES1T200:
        datacenter: technical
        addr: -REDACTED-
        exporters:
            - prometheus
            - influx
    SCES1P200:
        datacenter: production
        addr: -REDACTED-
        exporters:
            - prometheus
            - influx
    sces1p0_04:
        datacenter: production
        addr: -REDACTED-
        exporters:
            - prometheus
            - influx
    sces1p1_01:
        datacenter: production
        addr: -REDACTED-
        exporters:
            - prometheus
            - influx
    sces1p2_01:
        datacenter: production
        addr: -REDACTED-
        exporters:
            - prometheus
            - influx
    sces1p3_01:
        datacenter: production
        addr: -REDACTED-
        exporters:
            - prometheus
            - influx
    SCES3P200:
        datacenter: brs
        addr: -REDACTED-
        exporters:
            - prometheus
            - influx
    sces2p0_01:
        datacenter: production
        addr: -REDACTED-
        exporters:
            - prometheus
            - influx
    sces4p0_01:
        datacenter: colocation
        addr: -REDACTED-
        exporters:
            - prometheus
            - influx

Poller

sces1p2_03,sces1p3_03

Version

harvest version 23.05.0-1 (commit 6f74c7a5) (build date 2023-05-03T08:08:46-0400) linux/amd64

Poller logs

I will send you the log files by email.

OS and platform

Red Hat Enterprise Linux release 8.4 (Ootpa)

ONTAP or StorageGRID version

NetApp Release 9.10.1P8

Additional Context

No response

References

No response

rahulguptajss commented 1 year ago

@faguayot Could you please resend the log files? We were only able to extract one gz file named poller_sces1p1_03-2023-05-24T19-10-00.664.log, and it appears that the rest of the files are corrupted.

faguayot commented 1 year ago

@rahulguptajss I've tried attaching the files again. Let me know if they are OK or if you are still seeing the corruption problem.

Hardikl commented 1 year ago

@faguayot Thanks for sharing the log files again. We have received them and did an initial investigation. We found a few errors:

  1. Duplicate instance key error. @rahulguptajss has created an issue for this and we will look into it.

    {"level":"error","Poller":"sces1p3_03","plugin":"ZapiPerf:Volume","object":"Volume","error":"duplicate instance key => .","key":".","caller":"volume/volume.go:88","time":"2023-05-24T10:28:02+02:00","message":"Failed to create new instance"}
  2. System busy errors. These logs exist in both pollers and started to occur for poller xxxx_03_03 around 12:34 (as you mentioned). These messages come directly from ONTAP and Harvest shows them as-is; a rough sketch of the resulting standby behaviour is included at the end of this comment.

    {"level":"info","Poller":"sces1p3_03","collector":"ZapiPerf:LIF","error":"API request rejected => System busy: 17 requests on table \"perf_object_get_instances\" have been pending for 2 seconds. The last completed call took 94 seconds.","task":"data","caller":"collector/collector.go:365","time":"2023-05-24T12:34:02+02:00","message":"API rejected, entering standby mode"}
    {"level":"info","Poller":"sces1p3_03","collector":"ZapiPerf:NicCommon","error":"API request rejected => System busy: 17 requests on table \"perf_object_get_instances\" have been pending for 2 seconds. The last completed call took 94 seconds.","task":"data","caller":"collector/collector.go:365","time":"2023-05-24T12:34:02+02:00","message":"API rejected, entering standby mode"}
    ...

While searching for these System busy errors on the ONTAP side, we found burt 1387773, which has the same error stack.

Sharing the description and the Public Review (PR) mentioned in the burt:

DESCRIPTION:

When counter manager appears to be hung which usually manifests as a problem in 
terms of gathering perf data or polling perf objects, following kind of messages
from mgwd would help in ascertaining that the issue is indeed in counter manager
daemon:
>>Attempt to get cm object resource_headroom_cpu failed:System busy: 7 requests on
table "perf_object_get_instances" have been pending for 13510621 seconds. The
last completed call took 0 seconds.

PR

------------------------
%%% TITLE: Blocked object and transport threads hamper counter manager's functionality
%%% DESCRIPTION:
Performance NetApp Manageability SDK (NMSDK) requests appear to get blocked after 
hitting the storage controller with multiple performance NMSDKs. This blockage occurs 
because the transport threads indefinitely waits on a mutex which, on the other
hand, is held by a dead thread. 
%%% WORKAROUND:
To temporarily resolve the issue, restart of the counter manager daemon(CMD) process on 
each node.
%%% NOTES:
None
--------------
  3. NFSv4-related issues. These logs exist in both pollers. We need to check whether this is related to the Counter Manager issue mentioned above or whether the NFS object simply does not exist on the cluster, which can be a valid scenario.
    {"level":"info","Poller":"sces1p3_03","collector":"ZapiPerf:NFSv42Node","error":"API request rejected => Object \"nfsv4_2:node\" was not found.","task":"counter","caller":"collector/collector.go:365","time":"2023-05-24T11:08:09+02:00","message":"API rejected, entering standby mode"}
    {"level":"info","Poller":"sces1p3_03","collector":"ZapiPerf:NFSv42","error":"API request rejected => Object \"nfsv4_2\" was not found.","task":"counter","caller":"collector/collector.go:365","time":"2023-05-24T11:08:09+02:00","message":"API rejected, entering standby mode"}
Hardikl commented 1 year ago

@faguayot For issue 1, the duplicate instance key error, could you please invoke these commands, capture the ZAPI responses, and send them to our email address?

./bin/harvest zapi -p sces1p3_03 show data --object volume
./bin/harvest zapi -p sces1p1_03 show data --object volume

faguayot commented 1 year ago

Sure @Hardikl, I will share the information by email.

Thanks.

faguayot commented 1 year ago

I caught another moment where the issue was happening on one cluster. (screenshot)

cgrinds commented 1 year ago

hi @faguayot if you have the logs for that poller, please email them to us so we can take a look.

Hardik did a nice job analyzing the last set of logs you sent and found a logging issue. Let me explain what the problem is and why it might be related to the gaps shown above.

The Harvest perf collector works like this:

  1. Harvest asks ONTAP for the instances of a resource (e.g. Volume)
  2. Harvest uses the results of 1) to ask ONTAP for the counters for those instances
  3. Harvest uses the results of 2) to calculate deltas, rates, etc.

The problem Hardik found in your logs occurs between 1 and 2. The instance we collected at step 1 was deleted by the time we got to step 2. That means at step 2, Harvest asks ONTAP for the counters of an instance that no longer exists. ONTAP returns an empty record when that happens, and Harvest was noisier about logging this than it should have been. Hardik is working on a fix for that.
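
To make the stale-instance case concrete, here is a simplified Go sketch. It is an illustration only, not Harvest's real collector, and the instance names are invented:

package main

import "fmt"

// counters stands in for ONTAP's perf counter response, keyed by instance name.
// "vol_b" was deleted between step 1 and step 2, so ONTAP returns no record for it.
var counters = map[string]map[string]float64{
	"vol_a": {"read_ops": 100, "write_ops": 40},
}

func main() {
	// Step 1: instance discovery (vol_b still existed when this list was built).
	instances := []string{"vol_a", "vol_b"}

	// Step 2: fetch counters for the discovered instances.
	for _, name := range instances {
		record, ok := counters[name]
		if !ok || len(record) == 0 {
			// Deleted instance: the response is empty, so skip it quietly
			// instead of logging an error on every poll.
			fmt.Printf("skipping stale instance %q\n", name)
			continue
		}
		// Step 3: compute deltas/rates from the previous poll (omitted here).
		fmt.Printf("instance %q counters: %v\n", name, record)
	}
}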

Is it possible that some of the gaps you're seeing on this cluster are because there are creates/deletes? Depending on the dashboard and whether the panel(s) use topk or not, I think instance deletions could manifest as a gap because, on the left side of the gap, the instance was included in the top 5, then it was deleted, and on the right side of the gap, a new instance was picked for the top 5.

faguayot commented 1 year ago

Hello @cgrinds,

It is great that you found the problem given how much information you had to review. I understood your explanation, Chris; thanks for always making it really clear. If I'm right, this issue happens with every instance type (Aggregate, Volume, Node, LUN, etc.) for which performance metrics/counters are gathered. Is that correct?

I will wait until you have the fix to check whether the behaviour happens again or not. Thanks.

cgrinds commented 1 year ago

Hi @faguayot

While @Hardikl works on a potential fix for this issue, can you email the most recent logs for one of your pollers with gaps?

Any idea if that cluster may have deletions during the time of the gaps? And yes, to your question, the instance removal logic is the same for all object types.

faguayot commented 1 year ago

Hello @cgrinds, I don't remember which cluster showed this behaviour, and I don't know during which period of time we saw these cuts. Apologies, but I didn't take note of this info. If I see the same behaviour again, I will share the logs and that info right away. I'm not sure what you are referring to regarding the deletions. Where?

Thanks.

cgrinds commented 1 year ago

Hi @faguayot

Sounds good! I can't tell which dashboard panel you shared in comment https://github.com/NetApp/harvest/issues/2113#issuecomment-1589911821 but for the sake of discussion, let's assume that was on the volume dashboard.

If it was, my question was, do you know if volumes were deleted during the time that you see the gaps in the panel?

Sometimes folks are running automation or workloads where deletions are expected, other times they know deletions are not expected, and most folks don't really know since they aren't sure how the cluster is being used at any given time.

I was trying to ask you in the earlier comment if you knew whether the cluster in question might have object deletions going on during the time you see gaps. If you aren't sure, that's perfectly fine.

faguayot commented 1 year ago

Hi @cgrinds ,

I think I now understand your question about the deleted objects: Harvest is continuously gathering information from the cluster, and you are trying to understand whether anybody could have deleted those objects during the cuts. My answer is no; we manage the creation of the volumes, LUNs and all the other objects in the ONTAP cluster, so we know when we are deleting them. What we don't know is what people do at the upper storage levels, i.e. whether or not they are deleting data that lives inside those objects.

I hope my clarification gives you the information you asked for.

Thanks.

cgrinds commented 1 year ago

Hi @faguayot Hardik's changes are in nightly. If you want to try these out before the next release, you can grab the latest nightly build.

Feel free to email the most recent logs for any of your pollers with gaps.

faguayot commented 1 year ago

Hello @cgrinds

Yes, I will try to test it tomorrow, but I don't know when (or whether) I will run into the problem again.

Thanks for the warning.

cgrinds commented 10 months ago

@faguayot any updates on this issue?

faguayot commented 10 months ago

Hello @cgrinds ,

I haven't seen the cuts during this time, but I don't know whether the problem disappeared with a release or I simply didn't notice it. If you want, we can close the issue, and if it appears again we will review it.

Thanks.

cgrinds commented 10 months ago

Thanks for the follow-up, @faguayot. That's a good idea; we'll close and reopen if you see this again.

faguayot commented 4 months ago

Hello,

Recently we have been experiencing periods in which we have cuts in the data collection. These cuts happened last night.

(screenshot)

But I have also caught this happening on other days:

(screenshot)

I don't know whether this could be related to the same issue you dealt with here or whether it is something different.

rahulguptajss commented 4 months ago

@faguayot What is the Harvest version? Could you share the logs with ng-harvest-files@netapp.com? https://netapp.github.io/harvest/24.02/help/log-collection/

faguayot commented 4 months ago

The Harvest version is the latest available:

harvest version 24.02.0-1 (commit 8f9201cc) (build date 2024-02-21T10:01:19-0500) linux/amd64

Regarding the logs, I have sent the logs from today because they are heavy.

rahulguptajss commented 4 months ago

@faguayot Thanks. Logs received. Could you also share which metrics have been cut? Please share the metric names. Are particular objects affected, such as volumes or SVMs, or do all perf metrics have cuts? Does this happen to all pollers, or only specific ones? Please feel free to email this information as well.

faguayot commented 4 months ago

@rahulguptajss I am focusing on the last time I found these cuts, which was 19 April from around 4:28am to 4:36am approximately. I have checked the rest of the pollers and at that time we don't see any cuts in the collection, so it seems only one poller has the problem. I will send you the poller's details by email. The metrics where we have seen those cuts are:

In addition to the node metrics, I think there are other metrics missing on the Power dashboard.

(screenshot) Let me know if you need anything else.

rahulguptajss commented 4 months ago

Thank you, @faguayot, for the detailed information. I have reviewed the logs for the relevant poller and noticed that the PollInstance task (ZapiPerf) for the system node object, which runs every 10 minutes, has failed. It only recovered after 10 minutes, according to its schedule, which explains the gap you are observing in the metrics. The error shown below is likely due to ONTAP being unable to process the request. This occurrence is noted only once in the logs, which is why you observe a single interruption.

{"level":"error","Poller":"XXXXXX","collector":"ZapiPerf:SystemNode","error":"connection error => Post \"https://servlets/netapp.servlets.admin.XMLrequest_filer\": EOF","request":"perf-object-instance-list-info-iter","batchTag":"","caller":"zapiperf/zapiperf.go:1610","time":"2024-04-19T04:24:17+02:00","message":"InvokeBatchRequest failed"}

There are also timeout errors related to disk and shelf requests, as shown below. These are due to the ZAPI storage-shelf-info-get-iter and the REST API api/private/cli/storage/shelf being slow, resulting in timeouts. This also accounts for the gap in the Power dashboard, as the power metrics are generated by the code in the disk.go plugin. Could you please run the following ONTAP CLI command on the cluster for the relevant poller to see whether it is fast or slow?

storage shelf show
{"level":"error","Poller":"XXXX","collector":"ZapiPerf:Disk","error":"context deadline exceeded (Client.Timeout or context cancellation while reading body)","plugin":"Disk","caller":"collector/collector.go:444","time":"2024-04-19T03:59:03+02:00"}
{"level":"error","Poller":"XXXX","plugin":"Rest:Health","object":"Health","error":"error making request connection error Get \"/api/private/cli/storage/shelf?fields=error_type%2Cerror_severity%2Cerror_text&return_records=true&start.shelf=xxxxx&start.shelf_uuid=xxxxx\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","href":"api/private/cli/storage/shelf?return_records=true&fields=error_type,error_severity,error_text","caller":"collectors/commonutils.go:54","time":"2024-04-19T10:52:09+02:00","message":"Failed to fetch data"}

Below is a summary of the API response times for this poller, where the NFSLock API takes over 38 seconds to fetch approximately 19,000 records, and storage-shelf-info-get-iter takes about 18 seconds. These and other APIs are taking a considerable amount of time. Could you check whether this cluster is also being monitored by other tools, which might contribute to the load/timeouts on these APIs?

Collector                |TotalPollTimeMs |AvgApiMs |MaxApiMs |AvgCalcMs |AvgParseMs |AvgPluginMs |AvgInstances |AvgMetrics |Rows 
Zapi:NFSLock             |38850           |38090    |169528   |0         |760        |0           |19710        |229747     |183  
Zapi:QosWorkload         |6202            |6172     |76861    |0         |30         |0           |2141         |6423       |208  
Zapi:QosPolicyFixed      |1150            |1087     |14185    |0         |61         |2           |2126         |17008      |209  
ZapiPerf:Workload        |10434           |10041    |103668   |6         |387        |0           |2082         |56214      |206  
ZapiPerf:Disk            |47041           |1818     |48091    |1         |129        |45093       |528          |10032      |604  
Zapi:Sensor              |15587           |15106    |60037    |0         |35         |446         |1065         |9513       |206  
Zapi:Disk                |2870            |2837     |16103    |0         |33         |0           |336          |6384       |208  
ZapiPerf:LIF             |574             |565      |12179    |0         |9          |0           |87           |522        |625  
ZapiPerf:Qtree           |825             |818      |12012    |0         |7          |0           |71           |284        |625  
ZapiPerf:NicCommon       |568             |558      |13730    |0         |10         |0           |66           |726        |625  
ZapiPerf:Volume          |452             |433      |15653    |0         |18         |1           |65           |1755       |625  
ZapiPerf:HostAdapter     |323             |321      |10269    |0         |2          |0           |48           |96         |615  
Zapi:SecurityCert        |1410            |820      |10513    |0         |7          |583         |131          |786        |208  
Rest:DIMM                |1303            |1303     |12514    |0         |0          |0           |96           |480        |208  
Zapi:AggregateSpare      |1156            |1155     |11028    |0         |1          |0           |94           |376        |208  
Zapi:LIF                 |2693            |2688     |14628    |0         |5          |0           |87           |1479       |208  
Zapi:SecurityAccount     |690             |688      |11747    |0         |2          |0           |85           |555        |208  
Zapi:Qtree               |2518            |1579     |12702    |0         |2          |937         |71           |438        |208  
Zapi:Volume              |5545            |2533     |14128    |0         |13         |2999        |65           |3167       |209  
ZapiPerf:WorkloadVolume  |662             |652      |12179    |0         |10         |0           |58           |1566       |209  
ZapiPerf:VolumeSvm       |397             |396      |9687     |0         |1          |0           |10           |100        |625  
ZapiPerf:HeadroomAggr    |358             |357      |10016    |0         |1          |0           |7            |315        |625  
Rest:NetRoute            |652             |652      |11772    |0         |0          |0           |19           |138        |208  
ZapiPerf:HeadroomCPU     |462             |461      |11721    |0         |1          |0           |6            |270        |626  
ZapiPerf:NFSv4Node       |553             |549      |10554    |0         |4          |0           |6            |486        |625  
ZapiPerf:NFSv3Node       |594             |591      |8424     |0         |3          |0           |6            |312        |625  
ZapiPerf:ExtCacheObj     |456             |455      |12143    |0         |1          |0           |6            |102        |625  
ZapiPerf:NFSv41Node      |563             |558      |12038    |0         |5          |0           |6            |654        |625  
ZapiPerf:Rwctx           |475             |474      |13165    |0         |1          |0           |6            |780        |625  
ZapiPerf:WAFL            |309             |308      |7876     |0         |1          |0           |6            |528        |625  
ZapiPerf:VolumeNode      |701             |699      |13794    |0         |2          |0           |6            |216        |625  
ZapiPerf:SystemNode      |543             |541      |15255    |0         |2          |0           |6            |294        |615  
Zapi:Shelf               |18411           |18411    |48420    |0         |0          |0           |15           |135        |209  
Zapi:QosPolicyAdaptive   |783             |782      |11760    |0         |1          |0           |15           |135        |208  
Zapi:Subsystem           |2166            |2166     |13348    |0         |0          |0           |13           |52         |208  
Zapi:Aggregate           |10623           |8877     |25303    |0         |2          |1744        |12           |552        |209  
ZapiPerf:CopyManager     |666             |666      |12348    |0         |0          |0           |4            |20         |625  
ZapiPerf:NFSv3           |522             |520      |10457    |0         |2          |0           |4            |688        |625  
ZapiPerf:NFSv41          |687             |683      |14887    |0         |4          |0           |4            |916        |625  
Zapi:SVM                 |1127            |695      |11724    |0         |0          |432         |12           |56         |208  
ZapiPerf:NFSv4           |374             |371      |14616    |0         |3          |0           |4            |804        |616  
Zapi:SnapshotPolicy      |661             |661      |11740    |0         |0          |0           |11           |36         |208  
Zapi:Node                |833             |833      |10912    |0         |0          |0           |6            |114        |208  
Zapi:AggregateEfficiency |871             |871      |11994    |0         |0          |0           |6            |48         |208  
Zapi:EmsDestination      |689             |689      |11724    |0         |0          |0           |4            |11         |208  
Zapi:NtpServer           |622             |622      |11706    |0         |0          |0           |3            |6          |208  
Rest:SupportAutoUpdate   |654             |654      |11696    |0         |0          |0           |1            |1          |208  
Zapi:ClusterPeer         |669             |669      |11731    |0         |0          |0           |1            |3          |208  
Zapi:Status              |799             |799      |10529    |0         |0          |0           |1            |1          |208  
Rest:SecuritySsh         |691             |691      |11688    |0         |0          |0           |1            |3          |208  
Zapi:Security            |1297            |644      |11696    |0         |0          |653         |1            |1          |208  
Zapi:Support             |728             |728      |9874     |0         |0          |0           |1            |4          |208  
Rest:Health              |91368           |784      |12099    |0         |0          |90584       |1            |2          |203  
Ems:Ems                  |1225            |1225     |24625    |0         |0          |0           |0            |0          |208  
faguayot commented 4 months ago

Hi @rahulguptajss, thanks for the detailed analysis. We are having performance issues on this poller, so what you said about load and response times makes sense.

The command you shared, storage shelf show, also took some time to return its result.

I have stopped Harvest for this poller in the other installations we have, to avoid that overload, since Harvest is the only tool accessing the cluster apart from OCUM/Active IQ.

rahulguptajss commented 4 months ago

@faguayot Sure, I believe that once the root cause of the storage shelf show slowness is resolved, we can consider enabling Harvest again for this poller. Harvest supports an argument called jitter that distributes API requests to ONTAP over a scheduled range, which can help manage and balance the load by avoiding sending all Harvest requests at the same time. If you think this might be beneficial, please let us know, and we can provide the necessary steps to enable it.
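
Roughly speaking, the idea is to add a small random delay before each poll so that requests are spread out instead of all firing at once. Here is a generic Go sketch of the concept only, not Harvest's actual jitter implementation; the exact configuration steps are what we would share separately:

package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// pollWithJitter waits a random offset within maxJitter before the first request,
// so several collectors don't all hit ONTAP at exactly the same instant.
func pollWithJitter(name string, maxJitter time.Duration, wg *sync.WaitGroup) {
	defer wg.Done()
	delay := time.Duration(rand.Int63n(int64(maxJitter)))
	fmt.Printf("%s: waiting %v before polling\n", name, delay)
	time.Sleep(delay)
	fmt.Printf("%s: sending API request\n", name)
}

func main() {
	var wg sync.WaitGroup
	for _, c := range []string{"collector-a", "collector-b", "collector-c"} {
		wg.Add(1)
		go pollWithJitter(c, 5*time.Second, &wg)
	}
	wg.Wait()
}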

faguayot commented 4 months ago

I didn't know about this option and it sounds good. Could you share how I can use it? Is it an option for all pollers or is it configurable only for one? I ask because I think it would be beneficial for all pollers in general.

rahulguptajss commented 4 months ago

Sure. Can you share which installation method you are using for Harvest: RPM, DEB, native, container, or Nabox?

faguayot commented 4 months ago

Yes, we are installing harvest by RPM.

rahulguptajss commented 4 months ago

Yes, we are installing harvest by RPM.

Thanks. Added here. https://github.com/NetApp/harvest/discussions/2856