NetApp / harvest

Open-metrics endpoint for ONTAP and StorageGRID
https://netapp.github.io/harvest/latest
Apache License 2.0

Consistency Point Count or time with maybe incorrect scale #1040

Closed faguayot closed 2 years ago

faguayot commented 2 years ago

Describe the bug I don't know whether this is a bug or a new feature request. While looking for the consistency point metrics, I found the following graph in the Disk dashboard.

image

What are the values? Time? Count? Whether the metric scale is correct depends on the answer. Additionally, I miss the graph from the older Harvest that showed both parameters together: write latency and back-to-back CP count.

image

Environment Provide accurate information about the environment to help us reproduce the issue.

Expected behavior It should be clear which metric the graph is showing.

Actual behavior I can't tell whether it is a CP count or a timer.

Possible solution, workaround, fix Fix the query and, if possible, add the other metric that I miss from the previous Harvest.

cgrinds commented 2 years ago

Hi @faguayot, the CP (Consistency Points) Counts panel you pasted above from the Harvest 2 Disk dashboard uses the following panel expression (from Edit on the panel menu):

sum(wafl_cp_count{datacenter="$Datacenter",cluster="$Cluster",node=~"$Node"}) by (subcounter)

That counter is from the zapiperf wafl.yaml template.

The Harvest 1.6 panel is from the Node dashboard and uses this expression

alias(transformNull(sumSeries(netapp.perf.$Group.$Cluster.node.$Node.wafl.cp_count.{back-to-back_CP,deferred_back-to-back_CP}), 0), 'Back-to-back CP Count')

Both Harvest 2 and 1.6 are collecting the same ZAPI and counters. That's good :)

Now let's answer some of your questions. If we run the following Harvest tool it will extract the metadata about the counter in question.

bin/zapi --poller u2 show counters --object wafl | less

Search for cp_count and we find

  [counter-info]                                   -                                   *
    [desc]                                         - Array of counts of different types of CPs
    [is-deprecated]                                -                               false
    [labels]                                       -                                   *
      [label-info]                                 - wafl_timer generated CP,dynamic triggerred CP,snapshot generated CP,wafl_avail_bufs generated CP,dirty_blk_cnt generated CP,full NV-log generated CP,back-to-back CP,back-to-back CP start,flush generated CP,sync generated CP,deferred back-to-back CP,low mbufs generated CP,low datavecs generated CP,nvlog replay takeover time limit CP
    [name]                                         -                            cp_count
    [privilege-level]                              -                                diag
    [properties]                                   -                               delta
    [type]                                         -                               array
    [unit]                                         -                                none

This tells us that this counter is an array of counts of different kinds of consistency points. Since properties = delta, Harvest will calculate the difference between the current value and the previous one. Finally, the ZapiPerf collector will take this array and flatten it into individual metrics. In other words, the ZapiPerf collector will take this from ONTAP

<instances>
<instance-data>
    <counters>
    <counter-data>
        <name>cp_count</name>
        <value>276318,0,156,0,1146,129,6,72,0,4443,0,0,0,1016</value>
    </counter-data>
    </counters>
    <name>wafl</name>
    <uuid>umeng-aff300-05:kernel:wafl</uuid>
</instance-data>
</instances>

and turn it into something like this, doing the delta math for you and creating the correct metric for each array value.

wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="dynamic_triggerred_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="low_datavecs_generated_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="deferred_back_to_back_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="back_to_back_CP_start"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="back_to_back_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="low_mbufs_generated_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="flush_generated_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="nvlog_replay_takeover_time_limit_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="wafl_avail_bufs_generated_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="wafl_timer_generated_CP"} 12
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="full_NV_log_generated_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="dirty_blk_cnt_generated_CP"} 1
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="snapshot_generated_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="sync_generated_CP"} 0
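The flattening and delta math described above can be sketched roughly as follows. This is an illustration only, not Harvest's actual code: the function names, the label sanitizing rule (spaces and hyphens become underscores, inferred from the metric names shown above), and the "previous poll" values are assumptions made up for the example. Only the current values and the label list come from the output above.

```python
import re

# Labels from the [label-info] field of cp_count, in array order.
LABELS = (
    "wafl_timer generated CP,dynamic triggerred CP,snapshot generated CP,"
    "wafl_avail_bufs generated CP,dirty_blk_cnt generated CP,"
    "full NV-log generated CP,back-to-back CP,back-to-back CP start,"
    "flush generated CP,sync generated CP,deferred back-to-back CP,"
    "low mbufs generated CP,low datavecs generated CP,"
    "nvlog replay takeover time limit CP"
).split(",")


def sanitize(label):
    """Assumed naming rule: spaces and hyphens become underscores."""
    return re.sub(r"[ -]", "_", label)


def flatten_delta(previous, current, labels=LABELS):
    """Pair each array element with its label and return current - previous."""
    prev = [int(v) for v in previous.split(",")]
    curr = [int(v) for v in current.split(",")]
    if not (len(prev) == len(curr) == len(labels)):
        raise ValueError("array length does not match the number of labels")
    return {sanitize(name): c - p for name, p, c in zip(labels, prev, curr)}


def to_exposition(deltas, **labels):
    """Render one wafl_cp_count line per array element."""
    fixed = ",".join(f'{k}="{v}"' for k, v in labels.items())
    return [f'wafl_cp_count{{{fixed},metric="{m}"}} {v}' for m, v in deltas.items()]


# Current poll: the <value> returned by ONTAP above.
CURRENT = "276318,0,156,0,1146,129,6,72,0,4443,0,0,0,1016"
# Previous poll: hypothetical values chosen so the deltas match the output above.
PREVIOUS = "276306,0,156,0,1145,129,6,72,0,4443,0,0,0,1016"

if __name__ == "__main__":
    deltas = flatten_delta(PREVIOUS, CURRENT)
    for line in to_exposition(deltas, cluster="umeng-aff300-05-06",
                              datacenter="dc-1", node="umeng-aff300-05"):
        print(line)
```

A real collector also has to handle cases this sketch ignores, such as the first poll (no previous value) and counter resets.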

With this information, we can see there is a bug in the Harvest 2 CP (Consistency Points) Counts panel expression I pasted above. The expression for that panel has by (subcounter) when it should probably be by (metric), since that's the label that identifies each kind of CP, as shown above.

If I change the expression to this

sum(wafl_cp_count{datacenter="$Datacenter",cluster="$Cluster",node=~"$Node"}) by (metric)

I get something more useful that provides more information than the Harvest 1.6 panel. Does this provide what you need?

image
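To see why the grouping label matters, here is a small sketch that mimics how a PromQL sum(...) by (label) aggregation groups series. When the label does not exist on any series, every series lands in the same group and the panel shows a single combined line. The sum_by helper and the sample list are made up for illustration; the sample values reuse the non-zero deltas shown earlier in this thread.

```python
from collections import defaultdict


def sum_by(samples, label):
    """Group samples by one label and sum their values.

    A series that lacks the label is treated as having an empty value for it,
    so all such series collapse into one group.
    """
    groups = defaultdict(int)
    for labels, value in samples:
        groups[labels.get(label, "")] += value
    return dict(groups)


# A few wafl_cp_count samples; each carries a "metric" label, none has "subcounter".
SAMPLES = [
    ({"node": "umeng-aff300-05", "metric": "wafl_timer_generated_CP"}, 12),
    ({"node": "umeng-aff300-05", "metric": "dirty_blk_cnt_generated_CP"}, 1),
    ({"node": "umeng-aff300-05", "metric": "back_to_back_CP"}, 0),
]

if __name__ == "__main__":
    print(sum_by(SAMPLES, "subcounter"))  # one group holding the grand total
    print(sum_by(SAMPLES, "metric"))      # one group per kind of CP
```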

Alternatively, if you want to recreate the Harvest 1.6 panel exactly, I believe this is what you want.

sum(wafl_cp_count{datacenter="$Datacenter",cluster="$Cluster",node=~"$Node",metric=~"back_to_back_CP|deferred_back_to_back_CP"})
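One detail worth noting about the =~ matcher in this expression: Prometheus anchors regex matchers at both ends, so back_to_back_CP does not also pick up back_to_back_CP_start. The sketch below imitates that with Python's re.fullmatch. The helper name and the sample values are hypothetical and only serve to show which series are included in the sum.

```python
import re

# Same alternation as in the panel expression above.
PATTERN = "back_to_back_CP|deferred_back_to_back_CP"


def sum_matching(values_by_metric, pattern=PATTERN):
    """Sum the values whose metric label fully matches the pattern."""
    return sum(
        value
        for metric, value in values_by_metric.items()
        if re.fullmatch(pattern, metric) is not None
    )


# Hypothetical per-interval deltas, for illustration only.
EXAMPLE = {
    "back_to_back_CP": 2,
    "deferred_back_to_back_CP": 1,
    "back_to_back_CP_start": 5,   # not matched: the regex is fully anchored
    "wafl_timer_generated_CP": 12,
}

if __name__ == "__main__":
    print(sum_matching(EXAMPLE))
```

With these made-up values, only the first two entries contribute to the total, which mirrors the two Graphite series the Harvest 1.6 panel summed.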

Let us know which of these you prefer and we'll improve this. Regardless of which you prefer, I'll check on adding the write latency into the same panel.

faguayot commented 2 years ago

Hello @cgrinds

Firstly, sorry for the delay, and thank you for the time you always spend reading and understanding our problems and ideas, and investigating what is missing or where the problem lies.

Secondly, before we opened the ticket we had checked in Prometheus which wafl metrics existed, but we didn't know what each of them meant. Thank you again for every command and trick you show me in each issue; they help me find more information the next time I have a doubt or problem :).

Regarding your question, I would like to suggest having the same metrics we had in Harvest 1.6 (back-to-back_CP|deferred_back-to-back_CP), with the addition of the write latency that you mentioned.

We are asking for this because, both in the past and recently, we have checked those metrics when we had performance problems.

Let me know if I can help, or test it once you have implemented it.

Best regards!

cgrinds commented 2 years ago

Hi @faguayot no worries on the delay and thanks for the kind words! We appreciate it.

Pull request #1049 includes the changes for this feature request. The panel on the Disk dashboard was updated to look like the following. Let us know if that's what you're looking for.

This will be in tomorrow's (2022-05-25) nightly build if you want to try it out. Or you can apply the changes to your local install, since there are no code changes, only template and dashboard changes in conf/zapiperf/cdot/9.8.0/volume_node.yaml and grafana/dashboards/cmode/harvest_dashboard_disk.json.

image

faguayot commented 2 years ago

Hello @cgrinds and @rahulguptajss ,

Thanks, it looks great.

rahulguptajss commented 2 years ago

@faguayot Thanks for the confirmation. This PR is now available in the nightly build.

cgrinds commented 2 years ago

Verified on release/22.08.0 a05a0279

image