Closed faguayot closed 2 years ago
Hi @faguayot, the CP (Consistency Points) Counts panel you pasted above from the Harvest 2 Disk dashboard uses the following panel expression (from Edit on the panel menu):
sum(wafl_cp_count{datacenter="$Datacenter",cluster="$Cluster",node=~"$Node"}) by (subcounter)
That counter is from the zapiperf wafl.yaml template.
The Harvest 1.6 panel is from the Node dashboard and uses this expression
alias(transformNull(sumSeries(netapp.perf.$Group.$Cluster.node.$Node.wafl.cp_count.{back-to-back_CP,deferred_back-to-back_CP}), 0), 'Back-to-back CP Count')
Both Harvest 2 and 1.6 are collecting the same ZAPI and counters. That's good :)
Now let's answer some of your questions. If we run the following Harvest tool it will extract the metadata about the counter in question.
bin/zapi --poller u2 show counters --object wafl | less
Search for cp_count and we find:
[counter-info] - *
[desc] - Array of counts of different types of CPs
[is-deprecated] - false
[labels] - *
[label-info] - wafl_timer generated CP,dynamic triggerred CP,snapshot generated CP,wafl_avail_bufs generated CP,dirty_blk_cnt generated CP,full NV-log generated CP,back-to-back CP,back-to-back CP start,flush generated CP,sync generated CP,deferred back-to-back CP,low mbufs generated CP,low datavecs generated CP,nvlog replay takeover time limit CP
[name] - cp_count
[privilege-level] - diag
[properties] - delta
[type] - array
[unit] - none
This tells us that the counter is an array of counts, one for each kind of consistency point. Since properties = delta, Harvest calculates the difference between the current value and the previous one. Finally, the ZapiPerf collector takes this array and flattens it into individual metrics. In other words, the ZapiPerf collector takes this from ONTAP
<instances>
<instance-data>
<counters>
<counter-data>
<name>cp_count</name>
<value>276318,0,156,0,1146,129,6,72,0,4443,0,0,0,1016</value>
</counter-data>
</counters>
<name>wafl</name>
<uuid>umeng-aff300-05:kernel:wafl</uuid>
</instance-data>
</instances>
and turn it into something like this, doing the delta math for you and creating the correct metric for each array value.
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="dynamic_triggerred_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="low_datavecs_generated_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="deferred_back_to_back_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="back_to_back_CP_start"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="back_to_back_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="low_mbufs_generated_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="flush_generated_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="nvlog_replay_takeover_time_limit_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="wafl_avail_bufs_generated_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="wafl_timer_generated_CP"} 12
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="full_NV_log_generated_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="dirty_blk_cnt_generated_CP"} 1
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="snapshot_generated_CP"} 0
wafl_cp_count{cluster="umeng-aff300-05-06",datacenter="dc-1",node="umeng-aff300-05",metric="sync_generated_CP"} 0
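To make the flatten-and-delta step concrete, here is a minimal Python sketch. It is not Harvest's actual code: `metric_name` and `flatten_delta` are hypothetical helpers, the name normalization (spaces and hyphens to underscores) is inferred from the sample output above, and the "previous" poll is invented so that the deltas match that output.

```python
# Label order copied from the [label-info] output above.
LABELS = (
    "wafl_timer generated CP,dynamic triggerred CP,snapshot generated CP,"
    "wafl_avail_bufs generated CP,dirty_blk_cnt generated CP,"
    "full NV-log generated CP,back-to-back CP,back-to-back CP start,"
    "flush generated CP,sync generated CP,deferred back-to-back CP,"
    "low mbufs generated CP,low datavecs generated CP,"
    "nvlog replay takeover time limit CP"
).split(",")


def metric_name(label: str) -> str:
    """Hypothetical normalizer: spaces and hyphens become underscores."""
    return label.replace(" ", "_").replace("-", "_")


def flatten_delta(previous: str, current: str) -> dict[str, int]:
    """Pair each array position with its label and subtract the previous poll."""
    prev = [int(v) for v in previous.split(",")]
    curr = [int(v) for v in current.split(",")]
    if not (len(prev) == len(curr) == len(LABELS)):
        raise ValueError("array length does not match label count")
    return {metric_name(l): c - p for l, p, c in zip(LABELS, prev, curr)}


# The current poll is the <value> shown above; the previous poll is an
# assumption made up for this example.
previous = "276306,0,156,0,1145,129,6,72,0,4443,0,0,0,1016"
current = "276318,0,156,0,1146,129,6,72,0,4443,0,0,0,1016"
deltas = flatten_delta(previous, current)
for name, value in deltas.items():
    print(f'wafl_cp_count{{metric="{name}"}} {value}')
```

Each array position becomes its own series, distinguished only by the metric label, which is why the panel needs to group on that label.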
With this information, we can see there is a bug in the Harvest 2 CP (Consistency Points) Counts panel I pasted above. The expression for that panel has by (subcounter) when it should probably be by (metric), since that's the label for each kind of CP, as shown above.
If I change the expression to this
sum(wafl_cp_count{datacenter="$Datacenter",cluster="$Cluster",node=~"$Node"}) by (metric)
I get something more useful that provides more information than the Harvest 1.6 panel. Does this provide what you need?
Alternatively if you want to exactly recreate the Harvest 1.6 panel, I believe this is what you want.
sum(wafl_cp_count{datacenter="$Datacenter",cluster="$Cluster",node=~"$Node",metric=~"back_to_back_CP|deferred_back_to_back_CP"})
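To illustrate why the grouping label matters, here is a small Python sketch of how a sum(...) by (label) aggregation behaves. The node names and values are invented, and `sum_by` is a hypothetical helper, not part of Prometheus or Harvest. It only mimics my understanding that a label absent from every series collapses everything into a single unlabeled group.

```python
from collections import defaultdict

# Hypothetical samples: (labels, value). Values are invented for illustration.
samples = [
    ({"node": "node-a", "metric": "back_to_back_CP"}, 3),
    ({"node": "node-a", "metric": "deferred_back_to_back_CP"}, 1),
    ({"node": "node-a", "metric": "wafl_timer_generated_CP"}, 12),
    ({"node": "node-b", "metric": "back_to_back_CP"}, 2),
]


def sum_by(samples, label):
    """Mimic sum(...) by (label); a missing label groups under ''."""
    groups = defaultdict(int)
    for labels, value in samples:
        groups[labels.get(label, "")] += value
    return dict(groups)


# Grouping on a label that no series carries yields one combined total.
by_subcounter = sum_by(samples, "subcounter")
# Grouping on the label that actually exists yields one total per CP kind.
by_metric = sum_by(samples, "metric")
# Harvest 1.6 style: a single total over the two back-to-back kinds.
wanted = {"back_to_back_CP", "deferred_back_to_back_CP"}
harvest_16_total = sum(v for labels, v in samples if labels["metric"] in wanted)

print(by_subcounter)
print(by_metric)
print(harvest_16_total)
```

With by (subcounter) every CP kind lands in one bucket, so the panel cannot tell them apart. With by (metric) each kind gets its own line, and the regex-filtered sum gives the single back-to-back total the 1.6 panel showed.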
Let us know which of these you prefer and we'll improve this. Regardless of which you prefer, I'll check on adding the write latency into the same panel.
Hello @cgrinds
First, sorry for the delay, and thank you for the time you always spend reading and understanding our problems and ideas, and investigating what is missing or where the problem lies.
Second, before we opened the ticket we had checked in Prometheus which wafl metrics were available, but we didn't know what each of them meant. Thank you again for every command and trick you show me in each issue; they help me find more information the next time I have a question or problem :).
Regarding your question, I would like to have the same metrics (back-to-back_CP|deferred_back-to-back_CP) we had in Harvest 1.6, with the addition of the write latency you mentioned.
We are asking for this because, both in the past and recently, we have checked those metrics whenever we had performance problems.
Let me know if I can help or test once you have implemented it.
Best regards!
Hi @faguayot no worries on the delay and thanks for the kind words! We appreciate it.
Pull request #1049 includes the changes for this feature request. The panel on the Disk dashboard was updated to look like the following. Let us know if that's what you're looking for.
This will be in tomorrow's (2022-05-25) nightly build if you want to try it out. Or you can apply the changes to your local install, since there are no code changes, only template and dashboard changes in conf/zapiperf/cdot/9.8.0/volume_node.yaml and grafana/dashboards/cmode/harvest_dashboard_disk.json.
Hello @cgrinds and @rahulguptajss ,
Thanks, it looks great.
@faguayot Thanks for the confirmation. This PR is now available in nightly build
Verified on release/22.08.0
a05a0279
Describe the bug
I don't know if this is a bug or a new feature request. While looking for the consistency point metrics, I found the following graph in the Disk dashboard.
What do the values represent: a time or a count? Whether the metric scale is right depends on that. Additionally, I miss the graph from the older Harvest that showed both write latency and back-to-back CP count.
Environment
harvest version 22.05.0-1 (commit 2bc2942) (build date 2022-05-11T07:56:11-0400) linux/amd64
bin/harvest start --config=foo.yml --collectors Zapi
Expected behavior
The graph makes clear which metric it shows.
Actual behavior
I can't tell whether it shows a CP count or a timer.
Possible solution, workaround, fix
Fix the query and, if possible, add the other metric that I miss from the previous Harvest.