SckyzO closed this issue 6 months ago.
Hi @SckyzO,
you specified a wrong scrape_interval for the "gpfs_filesetquota" job. As described here, the scrape job interval must also match the sensor period configured by the IBM Storage Scale performance monitoring tool. Using the /sensorsconfig endpoint you can check the active sensor settings. Usually the GPFSFilesetQuota sensor is running once per day.
[root@RHEL92-32 tmp]# curl https://<pod-ip>:9250/sensorsconfig -u scale_grafana:<apiKeyValue> -k | jq .
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2437 100 2437 0 0 71676 0 --:--:-- --:--:-- --:--:-- 71676
[
{
"name": "\"CPU\"",
"period": "1"
},
{
"name": "\"Load\"",
"period": "1"
},
{
"name": "\"Memory\"",
"period": "1"
},
{
"filter": "\"netdev_name=veth.*|docker.*|flannel.*|cali.*|cbr.*\"",
"name": "\"Network\"",
"period": "1"
},
{
"name": "\"Netstat\"",
"period": "10"
},
{
"name": "\"Diskstat\"",
"period": "0"
},
{
"filter": "\"mountPoint=/.*/docker.*|/.*/kubelet.*\"",
"name": "\"DiskFree\"",
"period": "600"
},
{
"name": "\"Infiniband\"",
"period": "0"
},
{
"name": "\"GPFSDisk\"",
"period": "0"
},
{
"name": "\"GPFSFilesystem\"",
"period": "10"
},
{
"name": "\"GPFSNSDDisk\"",
"period": "10",
"restrict": "\"nsdNodes\""
},
{
"name": "\"GPFSPoolIO\"",
"period": "0"
},
{
"name": "\"GPFSVFSX\"",
"period": "10"
},
{
"name": "\"GPFSIOC\"",
"period": "0"
},
{
"name": "\"GPFSVIO64\"",
"period": "0"
},
{
"name": "\"GPFSPDDisk\"",
"period": "10",
"restrict": "\"nsdNodes\""
},
{
"name": "\"GPFSvFLUSH\"",
"period": "0"
},
{
"name": "\"GPFSNode\"",
"period": "10"
},
{
"name": "\"GPFSNodeAPI\"",
"period": "10"
},
{
"name": "\"GPFSFilesystemAPI\"",
"period": "10"
},
{
"name": "\"GPFSLROC\"",
"period": "0"
},
{
"name": "\"GPFSCHMS\"",
"period": "0"
},
{
"name": "\"GPFSAFM\"",
"period": "0"
},
{
"name": "\"GPFSAFMFS\"",
"period": "0"
},
{
"name": "\"GPFSAFMFSET\"",
"period": "0"
},
{
"name": "\"GPFSRPCS\"",
"period": "10"
},
{
"name": "\"GPFSWaiters\"",
"period": "10"
},
{
"name": "\"GPFSFilesetQuota\"",
"period": "3600",
"restrict": "\"@CLUSTER_PERF_SENSOR\""
},
{
"name": "\"GPFSFileset\"",
"period": "300",
"restrict": "\"@CLUSTER_PERF_SENSOR\""
},
{
"name": "\"GPFSPool\"",
"period": "300",
"restrict": "\"@CLUSTER_PERF_SENSOR\""
},
{
"name": "\"GPFSDiskCap\"",
"period": "86400",
"restrict": "\"@CLUSTER_PERF_SENSOR\""
},
{
"name": "\"GPFSEventProducer\"",
"period": "0"
},
{
"name": "\"GPFSMutex\"",
"period": "0"
},
{
"name": "\"GPFSCondvar\"",
"period": "0"
},
{
"name": "\"TopProc\"",
"period": "60"
},
{
"name": "\"GPFSQoS\"",
"period": "0"
},
{
"name": "\"GPFSFCM\"",
"period": "0"
},
{
"name": "\"GPFSBufMgr\"",
"period": "30"
},
{
"name": "\"NFSIO\"",
"period": "0",
"restrict": "\"cesNodes\"",
"type": "\"Generic\""
},
{
"name": "\"SMBStats\"",
"period": "0",
"restrict": "\"cesNodes\"",
"type": "\"Generic\""
},
{
"name": "\"SMBGlobalStats\"",
"period": "0",
"restrict": "\"cesNodes\"",
"type": "\"Generic\""
},
{
"name": "\"CTDBStats\"",
"period": "0",
"restrict": "\"cesNodes\"",
"type": "\"Generic\""
},
{
"name": "\"CTDBDBStats\"",
"period": "0",
"restrict": "\"cesNodes\"",
"type": "\"Generic\""
}
]
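For reference, here is a minimal prometheus.yml sketch of a job whose interval matches the GPFSFilesetQuota period of 3600 s; the job name comes from this thread, while the target, metrics path and credentials are placeholders you need to adapt to your setup:
scrape_configs:
  - job_name: gpfs_filesetquota              # job discussed in this thread
    scrape_interval: 1h                      # matches the GPFSFilesetQuota sensor period (3600 s)
    scrape_timeout: 1m
    scheme: https
    metrics_path: /metrics_gpfs_filesetquota # assumed path, taken from the endpoint name mentioned in this thread
    tls_config:
      insecure_skip_verify: true             # equivalent of the -k flag in the curl examples
    basic_auth:
      username: scale_grafana
      password: <apiKeyValue>
    static_configs:
      - targets: ['<pod-ip>:9250']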
Ahhhh! I didn't know this endpoint existed :-) I couldn't find any information about this endpoint or the sensor periods anywhere. This issue is a good opportunity to add that information to the documentation :P
Thank you @Helene :-)
Huuuu ... last question: you say the GPFSFilesetQuota sensor is running once per day, but if I read the output, I see 3600 (1h?), right?
{
"name": "\"GPFSFilesetQuota\"",
"period": "3600",
"restrict": "\"@CLUSTER_PERF_SENSOR\""
},
Because I had the same "issue" with 3600s ...
I don't have a straight curve, which means that Grafana dashboards cannot be used with these metrics. Do you see what I mean?
Sorry, you are right; it is running "hourly", not daily.
It looks like quota management is not enabled on your filesystem. You can easily check it by executing the following command on any GPFS node:
[root@scale-11 ~]# mmcheckquota localFS
localFS: quota management is not enabled, or one or more quota clients are not available.
mmcheckquota: Command failed. Examine previous error messages to determine cause.
In this case you need to execute:
# mmchfs localFS -Q yes --perfileset-quota
For more about quota management, please read https://www.ibm.com/docs/en/storage-scale/5.1.9?topic=quotas-enabling-disabling-gpfs-quota-management#instq
Note: after enabling quota, it can take half an hour before the sensor starts to collect data.
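Once quota is enabled, you can re-run the same checks to confirm it took effect (a sketch reusing the localFS device name from the example above):
# mmcheckquota localFS
# mmlsfs localFS -Q --perfileset-quota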
But ... my quotas are already enabled :)
[root@compute002 ~]# mmlsfs gpfs -Q --perfileset-quota
flag value description
------------------- ------------------------ -----------------------------------
-Q user;group;fileset Quotas accounting enabled
user;group;fileset Quotas enforced
user;group;fileset Default quotas enabled
--perfileset-quota yes Per-fileset quota enforcement
I don't think it is a GPFS issue; other metrics work perfectly, just this one is in "error".
On my system it works fine. Check if ZIMon is returning data by executing the following command on the pmcollector node:
# time echo "get group GPFSFilesetQuota bucket_size 3600 last 5" | /opt/IBM/zimon/zc
Yes, this command returns a lot of data ...
In the Prometheus web UI, if you query the metric gpfs_rq_blk_current in graph mode, do you have a straight curve, or do you have a hatched/cut line (like me)?
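(An aside: with a one-hour sensor period the samples are more than 5 minutes apart, so the graph no longer joins them and you get isolated points with gaps. A hedged sketch, assuming a Prometheus server reachable at <prometheus-host>:9090, that wraps the metric in last_over_time() so the last hourly sample is carried forward:
# curl -sG 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode 'query=last_over_time(gpfs_rq_blk_current[1h])' | jq .
The same last_over_time(gpfs_rq_blk_current[1h]) expression can be used in a Grafana panel instead of the bare metric name.)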
Try to update the metadata:
[root@RHEL92-32 ~]# curl https://<pod-ip>:9250/update -u scale_grafana:<apiKeyValue> -k | jq .
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 42 100 42 0 0 162 0 --:--:-- --:--:-- --:--:-- 163
{
"msg": "Successfully retrieved MetaData"
}
Hello,
I use metrics_gpfs_filesetquota, and I have an issue ... very strange, I don't know why. There are some holes in my metrics collection (always 5 min), and that's something I don't have with other endpoints.
I don't think I made a configuration error; if you have an idea, I'll take it.
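A minimal sketch, assuming the same port 9250 and scale_grafana credentials shown in the curl examples above, to pull the fileset-quota endpoint by hand and confirm the series are exposed:
# curl -s -k -u scale_grafana:<apiKeyValue> https://<pod-ip>:9250/metrics_gpfs_filesetquota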