IBM / ibm-spectrum-scale-bridge-for-grafana

This tool allows IBM Storage Scale users to monitor the performance of IBM Storage Scale devices with third-party applications such as Grafana or Prometheus.
Apache License 2.0

Issue with gpfs_filesetquota metrics collection #191

Closed. SckyzO closed this issue 6 months ago

SckyzO commented 6 months ago

Hello,

I use the /metrics_gpfs_filesetquota endpoint, and I am seeing a very strange issue that I cannot explain. There are gaps in my metrics collection (always 5 minutes), which is something I do not see with the other endpoints.

[screenshot: metrics graph showing the gaps described above]

  - job_name: 'gpfs_filesetquota'
    scrape_interval: 5m
    scrape_timeout: 2m
    metrics_path: '/metrics_gpfs_filesetquota'
    static_configs:
      - targets: ['localhost:9250']
        labels:
          env: qualif
          cluster: gpfs
    honor_timestamps: True
    scheme: https
    tls_config:
      cert_file: /etc/bridge_ssl/certs/cert_brige_ibm.pem
      key_file: /etc/bridge_ssl/certs/privkey_brige_ibm.pem
      insecure_skip_verify: True

I don't think I made a configuration error, but if you have any ideas, I'm happy to hear them.

Helene commented 6 months ago

Hi @SckyzO,

you specified the wrong scrape_interval in the "gpfs_filesetquota" job. As described here, the scrape job interval must also match the sensor period configured by the IBM Storage Scale performance monitoring tool. Using the /sensorsconfig endpoint you can check the active sensor settings (see the adjusted scrape job sketch after the output below). Usually the GPFSFilesetQuota sensor runs once per day.

[root@RHEL92-32 tmp]# curl https://<pod-ip>:9250/sensorsconfig -u scale_grafana:<apiKeyValue> -k | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2437  100  2437    0     0  71676      0 --:--:-- --:--:-- --:--:-- 71676
[
  {
    "name": "\"CPU\"",
    "period": "1"
  },
  {
    "name": "\"Load\"",
    "period": "1"
  },
  {
    "name": "\"Memory\"",
    "period": "1"
  },
  {
    "filter": "\"netdev_name=veth.*|docker.*|flannel.*|cali.*|cbr.*\"",
    "name": "\"Network\"",
    "period": "1"
  },
  {
    "name": "\"Netstat\"",
    "period": "10"
  },
  {
    "name": "\"Diskstat\"",
    "period": "0"
  },
  {
    "filter": "\"mountPoint=/.*/docker.*|/.*/kubelet.*\"",
    "name": "\"DiskFree\"",
    "period": "600"
  },
  {
    "name": "\"Infiniband\"",
    "period": "0"
  },
  {
    "name": "\"GPFSDisk\"",
    "period": "0"
  },
  {
    "name": "\"GPFSFilesystem\"",
    "period": "10"
  },
  {
    "name": "\"GPFSNSDDisk\"",
    "period": "10",
    "restrict": "\"nsdNodes\""
  },
  {
    "name": "\"GPFSPoolIO\"",
    "period": "0"
  },
  {
    "name": "\"GPFSVFSX\"",
    "period": "10"
  },
  {
    "name": "\"GPFSIOC\"",
    "period": "0"
  },
  {
    "name": "\"GPFSVIO64\"",
    "period": "0"
  },
  {
    "name": "\"GPFSPDDisk\"",
    "period": "10",
    "restrict": "\"nsdNodes\""
  },
  {
    "name": "\"GPFSvFLUSH\"",
    "period": "0"
  },
  {
    "name": "\"GPFSNode\"",
    "period": "10"
  },
  {
    "name": "\"GPFSNodeAPI\"",
    "period": "10"
  },
  {
    "name": "\"GPFSFilesystemAPI\"",
    "period": "10"
  },
  {
    "name": "\"GPFSLROC\"",
    "period": "0"
  },
  {
    "name": "\"GPFSCHMS\"",
    "period": "0"
  },
  {
    "name": "\"GPFSAFM\"",
    "period": "0"
  },
  {
    "name": "\"GPFSAFMFS\"",
    "period": "0"
  },
  {
    "name": "\"GPFSAFMFSET\"",
    "period": "0"
  },
  {
    "name": "\"GPFSRPCS\"",
    "period": "10"
  },
  {
    "name": "\"GPFSWaiters\"",
    "period": "10"
  },
  {
    "name": "\"GPFSFilesetQuota\"",
    "period": "3600",
    "restrict": "\"@CLUSTER_PERF_SENSOR\""
  },
  {
    "name": "\"GPFSFileset\"",
    "period": "300",
    "restrict": "\"@CLUSTER_PERF_SENSOR\""
  },
  {
    "name": "\"GPFSPool\"",
    "period": "300",
    "restrict": "\"@CLUSTER_PERF_SENSOR\""
  },
  {
    "name": "\"GPFSDiskCap\"",
    "period": "86400",
    "restrict": "\"@CLUSTER_PERF_SENSOR\""
  },
  {
    "name": "\"GPFSEventProducer\"",
    "period": "0"
  },
  {
    "name": "\"GPFSMutex\"",
    "period": "0"
  },
  {
    "name": "\"GPFSCondvar\"",
    "period": "0"
  },
  {
    "name": "\"TopProc\"",
    "period": "60"
  },
  {
    "name": "\"GPFSQoS\"",
    "period": "0"
  },
  {
    "name": "\"GPFSFCM\"",
    "period": "0"
  },
  {
    "name": "\"GPFSBufMgr\"",
    "period": "30"
  },
  {
    "name": "\"NFSIO\"",
    "period": "0",
    "restrict": "\"cesNodes\"",
    "type": "\"Generic\""
  },
  {
    "name": "\"SMBStats\"",
    "period": "0",
    "restrict": "\"cesNodes\"",
    "type": "\"Generic\""
  },
  {
    "name": "\"SMBGlobalStats\"",
    "period": "0",
    "restrict": "\"cesNodes\"",
    "type": "\"Generic\""
  },
  {
    "name": "\"CTDBStats\"",
    "period": "0",
    "restrict": "\"cesNodes\"",
    "type": "\"Generic\""
  },
  {
    "name": "\"CTDBDBStats\"",
    "period": "0",
    "restrict": "\"cesNodes\"",
    "type": "\"Generic\""
  }
]
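Based on the period of 3600 reported above for GPFSFilesetQuota, a minimal sketch of the adjusted scrape job would look like the following (targets, labels, and certificate paths are copied from the config in the first comment and stand in for whatever your own values are):

  - job_name: 'gpfs_filesetquota'
    # match the GPFSFilesetQuota sensor period (3600 s) reported by /sensorsconfig
    scrape_interval: 1h
    scrape_timeout: 2m
    metrics_path: '/metrics_gpfs_filesetquota'
    static_configs:
      - targets: ['localhost:9250']
        labels:
          env: qualif
          cluster: gpfs
    honor_timestamps: True
    scheme: https
    tls_config:
      cert_file: /etc/bridge_ssl/certs/cert_brige_ibm.pem
      key_file: /etc/bridge_ssl/certs/privkey_brige_ibm.pem
      insecure_skip_verify: True

Note that even with a matching interval, a series that only receives one new sample per hour is drawn as isolated points by Prometheus, because its default staleness lookback is 5 minutes; see the query sketch at the end of the thread.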
SckyzO commented 6 months ago

Ahhhh! I didn't know this endpoint existed :-) I couldn't find any information about the sensor periods anywhere. This issue would be a good occasion to add that information to the documentation :P

Thank you @Helene :-)

SckyzO commented 6 months ago

Huuuu ... one last question: you say the GPFSFilesetQuota sensor runs once per day, but if I read the output correctly, I see 3600 (1 hour?), right?

  {
    "name": "\"GPFSFilesetQuota\"",
    "period": "3600",
    "restrict": "\"@CLUSTER_PERF_SENSOR\""
  },

Because I had the same "issue" with a 3600 s interval ...

[screenshot: metrics graph showing a broken, non-continuous curve]

I don't get a continuous curve, which means Grafana dashboards cannot really be used with these metrics. Do you see what I mean?

Helene commented 6 months ago

Huuuu ... one last question: you say the GPFSFilesetQuota sensor runs once per day, but if I read the output correctly, I see 3600 (1 hour?), right?

  {
    "name": "\"GPFSFilesetQuota\"",
    "period": "3600",
    "restrict": "\"@CLUSTER_PERF_SENSOR\""
  },

Because I had the same "issue" with a 3600 s interval ...

Sorry, you are right: it runs hourly, not daily.

It looks like quota management is not enabled on your filesystem. You can easily check this by executing the following command on any GPFS node:

[root@scale-11 ~]# mmcheckquota localFS
localFS: quota management is not enabled, or one or more quota clients are not available.
mmcheckquota: Command failed. Examine previous error messages to determine cause.

In this case you need to execute:

# mmchfs localFS -Q yes --perfileset-quota

For more about quota management, please read https://www.ibm.com/docs/en/storage-scale/5.1.9?topic=quotas-enabling-disabling-gpfs-quota-management#instq

Note: after enabling quotas, it can take half an hour before the sensor starts to collect data.

SckyzO commented 6 months ago

But ... my quotas are already enabled :)

[root@compute002 ~]# mmlsfs gpfs -Q --perfileset-quota
flag                value                    description
------------------- ------------------------ -----------------------------------
 -Q                 user;group;fileset       Quotas accounting enabled
                    user;group;fileset       Quotas enforced
                    user;group;fileset       Default quotas enabled
 --perfileset-quota yes                      Per-fileset quota enforcement

I don't think it is a GPFS issue; the other metrics work perfectly, only this one is "broken".

Helene commented 6 months ago

On my system it works fine. Check whether zimon is returning data by executing the following command on the pmcollector node:

# time echo "get group GPFSFilesetQuota bucket_size 3600 last 5" | /opt/IBM/zimon/zc

SckyzO commented 6 months ago

Yes, this command returns plenty of data ...

In the Prometheus web UI, if you query the metric gpfs_rq_blk_current in graph mode, do you get a straight (continuous) curve, or do you get a hatched/broken line like me?

Helene commented 6 months ago

Try updating the metadata:

[root@RHEL92-32 ~]# curl https://<pod-ip>:9250/update -u scale_grafana:<apiKeyValue> -k | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    42  100    42    0     0    162      0 --:--:-- --:--:-- --:--:--   163
{
  "msg": "Successfully retrieved MetaData"
}
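
Regarding the hatched/broken line discussed above: with the GPFSFilesetQuota sensor producing one sample per hour, Prometheus's default 5-minute staleness lookback renders the series as isolated points rather than a continuous line, regardless of the bridge. One workaround on the query side (a sketch, assuming the gpfs_rq_blk_current metric name mentioned earlier and default Prometheus settings) is to carry the latest sample forward with last_over_time:

  # PromQL: reuse the most recent sample from the last 2 hours at every graph step,
  # so an hourly-updated series renders as a continuous step line
  last_over_time(gpfs_rq_blk_current[2h])

Alternatively, the lookback window can be raised globally with the Prometheus server flag --query.lookback-delta, at the cost of delaying staleness detection for all series.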