IBM / ibm-spectrum-scale-bridge-for-grafana

This tool allows the IBM Storage Scale users to perform performance monitoring for IBM Storage Scale devices using third-party applications such as Grafana or Prometheus software.
Apache License 2.0

Multi pmcollector setup #213

Open gpplo opened 4 months ago

gpplo commented 4 months ago

Hi, I have 2 clusters: one with a single pmcollector, the other with 4. I'm testing the Prometheus exporter. It works on the single-pmcollector cluster, but not on the one with 4.

Did your tests cover the multi-pmcollector setup?

Helene commented 4 months ago

Hi @gpplo, I only work with environments that have at least 2 pmcollectors in federated mode. What exactly does not work in your environment?

gpplo commented 4 months ago

Hi Helene, good, I just wanted to rule that out. We are trying the exporter on a cluster with 4 pmcollectors, and it exits like this:

    2024-05-14 15:18 - MainThread - INFO - IBM Storage Scale bridge for Grafana - Version: 8.0.0
    2024-05-14 15:18 - MainThread - ERROR - QueryHandler: getTopology returns no data.
    2024-05-14 15:18 - MainThread - WARNING - No Metadata results received from the pmcollector. Start retry attempt 1 in 60s (MAX_ATTEMPTS_COUNT:3)
    2024-05-14 15:19 - MainThread - ERROR - QueryHandler: getTopology returns no data.
    2024-05-14 15:19 - MainThread - WARNING - No Metadata results received from the pmcollector. Start retry attempt 2 in 60s (MAX_ATTEMPTS_COUNT:3)
    2024-05-14 15:20 - MainThread - ERROR - QueryHandler: getTopology returns no data.
    2024-05-14 15:20 - MainThread - WARNING - No Metadata results received from the pmcollector. Start retry attempt 3 in 60s (MAX_ATTEMPTS_COUNT:3)
    %s Server internal error occurred. Reason: Empty results received
    2024-05-14 15:21 - MainThread - ERROR - Metadata could not be retrieved. Check log file for more details, quitting
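The startup behaviour visible in this log (one initial metadata request, then up to three retries 60 seconds apart, then exit) can be sketched roughly as follows. This is not the bridge's actual code: `fetch_with_retries`, its parameters, and the message wording are hypothetical and only mirror what the log shows.

```python
import time


def fetch_with_retries(fetch, max_retries=3, delay_s=60, sleep=time.sleep):
    """Call fetch() once, then retry up to max_retries times while it returns nothing.

    Hypothetical helper: `fetch` stands in for whatever requests the metadata.
    """
    result = fetch()
    retries = 0
    while not result and retries < max_retries:
        retries += 1
        print(f"No results received. Start retry attempt {retries} "
              f"in {delay_s}s (max retries: {max_retries})")
        sleep(delay_s)
        result = fetch()
    if not result:
        # Mirrors the final "Empty results received" failure in the log.
        raise RuntimeError("Empty results received")
    return result
```

With a `fetch` that always returns nothing, this makes four calls in total, which matches the four minute-stamps (15:18 to 15:21) in the log.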

On the collector side I see these warnings

    May-14 15:18:03 [Warning] FedCtrl: received unexpected NGetMetadataRep reply (peerID=2, queryID=1661)
    ...
    May-14 15:18:06 [Warning] QueryEngine: searchForMetric: could not find metaKey for given metric gpfs_disk_disksize, returning.
    May-14 15:18:06 [Warning] QueryEngine: searchForMetric: could not find metaKey for given metric gpfs_disk_free_fullkb, returning.
    May-14 15:18:06 [Warning] QueryEngine: searchForMetric: could not find metaKey for given metric gpfs_disk_free_fragkb, returning.
    ...
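To get a quick list of the metrics the collector could not resolve, the warnings above can be filtered with a few lines of Python. This is only a sketch: the regular expression is derived from the three warning lines shown here, and `missing_metrics` is a hypothetical helper name, not part of the bridge or of ZIMon.

```python
import re

# Pattern taken from the warning text quoted in this issue (an assumption about
# the general format of these messages).
METRIC_RE = re.compile(r"could not find metaKey for given metric (\w+)")


def missing_metrics(log_text):
    """Return the metric names reported as missing, in first-seen order, without duplicates."""
    seen = []
    for name in METRIC_RE.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


sample = """\
May-14 15:18:06 [Warning] QueryEngine: searchForMetric: could not find metaKey for given metric gpfs_disk_disksize, returning.
May-14 15:18:06 [Warning] QueryEngine: searchForMetric: could not find metaKey for given metric gpfs_disk_free_fullkb, returning.
May-14 15:18:06 [Warning] QueryEngine: searchForMetric: could not find metaKey for given metric gpfs_disk_free_fragkb, returning.
"""

print(missing_metrics(sample))
# prints ['gpfs_disk_disksize', 'gpfs_disk_free_fullkb', 'gpfs_disk_free_fragkb']
```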

Helene commented 4 months ago

Hi @gpplo, the communication channel works as follows: bridge > sysmon daemon > pmcollector. When the bridge starts and the connection to sysmon works, the first thing the bridge requests from the pmcollector is the metadata. The error above shows that the bridge received an empty result. Now we need to find out where the problem lies on your system: in the pmcollector itself or in the sysmon daemon. I need a few pieces of information from you:

  1. GPFS version
  2. Red Hat release, or the OS and version you use
  3. Are all 4 pmcollectors configured as peers?
  4. If you run 'mmsysmonc noderoles list' on the node where the pmcollector is running, do you see the PMCollectorNode role in the output? Example:
    
    [root@my-pmcollector-node ~]# mmsysmonc noderoles list
    ### Node Roles ###
    ------------------------------
    - CallhomeNode
    - CesManagerNode
    - ClusterManagerNode
    - FileSystemMgrNode
    - GUINode
    - LinuxNode
    - ManagerNode
    - NSDServer
    - PMCollectorNode
    - PMSensorsNode
    - QuorumNode
    - SpectrumScaleServer
    - ThresholdNode

Active Monitors
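The role check from question 4 can also be scripted when several collector nodes have to be verified. The sketch below only parses text in the format of the example output above; `has_role` is a hypothetical helper, and how the output is captured on each node (for example by running `mmsysmonc noderoles list` through `subprocess.run`) is left to the reader.

```python
def has_role(noderoles_output, role="PMCollectorNode"):
    """Return True if `role` appears as a '- <role>' entry in the given output text."""
    roles = set()
    for line in noderoles_output.splitlines():
        line = line.strip()
        # Role entries look like "- PMCollectorNode"; the dashed separator line
        # does not match because it has no space after the first dash.
        if line.startswith("- "):
            roles.add(line[2:].strip())
    return role in roles


example = """\
### Node Roles ###
------------------------------
- GUINode
- PMCollectorNode
- PMSensorsNode
"""

print(has_role(example))                  # prints True
print(has_role(example, "NSDServer"))     # prints False
```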


I will probably need to check the sysmon and pmcollector logs from your system. On the pmcollector node, run the following command to increase sysmon traces:

    # mmsysmonc d setopt logging buffer False

Start the bridge and wait until it fails. Then collect and compress the logs from:

    /var/log/ibm_bridge_for_grafana
    /var/log/zimon
    /var/adm/ras/ (only the latest mmsysmonitor..log)

Don't forget to decrease sysmon traces afterwards:

    # mmsysmonc d setopt logging buffer True
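The collection step can be done by hand with tar, or with a small script such as the sketch below. It is an illustration only: `latest` and `collect_logs` are hypothetical helpers, the archive name is made up, and the glob pattern for the sysmonitor log is an assumption because the exact file name is not legible in the text above.

```python
import glob
import os
import tarfile


def latest(pattern):
    """Return the most recently modified file matching `pattern`, or None if nothing matches."""
    files = glob.glob(pattern)
    return max(files, key=os.path.getmtime) if files else None


def collect_logs(paths, archive_path):
    """Add every existing path to a gzip-compressed tar archive and return the paths added."""
    added = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in paths:
            if path and os.path.exists(path):
                tar.add(path)  # directories are added recursively
                added.append(path)
    return added


# Example call on the pmcollector node (paths from the list above; the glob
# pattern and the archive name are assumptions):
#
# collect_logs(
#     ["/var/log/ibm_bridge_for_grafana",
#      "/var/log/zimon",
#      latest("/var/adm/ras/mmsysmonitor*.log")],
#     "/tmp/bridge_issue_logs.tar.gz",
# )
```

Paths that do not exist are skipped rather than raising an error, so the returned list shows what actually went into the archive.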

You can open a ticket with IBM Storage Scale support, referencing this issue and my name, and upload the collected packages there.