gpplo opened this issue 4 months ago
Hi @gpplo, I work only with environments that have at least 2 pmcollectors in federated mode. What exactly does not work in your environment?
Hi Helene, good, I just wanted to rule this out. We are trying the exporter on a cluster with 4 pmcollectors, and the exporter exits like this:
```
2024-05-14 15:18 - MainThread - INFO - IBM Storage Scale bridge for Grafana - Version: 8.0.0
2024-05-14 15:18 - MainThread - ERROR - QueryHandler: getTopology returns no data.
2024-05-14 15:18 - MainThread - WARNING - No Metadata results received from the pmcollector. Start retry attempt 1 in 60s (MAX_ATTEMPTS_COUNT:3)
2024-05-14 15:19 - MainThread - ERROR - QueryHandler: getTopology returns no data.
2024-05-14 15:19 - MainThread - WARNING - No Metadata results received from the pmcollector. Start retry attempt 2 in 60s (MAX_ATTEMPTS_COUNT:3)
2024-05-14 15:20 - MainThread - ERROR - QueryHandler: getTopology returns no data.
2024-05-14 15:20 - MainThread - WARNING - No Metadata results received from the pmcollector. Start retry attempt 3 in 60s (MAX_ATTEMPTS_COUNT:3)
%s Server internal error occurred. Reason: Empty results received
2024-05-14 15:21 - MainThread - ERROR - Metadata could not be retrieved. Check log file for more details, quitting
```
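For context, the retry behaviour visible in the log above (3 attempts, 60 s apart, then quit) can be sketched roughly like this. This is an illustrative sketch, not the bridge's actual code; the function name and the `fetch` callable are my own inventions, only `MAX_ATTEMPTS_COUNT` and the wait interval come from the log.

```python
import time

MAX_ATTEMPTS_COUNT = 3   # matches the log above
RETRY_DELAY = 60         # seconds between attempts, as in the log

def get_metadata_with_retry(fetch, delay=RETRY_DELAY):
    """Call fetch() up to MAX_ATTEMPTS_COUNT times; return the first
    non-empty result, or None if every attempt comes back empty."""
    for attempt in range(1, MAX_ATTEMPTS_COUNT + 1):
        result = fetch()
        if result:
            return result
        print(f"No Metadata results received from the pmcollector. "
              f"Start retry attempt {attempt} in {delay}s "
              f"(MAX_ATTEMPTS_COUNT:{MAX_ATTEMPTS_COUNT})")
        time.sleep(delay)
    return None  # caller then logs the error and quits
```

In the failing cluster every attempt returns empty, so the loop exhausts all three attempts and the bridge gives up, which matches the "Metadata could not be retrieved ... quitting" line.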
On the collector side I see these warnings:
```
May-14 15:18:03 [Warning] FedCtrl: received unexpected NGetMetadataRep reply (peerID=2, queryID=1661)
...
May-14 15:18:06 [Warning] QueryEngine: searchForMetric: could not find metaKey for given metric gpfs_disk_disksize, returning.
May-14 15:18:06 [Warning] QueryEngine: searchForMetric: could not find metaKey for given metric gpfs_disk_free_fullkb, returning.
May-14 15:18:06 [Warning] QueryEngine: searchForMetric: could not find metaKey for given metric gpfs_disk_free_fragkb, returning.
...
```
Hi @gpplo, the communication channel works as follows: bridge > sysmon daemon > pmcollector. When the bridge starts and the connection to sysmon works fine, the first thing the bridge requests from the pmcollector is the Metadata. From the error above I see that the bridge received an empty result back. Now we need to verify where the bottleneck is on your system: is it the pmcollector itself, or might it be the sysmon daemon? A couple of questions I need info on from you:
Do you see the PMCollectorNode role in the output of `mmsysmonc noderoles list` on your pmcollector nodes?
Example:
```
[root@my-pmcollector-node ~]# mmsysmonc noderoles list
### Node Roles ###
------------------------------
- CallhomeNode
- CesManagerNode
- ClusterManagerNode
- FileSystemMgrNode
- GUINode
- LinuxNode
- ManagerNode
- NSDServer
- PMCollectorNode
- PMSensorsNode
- QuorumNode
- SpectrumScaleServer
- ThresholdNode
```
Probably I will need to check the sysmon and pmcollector logs on your system. On the pmcollector node, run the following command to increase the sysmon traces:
```
# mmsysmonc d setopt logging buffer False
```
Start the bridge and wait until it fails.
Collect and compress the logs from:
/var/log/ibm_bridge_for_grafana
/var/log/zimon
/var/adm/ras/ (only the latest mmsysmonitor log)
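The collection step above can be sketched as a small script. This is a minimal sketch under my own assumptions: the `collect_logs` function and the archive name are hypothetical helpers, only the directory paths come from the list above, and the `mmsysmonitor*` glob is my guess at how to select the latest mmsysmonitor log.

```python
import glob
import os
import tarfile

def collect_logs(out_path, log_dirs, ras_glob="/var/adm/ras/mmsysmonitor*"):
    """Pack the given log directories plus only the newest file matching
    ras_glob into a gzip'd tar archive; silently skip missing paths."""
    with tarfile.open(out_path, "w:gz") as tar:
        for d in log_dirs:
            if os.path.exists(d):
                tar.add(d)  # adds the directory recursively
        ras_files = sorted(glob.glob(ras_glob), key=os.path.getmtime)
        if ras_files:
            tar.add(ras_files[-1])  # latest mmsysmonitor log only
    return out_path

# Usage (paths from the list above):
# collect_logs("bridge_debug.tar.gz",
#              ["/var/log/ibm_bridge_for_grafana", "/var/log/zimon"])
```

The resulting archive is what you would then upload to the support ticket.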
Don't forget to decrease the sysmon traces afterwards:
```
# mmsysmonc d setopt logging buffer True
```
You can open a ticket with IBM Storage Scale support, referencing this issue and my name, and upload the collected packages there.
Hi, I have 2 clusters: one with a single pmcollector, the second one with 4. I'm testing the Prometheus exporter. It works on the single-pmcollector cluster; it doesn't on the one with 4.
Did your tests cover the multi-pmcollector setup?