IBM / ibm-spectrum-scale-bridge-for-grafana

This tool allows IBM Storage Scale users to monitor the performance of IBM Storage Scale devices using third-party applications such as Grafana or Prometheus.
Apache License 2.0

K8S deploy failure #214

Closed · hunter44321 closed 4 months ago

hunter44321 commented 4 months ago

I'm trying to run grafana_bridge v8.0.0 in a k8s environment using the following deployment.yaml: (truncated)

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ibm-grafana-bridge-test-zimon-config
  namespace: monitoring
data:
  ZIMonSensors.cfg: |
    XXXXXXXX
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ibm-grafana-bridge-tls-config
  namespace: test
data:
  privkey.pem: |
    -----BEGIN PRIVATE KEY-----
    XXXX
    -----END PRIVATE KEY-----
  cert.pem: |
    -----BEGIN CERTIFICATE-----
    XXXX
    -----END CERTIFICATE-----
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ibm-grafana-bridge
  namespace: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ibm-grafana-bridge
  template:
    metadata:
      labels:
        app: ibm-grafana-bridge
    spec:
      containers:
      - name: ibm-grafana-bridge-test
        image: XXXXX/ibm_grafana_bridge:8.0.0
        resources:
          requests:
            memory: "100Mi"
            cpu: "100m"
          limits:
            memory: "4Gi"
            cpu: "4"
        ports:
        - name: test-port
          containerPort: 9250
        env:
        - name: SERVER
          value: "XXXX"
        - name: APIKEYVALUE
          value: "XXXX"
        - name: PROMETHEUS
          value: "9250"
        - name: TLSKEYPATH
          value: "/etc/bridge_ssl/certs"
        - name: PORT
          value: "4343"
        - name: TLSKEYFILE
          value: "privkey.pem"
        - name: TLSCERTFILE
          value: "cert.pem"
        - name: BASICAUTHPASSW
          value: "MTExMTExCg=="
        - name: BASICAUTH
          value: "False"
        volumeMounts:
        - name: logfiles
          mountPath: /var/log/ibm_bridge_for_grafana/
        - name: config-volume-test
          mountPath: /opt/IBM/zimon/ZIMonSensors.cfg
          subPath: ZIMonSensors.cfg
        - name: ibm-grafana-bridge-tls-volume
          mountPath: /etc/bridge_ssl/certs
          subPath: certs
      volumes:
      - name: logfiles
        emptyDir: {}
      - name: config-volume-test
        configMap:
          name: ibm-grafana-bridge-test-zimon-config
      - name: ibm-grafana-bridge-tls-volume
        configMap:
          name: ibm-grafana-bridge-tls-config
---
apiVersion: v1
kind: Service
metadata:
  name: ibm-grafana-bridge-test-service
  namespace: test
spec:
  selector:
    app: ibm-grafana-bridge
  ports:
    - protocol: TCP
      port: 9100
      targetPort: test-port

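(For reference, a minimal way to apply and verify the manifests above, assuming kubectl access and the resource names/namespaces shown; the file name deployment.yaml is taken from the description:)

# Apply the manifests and wait for the bridge pod to become ready
kubectl apply -f deployment.yaml
kubectl -n test get pods -l app=ibm-grafana-bridge -w

# Follow the startup traces the bridge writes to stdout
kubectl -n test logs -f deploy/ibm-grafana-bridge
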
At first it works fine; startup logs:

2024-05-16 14:06 - MainThread                               - INFO     -  *** IBM Storage Scale bridge for Grafana - Version: 8.0.0-dev ***
2024-05-16 14:06 - MainThread                               - INFO     - Successfully retrieved MetaData
2024-05-16 14:06 - MainThread                               - INFO     - Received sensors:DiskFree, GPFSXXXXX,XXXXX,XXXX
2024-05-16 14:06 - MainThread                               - INFO     - Initial cherryPy server engine start have been invoked. Python version: 3.9.18 (main, Jan  4 2024, 00:00:00) 
[GCC 11.4.1 20230605 (Red Hat 11.4.1-2)], cherryPy version: 18.9.0.

But after that it restarts without any further logs. What can be done in this case? Do you have any additional flags to raise the log level for debugging purposes?
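
(One way to inspect the restart from the Kubernetes side, assuming kubectl access to the test namespace; <bridge-pod-name> is a placeholder for the actual pod name:)

# Show restart count, last state and termination reason (e.g. OOMKilled, Error)
kubectl -n test describe pod <bridge-pod-name>

# Print the logs of the previous (crashed) container instance
kubectl -n test logs <bridge-pod-name> --previous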

Helene commented 4 months ago

hi @hunter44321,

Regarding log trace management: the log level is controlled by the -c parameter.

# python3 zimonGrafanaIntf.py -h
usage: python zimonGrafanaIntf.py [-h] [-s SERVER] [-P {9980,9981}] [-l LOGPATH] [-f LOGFILE] [-c LOGLEVEL] [-e PROMETHEUS] [-p PORT] [-r {http,https}]
                                  [-u USERNAME] [-a [PASSWORD]] [-t TLSKEYPATH] [-k TLSKEYFILE] [-m TLSCERTFILE] [-n APIKEYNAME] [-v [APIKEYVALUE]] [-d {yes,no}]

optional arguments:
  -h, --help            show this help message and exit
  -s SERVER, --server SERVER
                        Host name or ip address of the ZIMon collector (Default from config.ini: 127.0.0.1)
  -P {9980,9981}, --serverPort {9980,9981}
                        ZIMon collector port number (Default from config.ini: 9980)
  -l LOGPATH, --logPath LOGPATH
                        location path of the log file (Default from config.ini: '/var/log/ibm_bridge_for_grafana')
  -f LOGFILE, --logFile LOGFILE
                        Name of the log file (Default from config.ini: zserver.log). If no log file name specified all traces will be printed out directly on the
                        command line
  -c LOGLEVEL, --logLevel LOGLEVEL
                        log level. Available levels: 10 (DEBUG), 15 (MOREINFO), 20 (INFO), 30 (WARN), 40 (ERROR) (Default from config.ini: 15)
  -e PROMETHEUS, --prometheus PROMETHEUS
                        port number listening on Prometheus HTTPS connections (Default from config.ini: 9250, if enabled)
  -p PORT, --port PORT  port number listening on OpenTSDB API HTTP(S) connections (Default from config.ini: 4242, if enabled)
  -r {http,https}, --protocol {http,https}
                        Connection protocol HTTP/HTTPS (Default from config.ini: "http")
  -u USERNAME, --username USERNAME
                        HTTP/S basic authentication user name(Default from config.ini: 'scale_admin')
  -a [PASSWORD], --password [PASSWORD]
                        Enter your HTTP/S basic authentication password:
  -t TLSKEYPATH, --tlsKeyPath TLSKEYPATH
                        Directory path of tls privkey.pem and cert.pem file location (Required only for HTTPS ports 8443/9250)
  -k TLSKEYFILE, --tlsKeyFile TLSKEYFILE
                        Name of TLS key file, f.e.: privkey.pem (Required only for HTTPS ports 8443/9250)
  -m TLSCERTFILE, --tlsCertFile TLSCERTFILE
                        Name of TLS certificate file, f.e.: cert.pem (Required only for HTTPS ports 8443/9250)
  -n APIKEYNAME, --apiKeyName APIKEYNAME
                        Name of api key file (Default from config.ini: 'scale_grafana')
  -v [APIKEYVALUE], --apiKeyValue [APIKEYVALUE]
                        Enter your apiKey value:
  -d {yes,no}, --includeDiskData {yes,no}
                        Use or not the historical data from disk (Default from config.ini: "no")

For the container version, the log level is already set to DEBUG in the Dockerfile.
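
(For example, running the bridge directly on a host, outside the container, with DEBUG traces could look roughly like the sketch below, using only the flags documented in the help output above; <collector-host> and <api-key-value> are placeholders:)

# -c 10 raises the trace level to DEBUG; -e 9250 enables the Prometheus exporter port
python3 zimonGrafanaIntf.py -s <collector-host> -P 9980 -e 9250 -c 10 \
    -t /etc/bridge_ssl/certs -k privkey.pem -m cert.pem -v <api-key-value>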

From the log snippets you provided I see two things:

1) You are using the development version of grafana-bridge

*** IBM Storage Scale bridge for Grafana - Version: 8.0.0-dev ***

Please use the released version https://github.com/IBM/ibm-spectrum-scale-bridge-for-grafana/releases/tag/v.8.0.0

2) The cherryPy server did not start

In your Deployment manifest you have mounted the grafana-bridge log directory into the logfiles emptyDir volume:

        volumeMounts:
        - name: logfiles
          mountPath: /var/log/ibm_bridge_for_grafana/
        - name: config-volume-test
          mountPath: /opt/IBM/zimon/ZIMonSensors.cfg
          subPath: ZIMonSensors.cfg
        - name: ibm-grafana-bridge-tls-volume
          mountPath: /etc/bridge_ssl/certs
          subPath: certs
      volumes:
      - name: logfiles
        emptyDir: {}

Please check whether a cherrypy_error.log is present there, and send me all the logs from this directory by email.
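
(Something like the following could pull those files out of the running pod, assuming kubectl access; <bridge-pod-name> is a placeholder:)

# List the log files written into the logfiles emptyDir volume
kubectl -n test exec <bridge-pod-name> -- ls -l /var/log/ibm_bridge_for_grafana/

# Copy the cherrypy error log to the local machine
kubectl cp test/<bridge-pod-name>:/var/log/ibm_bridge_for_grafana/cherrypy_error.log ./cherrypy_error.log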

Before we start checking your deployment scripts, we can build, run, and verify the grafana-bridge container itself on your system. For example, I run the grafana-bridge pod via podman:

[root@RHEL92-32 ~]# podman run -dt -p 4242:4242,9250:9250 -e "SERVER=9.15X.XXX.XX8" -e "APIKEYVALUE=c4824386-XXXX-XXXX-XXXX-XXXXXc11d" -e "PORT=4242" -e "PROMETHEUS=9250" -e "PROTOCOL=http" -e "BASICAUTH=False" -e "TLSKEYFILE=privkey.pem" -e "TLSCERTFILE=cert.pem" -v /tmp:/var/log/ibm_bridge_for_grafana --mount type=bind,src=/home/zimon/ZIMonSensors.cfg,target=/opt/IBM/zimon/ZIMonSensors.cfg,ro=true --pod new:my-bridge-basic-auth-test-pod --name bridge-basic-auth-test scale_bridge:test_8.0.0_dev
7ea4f2ee7f1e04b4d8d6a3ae394431d7e72370ce43c0334361daeecd2573dc3d

Since the grafana-bridge writes traces to stdout during startup, I can check them via podman logs:

[root@RHEL92-32~]# podman logs bridge-basic-auth-test
2024-05-16 18:08 - MainThread                               - INFO     -  *** IBM Storage Scale bridge for Grafana - Version: 8.0.0-dev ***
2024-05-16 18:08 - MainThread                               - INFO     - Successfully retrieved MetaData
2024-05-16 18:08 - MainThread                               - INFO     - Received sensors:CPU, DiskFree, GPFSBufMgr, GPFSFilesystem, GPFSFilesystemAPI, GPFSNSDDisk, GPFSNSDFS, GPFSNSDPool, GPFSNode, GPFSNodeAPI, GPFSRPCS, GPFSVFSX, GPFSWaiters, IPFIX, Load, Memory, Netstat, Network, TopProc, CTDBDBStats, CTDBStats, NFSIO, SMBGlobalStats, SMBStats, GPFSDiskCap, GPFSFileset, GPFSInodeCap, GPFSPool, GPFSPoolCap
2024-05-16 18:08 - MainThread                               - INFO     - Initial cherryPy server engine start have been invoked. Python version: 3.9.18 (main, Jan  4 2024, 00:00:00)
[GCC 11.4.1 20230605 (Red Hat 11.4.1-2)], cherryPy version: 18.9.0.
2024-05-16 18:08 - MainThread                               - INFO     - Registered applications:
 OpenTSDB Api listening on Grafana queries,
 Prometheus Exporter Api listening on Prometheus requests
2024-05-16 18:08 - MainThread                               - INFO     - server started

or I can check the /tmp directory, where I mounted the logs when starting the pod:

[root@RHEL92-32 ~]# cd /tmp
[root@RHEL92-32 tmp]# ls
cherrypy_access.log  cherrypy_error.log  zserver.log

This article might help you set up a single grafana-bridge pod.

It takes a little extra effort, but this way we can determine whether the problem is in grafana-bridge itself or in the deployment style.

hunter44321 commented 4 months ago

Thank you, your instructions helped. It was my fault:

        - name: ibm-grafana-bridge-tls-volume
          mountPath: /etc/bridge_ssl/certs
          subPath: certs <----------------------------- the deployment has been fixed by removing this line.
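
For reference, the working TLS mount from the fixed Deployment then looks roughly like this (a sketch based on the manifest above; without subPath the ConfigMap keys privkey.pem and cert.pem are projected as files directly under the mountPath, which is what TLSKEYPATH expects):

        # ConfigMap keys become files under /etc/bridge_ssl/certs
        - name: ibm-grafana-bridge-tls-volume
          mountPath: /etc/bridge_ssl/certs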