OpenCHAMI / smd

MIT License

[TEST] Test SMD within the HPE CSM pipeline to identify regressions #42

Open alexlovelltroy opened 3 weeks ago

alexlovelltroy commented 3 weeks ago

This is a placeholder for @mtupitsyn to add any errors he finds with running SMD at HPE.

mtupitsyn commented 3 weeks ago

Launched a vShasta instance with the SMD image overridden with ghcr.io/openchami/smd:2 (ghcr.io/openchami/smd:sha256-95fa016542c9fc68ef24281ce8b4a5594d3a958a0367fc365e9953476fc06bc4). PostgreSQL seems to be configured differently for the OpenCHAMI version of SMD:

# kubectl -n services logs cray-smd-5c6c9cf95c-g5tvs
Version: 2.17.7
Git Commit: 17779950cac79bd885928516b6396a97c55d29bf
Build Time: 1730131725
Git Branch: HEAD
Git Tag: v2.17.7
Git State: clean
Build Host: fv-az777-136
Go Version: go1.23.2
Build User: runner
2024/11/08 02:29:56.994505 main.go:778: Starting... cray-smd-5c6c9cf95c-g5tvs 2.17.7 17779950cac79bd885928516b6396a97c55d29bf
2024/11/08 02:29:56.994582 main.go:779: Version: 2.17.7, Git Commit: 17779950cac79bd885928516b6396a97c55d29bf, Build Time: 1730131725, Git Branch: HEAD, Git Tag: v2.17.7, Git State: clean, Build Host: fv-az777-136, Go Version: go1.23.2, Build User: runner
2024/11/08 02:29:56.994739 main.go:819: Connecting to data store (Postgres)...
2024/11/08 02:29:57.008114 hmsds-postgres.go:280: Error: System table query failed: pq: relation "system" does not exist
2024/11/08 02:29:57.008162 hmsds-postgres.go:235: Error: Open(): Schema check failed: pq: relation "system" does not exist
2024/11/08 02:29:57.008287 main.go:855: DB Connection failed.  Retrying in 5 seconds
2024/11/08 02:30:02.020840 hmsds-postgres.go:280: Error: System table query failed: pq: relation "system" does not exist
2024/11/08 02:30:02.020878 hmsds-postgres.go:235: Error: Open(): Schema check failed: pq: relation "system" does not exist
2024/11/08 02:30:02.020954 main.go:855: DB Connection failed.  Retrying in 5 seconds
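
The log above shows a startup loop: open the connection, run a schema check against the "system" table, and retry every 5 seconds on failure. A minimal Python sketch of that pattern is below. The names `wait_for_schema` and `check_schema`, and the attempt cap, are hypothetical and for illustration only; this is not SMD's actual code (which is written in Go), and the log does not show whether SMD caps its retries.

```python
import time
from typing import Callable


def wait_for_schema(
    check_schema: Callable[[], None],
    retry_seconds: float = 5.0,
    max_attempts: int = 10,  # assumption: the log does not show a retry cap
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call check_schema until it stops raising; return the attempt count.

    check_schema is a hypothetical callable that raises on failure, for
    example when the "system" table does not exist yet.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            check_schema()
            return attempt
        except Exception as err:
            print(f"DB Connection failed ({err}).  Retrying in {retry_seconds:g} seconds")
            if attempt < max_attempts:
                sleep(retry_seconds)
    raise RuntimeError(f"schema check still failing after {max_attempts} attempts")
```

If the schema was never created, a loop like this cannot recover on its own; something else has to create the tables first.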

For the HPE version of SMD, the PostgreSQL connection is configured via environment variables:

# kubectl -n services get deploy cray-smd -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    cray.io/service: cray-smd
    deployment.kubernetes.io/revision: "1"
    meta.helm.sh/release-name: cray-hms-smd
    meta.helm.sh/release-namespace: services
  creationTimestamp: "2024-11-07T20:29:21Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: cray-hms-smd
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: cray-smd
    helm.sh/base-chart: cray-service-11.0.0
    helm.sh/chart: cray-hms-smd-7.1.18
  name: cray-smd
  namespace: services
  resourceVersion: "35347"
  uid: b35613aa-1da2-4105-82a3-f848ffe21897
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: cray-hms-smd
      app.kubernetes.io/name: cray-smd
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        service.cray.io/public: "true"
        sidecar.istio.io/proxyCPU: 10m
        sidecar.istio.io/proxyCPULimit: 1000m
        sidecar.istio.io/proxyMemory: 100Mi
        sidecar.istio.io/proxyMemoryLimit: 512Mi
        traffic.sidecar.istio.io/excludeOutboundPorts: 8082,9092,2181
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: cray-hms-smd
        app.kubernetes.io/name: cray-smd
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - istio-ingressgateway
                  - istio-ingressgateway-hmn
              namespaces:
              - istio-system
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - env:
        - name: SMD_DBHOST
          value: cray-smd-postgres
        - name: SMD_DBPORT
          value: "5432"
        - name: SMD_DBNAME
          value: hmsds
        - name: SMD_DBOPTS
        - name: SMD_DBUSER
          valueFrom:
            secretKeyRef:
              key: username
              name: hmsdsuser.cray-smd-postgres.credentials
        - name: SMD_DBPASS
          valueFrom:
            secretKeyRef:
              key: password
              name: hmsdsuser.cray-smd-postgres.credentials
        - name: RF_MSG_HOST
          value: cray-shared-kafka-kafka-bootstrap.services.svc.cluster.local:9092:cray-dmtf-resource-event
        - name: TLSCERT
        - name: TLSKEY
        - name: VAULT_ADDR
          value: http://cray-vault.vault:8200
        - name: VAULT_SKIP_VERIFY
          value: "true"
        - name: SMD_RVAULT
          value: "true"
        - name: SMD_WVAULT
          value: "true"
        - name: SMD_SLS_HOST
          value: http://cray-sls/v1
        - name: LOGLEVEL
          value: "2"
        - name: SMD_HWINVHIST_AGE_MAX_DAYS
          value: "365"
        - name: HMS_CONFIG_PATH
          value: /hms_config/hms_config.json
        - name: SMD_CA_URI
          valueFrom:
            configMapKeyRef:
              key: CA_URI
              name: smd-cacert-info
        - name: SMD_HBTD_HOST
          value: http://cray-hbtd/hmi/v1
        - name: GOMAXPROCS
          value: "8"
        - name: POSTGRES_HOST
          value: cray-smd-postgres
        - name: POSTGRES_PORT
          value: "5432"
        image: artifactory.algol60.net/csm-docker/stable/ghcr.io/openchami/smd:2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /hsm/v2/service/liveness
            port: 27779
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: cray-smd
        ports:
        - containerPort: 27779
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /hsm/v2/service/ready
            port: 27779
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          limits:
            cpu: "2"
            memory: 1Gi
          requests:
            cpu: 10m
            memory: 128Mi
        securityContext:
          runAsGroup: 65534
          runAsNonRoot: true
          runAsUser: 65534
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /hms_config/
          name: hms-config-vol
        - mountPath: /usr/local/cray-pki
          name: cray-pki-cacert-vol
      dnsPolicy: ClusterFirst
      priorityClassName: csm-high-priority-service
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: cray-configmap-ca-public-key
          optional: true
        name: cray-pki-cacert-vol
      - configMap:
          defaultMode: 420
          name: cray-hms-base-config
          optional: true
        name: hms-config-vol
status:
  conditions:
  - lastTransitionTime: "2024-11-07T20:29:21Z"
    lastUpdateTime: "2024-11-07T20:29:21Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2024-11-07T20:39:22Z"
    lastUpdateTime: "2024-11-07T20:39:22Z"
    message: ReplicaSet "cray-smd-5c6c9cf95c" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 1
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1 
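
As a rough illustration of how the SMD_DB* variables in the deployment could be combined into a PostgreSQL connection string, here is a minimal Python sketch. The variable names come from the manifest above; the function `build_dsn`, the default values, the `postgres://` URL form, and the treatment of SMD_DBOPTS as a query string are assumptions, not SMD's actual logic. Note that the manifest also sets POSTGRES_HOST and POSTGRES_PORT, which this sketch ignores.

```python
import os
from typing import Mapping, Optional
from urllib.parse import quote


def build_dsn(env: Optional[Mapping[str, str]] = None) -> str:
    """Build a postgres:// URL from SMD_DB* variables (hypothetical helper)."""
    env = os.environ if env is None else env
    host = env.get("SMD_DBHOST", "localhost")  # default is an assumption
    port = env.get("SMD_DBPORT", "5432")
    name = env.get("SMD_DBNAME", "hmsds")
    # Percent-encode credentials so characters like "@" or "/" stay valid.
    user = quote(env.get("SMD_DBUSER", ""), safe="")
    password = quote(env.get("SMD_DBPASS", ""), safe="")
    opts = env.get("SMD_DBOPTS", "")
    dsn = f"postgres://{user}:{password}@{host}:{port}/{name}"
    return f"{dsn}?{opts}" if opts else dsn
```

Passing a plain dict instead of the process environment makes it easy to compare what two differently configured images would end up connecting to.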

mtupitsyn commented 3 weeks ago

OK, it looks like the error above was caused by a problem with database initialization. The CSM Helm chart for SMD contains a job that initializes the database using the SMD image. The attempt to run this job with the OpenCHAMI SMD image failed, so the database remained uninitialized (the system table did not exist).

For now I decided to go a different route: instead of deploying the OpenCHAMI SMD image as part of a modified CSM distro, I created a regular CSM vShasta deployment (which initialized the database) and then re-deployed the SMD container (by issuing the kubectl edit deploy cray-hms-smd command). The OpenCHAMI SMD container started without an issue. Logs for the CSM and OpenCHAMI versions look different due to formatting. I have attached pod logs.

One difference I noticed in the logs is the reaction to the absence of a TLS cert:

CSM:

2024/11/08 21:26:35.237349 tls.go:44: Cert or key path was the empty string
2024/11/08 21:26:35.237353 smd.go:921: Warning: TLS cert or key file missing, falling back to http

OpenCHAMI:

2024/11/08 21:08:59.921031 tls.go:47: generate Certs
2024/11/08 21:09:00.456981 tls.go:132: Failed to open /etc/cert.pem for writing: open /etc/cert.pem: permission denied!
2024/11/08 21:09:00.457023 tls.go:54: Error: Couldn't create https certs.
2024/11/08 21:09:00.457032 main.go:975: Warning: TLS cert or key file missing, falling back to http
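
Both variants end up on plain HTTP, but they get there differently. The following Python sketch models the two paths seen in the logs. The names `pick_scheme` and `generate_certs` are hypothetical, the "both paths set means https" rule is a simplification, and the real logic lives in the Go sources of each project.

```python
from typing import Callable, Optional


def pick_scheme(
    cert_path: str,
    key_path: str,
    generate_certs: Optional[Callable[[], None]] = None,
) -> str:
    """Return "https" if TLS material is usable, otherwise "http"."""
    if cert_path and key_path:
        return "https"
    if generate_certs is None:
        # CSM-style path: empty cert or key path, fall back immediately.
        print("Cert or key path was the empty string")
    else:
        # OpenCHAMI-style path: try to create certs before giving up.
        try:
            generate_certs()
            return "https"
        except OSError as err:
            print(f"Error: Couldn't create https certs: {err}")
    print("Warning: TLS cert or key file missing, falling back to http")
    return "http"
```

The permission error in the OpenCHAMI log is consistent with the pod's securityContext (runAsNonRoot, UID 65534), which would likely not be allowed to write under /etc.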

I also ran the smoke / functional tests provided by the Cray HMS team. All tests passed for both the CSM and OpenCHAMI versions of the container. Test logs are also in the attached file, but there's not much information: those are CT tests, run in pods that were deleted after test execution. The test definitions, as far as I could reverse-engineer the HMS testing suite, are here:

alexlovelltroy commented 3 weeks ago

We have an smd-init container as well. It should function the same as the smd container

mtupitsyn commented 3 weeks ago

> We have an smd-init container as well. It should function the same as the smd container

Sounds good, I'll look more into this on Monday. Unfortunately I didn't grab the initialization job logs before I re-created the system, so I don't know why it failed. By the way, it is not an init container, it's a separate job (I suppose a Helm pre-install hook or something like that).

alexlovelltroy commented 3 weeks ago

I think the difference in reporting for TLS is a non-issue. The reporting is better with the new logging library, but the functionality is the same.

You should review the way we handle smd initialization through our docker compose file. There is a job that updates the database as well as an smd container for execution. We may have changed the environment variables a little bit to keep them consistent across services.

The lack of failing tests is worrying though. We've definitely disabled internal discovery services and internal redfish listeners in SMD.

https://github.com/OpenCHAMI/deployment-recipes/blob/main/quickstart/openchami-svcs.yml#L2C1-L53C17

mtupitsyn commented 3 weeks ago

I see the difference in the initialization sequence. The CSM version of the SMD initialization job supports persistence, meaning that during container startup, SQL migration scripts are copied from the container filesystem to a mounted persistent volume. I suppose this allows idempotent application of migration scripts during an SMD upgrade.

To make database initialization succeed, I re-created the job with the following changes:

  1. Container entrypoint changed from /entrypoint.sh smd-init to /smd-init. The entrypoint.sh script copies SQL migration scripts from migrations to /persistent_migrations at runtime: https://github.com/Cray-HPE/hms-smd/blob/master/entrypoint.sh#L35
  2. Mount point /persistent_migrations removed (as it would override the /persistent_migrations directory that ships with the SMD container).
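
For readers unfamiliar with the CSM behavior, the copy step performed at startup can be pictured roughly as below. This is a Python sketch of the idea only, not a translation of entrypoint.sh; the function name `sync_migrations`, the `*.sql` pattern, and the directory arguments are assumptions.

```python
import shutil
from pathlib import Path
from typing import List


def sync_migrations(src: Path, dst: Path) -> List[str]:
    """Copy migration scripts from the image directory to the persistent one.

    Returns the names of the copied scripts in sorted order. Existing files
    with the same name in dst are overwritten.
    """
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for script in sorted(src.glob("*.sql")):  # assumption: scripts end in .sql
        shutil.copy2(script, dst / script.name)
        copied.append(script.name)
    return copied
```

Calling /smd-init directly skips a step like this, so the migrations are read from wherever the image already keeps them and no persistent volume is involved.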

For the tests that pass, I suppose we will need SME help, similar to the BSS testing task.

alexlovelltroy commented 3 weeks ago

Thanks for reviewing the persistent_migrations->migrations change and succinctly describing the difference. It's a good breadcrumb for the SME to follow. As we move forward with HPE using OpenCHAMI as the upstream, I'd suggest that the OpenCHAMI method is correct and persistent_migrations is not needed, even in the case of an upgrade. I can't think of a case where having the migrations disconnected from the container is necessary.