alexlovelltroy opened 3 weeks ago
Launched a vShasta instance with the SMD image overridden with ghcr.io/openchami/smd:2 (ghcr.io/openchami/smd:sha256-95fa016542c9fc68ef24281ce8b4a5594d3a958a0367fc365e9953476fc06bc4). PostgreSQL seems to be configured differently for the OpenCHAMI version of SMD:
# kubectl -n services logs cray-smd-5c6c9cf95c-g5tvs
Version: 2.17.7
Git Commit: 17779950cac79bd885928516b6396a97c55d29bf
Build Time: 1730131725
Git Branch: HEAD
Git Tag: v2.17.7
Git State: clean
Build Host: fv-az777-136
Go Version: go1.23.2
Build User: runner
2024/11/08 02:29:56.994505 main.go:778: Starting... cray-smd-5c6c9cf95c-g5tvs 2.17.7 17779950cac79bd885928516b6396a97c55d29bf
2024/11/08 02:29:56.994582 main.go:779: Version: 2.17.7, Git Commit: 17779950cac79bd885928516b6396a97c55d29bf, Build Time: 1730131725, Git Branch: HEAD, Git Tag: v2.17.7, Git State: clean, Build Host: fv-az777-136, Go Version: go1.23.2, Build User: runner
2024/11/08 02:29:56.994739 main.go:819: Connecting to data store (Postgres)...
2024/11/08 02:29:57.008114 hmsds-postgres.go:280: Error: System table query failed: pq: relation "system" does not exist
2024/11/08 02:29:57.008162 hmsds-postgres.go:235: Error: Open(): Schema check failed: pq: relation "system" does not exist
2024/11/08 02:29:57.008287 main.go:855: DB Connection failed. Retrying in 5 seconds
2024/11/08 02:30:02.020840 hmsds-postgres.go:280: Error: System table query failed: pq: relation "system" does not exist
2024/11/08 02:30:02.020878 hmsds-postgres.go:235: Error: Open(): Schema check failed: pq: relation "system" does not exist
2024/11/08 02:30:02.020954 main.go:855: DB Connection failed. Retrying in 5 seconds
For the HPE version of SMD, the PostgreSQL connection is configured via environment variables:
# kubectl -n services get deploy cray-smd -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    cray.io/service: cray-smd
    deployment.kubernetes.io/revision: "1"
    meta.helm.sh/release-name: cray-hms-smd
    meta.helm.sh/release-namespace: services
  creationTimestamp: "2024-11-07T20:29:21Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: cray-hms-smd
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: cray-smd
    helm.sh/base-chart: cray-service-11.0.0
    helm.sh/chart: cray-hms-smd-7.1.18
  name: cray-smd
  namespace: services
  resourceVersion: "35347"
  uid: b35613aa-1da2-4105-82a3-f848ffe21897
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: cray-hms-smd
      app.kubernetes.io/name: cray-smd
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        service.cray.io/public: "true"
        sidecar.istio.io/proxyCPU: 10m
        sidecar.istio.io/proxyCPULimit: 1000m
        sidecar.istio.io/proxyMemory: 100Mi
        sidecar.istio.io/proxyMemoryLimit: 512Mi
        traffic.sidecar.istio.io/excludeOutboundPorts: 8082,9092,2181
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: cray-hms-smd
        app.kubernetes.io/name: cray-smd
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - istio-ingressgateway
                  - istio-ingressgateway-hmn
              namespaces:
              - istio-system
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - env:
        - name: SMD_DBHOST
          value: cray-smd-postgres
        - name: SMD_DBPORT
          value: "5432"
        - name: SMD_DBNAME
          value: hmsds
        - name: SMD_DBOPTS
        - name: SMD_DBUSER
          valueFrom:
            secretKeyRef:
              key: username
              name: hmsdsuser.cray-smd-postgres.credentials
        - name: SMD_DBPASS
          valueFrom:
            secretKeyRef:
              key: password
              name: hmsdsuser.cray-smd-postgres.credentials
        - name: RF_MSG_HOST
          value: cray-shared-kafka-kafka-bootstrap.services.svc.cluster.local:9092:cray-dmtf-resource-event
        - name: TLSCERT
        - name: TLSKEY
        - name: VAULT_ADDR
          value: http://cray-vault.vault:8200
        - name: VAULT_SKIP_VERIFY
          value: "true"
        - name: SMD_RVAULT
          value: "true"
        - name: SMD_WVAULT
          value: "true"
        - name: SMD_SLS_HOST
          value: http://cray-sls/v1
        - name: LOGLEVEL
          value: "2"
        - name: SMD_HWINVHIST_AGE_MAX_DAYS
          value: "365"
        - name: HMS_CONFIG_PATH
          value: /hms_config/hms_config.json
        - name: SMD_CA_URI
          valueFrom:
            configMapKeyRef:
              key: CA_URI
              name: smd-cacert-info
        - name: SMD_HBTD_HOST
          value: http://cray-hbtd/hmi/v1
        - name: GOMAXPROCS
          value: "8"
        - name: POSTGRES_HOST
          value: cray-smd-postgres
        - name: POSTGRES_PORT
          value: "5432"
        image: artifactory.algol60.net/csm-docker/stable/ghcr.io/openchami/smd:2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /hsm/v2/service/liveness
            port: 27779
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: cray-smd
        ports:
        - containerPort: 27779
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /hsm/v2/service/ready
            port: 27779
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          limits:
            cpu: "2"
            memory: 1Gi
          requests:
            cpu: 10m
            memory: 128Mi
        securityContext:
          runAsGroup: 65534
          runAsNonRoot: true
          runAsUser: 65534
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /hms_config/
          name: hms-config-vol
        - mountPath: /usr/local/cray-pki
          name: cray-pki-cacert-vol
      dnsPolicy: ClusterFirst
      priorityClassName: csm-high-priority-service
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: cray-configmap-ca-public-key
          optional: true
        name: cray-pki-cacert-vol
      - configMap:
          defaultMode: 420
          name: cray-hms-base-config
          optional: true
        name: hms-config-vol
status:
  conditions:
  - lastTransitionTime: "2024-11-07T20:29:21Z"
    lastUpdateTime: "2024-11-07T20:29:21Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2024-11-07T20:39:22Z"
    lastUpdateTime: "2024-11-07T20:39:22Z"
    message: ReplicaSet "cray-smd-5c6c9cf95c" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 1
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1
OK, it looks like the error above was caused by an issue with database initialization. The CSM Helm chart for SMD contains a job that initializes the database using the SMD image. The attempt to run this job with the OpenCHAMI SMD image failed, so the database remained uninitialized (the system table did not exist).
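For context, that job is shaped roughly like the sketch below. This is an assumption based on the chart behavior described later in this thread (entrypoint.sh plus a persistent migrations volume), not a copy of the actual cray-hms-smd manifest; names and omitted fields are illustrative.

# Hypothetical sketch of the CSM-style SMD init Job (names, claim, and omitted
# fields are assumptions, not the actual cray-hms-smd chart output). It runs
# the SMD image with the init entrypoint and mounts a persistent volume that
# entrypoint.sh populates with the SQL migration scripts before applying them.
apiVersion: batch/v1
kind: Job
metadata:
  name: cray-smd-init                            # assumed name
  namespace: services
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: smd-init
        image: artifactory.algol60.net/csm-docker/stable/ghcr.io/openchami/smd:2
        command: ["/entrypoint.sh", "smd-init"]  # this step failed with the OpenCHAMI image
        env:
        - name: SMD_DBHOST
          value: cray-smd-postgres
        - name: SMD_DBPORT
          value: "5432"
        - name: SMD_DBNAME
          value: hmsds
        volumeMounts:
        - mountPath: /persistent_migrations      # populated by entrypoint.sh at runtime
          name: smd-migrations
      volumes:
      - name: smd-migrations
        persistentVolumeClaim:
          claimName: cray-smd-init-migrations    # assumed claim name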
For now I decided to go a different route: instead of deploying the OpenCHAMI SMD image as part of a modified CSM distro, I created a regular CSM vShasta deployment (which initialized the database) and then re-deployed the SMD container (by issuing the kubectl edit deploy cray-hms-smd command). The OpenCHAMI SMD container started without an issue. Logs for the CSM and OpenCHAMI versions look different due to formatting. I have attached the pod logs.
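For reference, the edit amounts to swapping the container image in the pod template while leaving everything else in the deployment dumped above unchanged. A sketch of the relevant fragment (the previous CSM image value is omitted here); an equivalent one-liner would be kubectl -n services set image deployment/cray-smd cray-smd=ghcr.io/openchami/smd:2.

# Fragment of the deployment changed via kubectl edit (sketch; all other
# fields stay as shown in the dump above).
spec:
  template:
    spec:
      containers:
      - name: cray-smd
        image: ghcr.io/openchami/smd:2   # replaces the CSM-built SMD image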
One difference I noticed in the logs is the reaction to a missing TLS cert:
CSM:
2024/11/08 21:26:35.237349 tls.go:44: Cert or key path was the empty string
2024/11/08 21:26:35.237353 smd.go:921: Warning: TLS cert or key file missing, falling back to http
OpenCHAMI:
2024/11/08 21:08:59.921031 tls.go:47: generate Certs
2024/11/08 21:09:00.456981 tls.go:132: Failed to open /etc/cert.pem for writing: open /etc/cert.pem: permission denied!
2024/11/08 21:09:00.457023 tls.go:54: Error: Couldn't create https certs.
2024/11/08 21:09:00.457032 main.go:975: Warning: TLS cert or key file missing, falling back to http
I also ran the smoke / functional tests provided by the Cray HMS team. All tests passed for both the CSM and OpenCHAMI versions of the container. Test logs are also in the attached file, but there's not much information there: these are CT tests, run in pods that were deleted after test execution. Test definitions, as far as I could reverse-engineer the HMS testing suite, are here:
We have an smd-init container as well. It should function the same as the smd container
Sounds good, I'll look more into this on Monday. Unfortunately I didn't grab the initialization job logs before I re-created the system, so I don't know why it failed. By the way, it is not an init container; it's a separate job (a Helm pre-install hook or something like that, I suppose).
I think the difference in reporting for TLS is a non-issue. The reporting is better with the new logging library, but the functionality is the same.
You should review the way we handle SMD initialization through our Docker Compose file. There is a job that updates the database, as well as an smd container for execution. We may have changed the environment variables a little bit to keep them consistent across services.
The lack of failing tests is worrying, though. We've definitely disabled internal discovery services and internal Redfish listeners in SMD.
https://github.com/OpenCHAMI/deployment-recipes/blob/main/quickstart/openchami-svcs.yml#L2C1-L53C17
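For illustration, the pattern being described is roughly an init job that applies the schema migrations plus the long-running smd service. The sketch below is an assumption about the shape of that setup, not an excerpt from openchami-svcs.yml; service names, images, credentials, and variable names may differ.

# Hypothetical compose-style sketch of the init-job + service pattern
# (not copied from openchami-svcs.yml).
services:
  postgres:
    image: postgres:16                  # assumed version
    environment:
      - POSTGRES_USER=smd
      - POSTGRES_PASSWORD=example       # placeholder only
      - POSTGRES_DB=hmsds
  smd-init:
    image: ghcr.io/openchami/smd:2
    command: ["/smd-init"]              # applies schema migrations, then exits
    environment:
      - SMD_DBHOST=postgres             # variable names assumed
      - SMD_DBPORT=5432
      - SMD_DBNAME=hmsds
      - SMD_DBUSER=smd
      - SMD_DBPASS=example              # placeholder only
    depends_on:
      postgres:
        condition: service_started
  smd:
    image: ghcr.io/openchami/smd:2      # image default entrypoint runs the API server
    environment:
      - SMD_DBHOST=postgres
      - SMD_DBPORT=5432
      - SMD_DBNAME=hmsds
      - SMD_DBUSER=smd
      - SMD_DBPASS=example              # placeholder only
    depends_on:
      smd-init:
        condition: service_completed_successfully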
I see the difference in the initialization sequence. The CSM version of the SMD initialization job supports persistence: during container startup, SQL migration scripts are copied from the container filesystem to a mounted persistent volume. I suppose this allows idempotent application of migration scripts during an SMD upgrade.
To make database initialization successful, I've re-created the job with the following changes (a sketch of the re-created job follows at the end of this comment):
- Changed the command from /entrypoint.sh smd-init to /smd-init. The entrypoint.sh script copies the SQL migration scripts from migrations to /persistent_migrations at runtime: https://github.com/Cray-HPE/hms-smd/blob/master/entrypoint.sh#L35
- Removed the /persistent_migrations mount (as it would override the /persistent_migrations directory coming with the SMD container).
As for tests which pass, I suppose we will need SME help, similar to the BSS testing task.
Thanks for reviewing the persistent_migrations -> migrations change and succinctly describing the difference. It's a good breadcrumb for the SME to follow. As we move forward with HPE using OpenCHAMI as the upstream, I'd suggest that the OpenCHAMI method is correct and persistent_migrations is not needed, even in the case of an upgrade. I can't think of a case where having the migrations disconnected from the container is necessary.
This is a placeholder for @mtupitsyn to add any errors he finds while running SMD at HPE.