galexrt / dellhw_exporter

Prometheus exporter for Dell Hardware components using Dell OMSA.
https://dellhw-exporter.galexrt.moe
Apache License 2.0

`DaemonSet.apps "dellhw-exporter" is invalid: spec.template.spec.containers[0].name: Invalid value: "dellhw_exporter"` #131

Closed eugene-marchanka closed 2 months ago

eugene-marchanka commented 2 months ago

I cannot deploy the DaemonSet to Kubernetes v1.26.1.

I think the issue is here: https://github.com/galexrt/dellhw_exporter/blob/7de162ee95cebebe9368aaa55aa9690776c8d483/charts/dellhw_exporter/templates/daemonset.yaml#L31

Error output:

DaemonSet.apps "dellhw-exporter" is invalid: spec.template.spec.containers[0].name: Invalid value: "dellhw_exporter": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')
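To make the failure concrete, here is a minimal sketch that checks candidate container names against the regex quoted in the error message. The function names `is_rfc1123_label` and `sanitize_name` are hypothetical helpers written for illustration, not part of the chart or of Kubernetes, and the 63-character cap is the standard DNS label limit rather than something stated in the error output.

```python
import re

# Regex copied verbatim from the Kubernetes validation error above.
RFC1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")

# DNS labels are limited to 63 characters (assumption: not shown in the error text).
MAX_LABEL_LENGTH = 63


def is_rfc1123_label(name: str) -> bool:
    """Return True if the whole name matches the lowercase RFC 1123 label pattern."""
    if len(name) > MAX_LABEL_LENGTH:
        return False
    return RFC1123_LABEL.fullmatch(name) is not None


def sanitize_name(name: str) -> str:
    """Hypothetical helper: lowercase the name and swap underscores for hyphens."""
    return name.lower().replace("_", "-")


if __name__ == "__main__":
    for candidate in ("dellhw_exporter", "dellhw-exporter"):
        print(candidate, is_rfc1123_label(candidate))
```

The underscore in `dellhw_exporter` is what fails the check, which is why the DaemonSet name `dellhw-exporter` is accepted while the container name is rejected.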

My values.yaml file:

# Default values for dellhw_exporter.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

image:
  # -- Image repository
  repository: quay.io/galexrt/dellhw_exporter
  # -- Override the `imagePullPolicy`
  pullPolicy: IfNotPresent
  # -- Overrides the image tag whose default is the chart appVersion.
  tag: ""

# -- ImagePullSecrets to add to the DaemonSet
imagePullSecrets: []

nameOverride: "dellhw-exporter"
fullnameOverride: "dellhw-exporter"

serviceAccount:
  # -- Specifies whether a service account should be created
  create: true
  # -- Annotations to add to the service account
  annotations: {}
  # -- The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template.
  name: ""

# -- Annotations to add to the Pods created by the DaemonSet
podAnnotations: {}
# -- Additional labels to add to the Pods created by the DaemonSet
podLabels: {}

# -- Kubernetes PodSecurityContext for the Pods
podSecurityContext: {}
# fsGroup: 2000

# -- SecurityContext for the container
securityContext:
  privileged: true
  # capabilities:
  #   drop:
  #   - ALL
  # readOnlyRootFilesystem: true
  # runAsNonRoot: true
  # runAsUser: 1000

psp:
  # -- Specifies whether a PodSecurityPolicy (PSP) should be created
  create: false
  # -- PodSecurityPolicy spec
  spec:
    privileged: true
    allowedHostPaths: []
    volumes:
    - secret

service:
  type: ClusterIP
  port: 9137

resources:
  # We usually recommend not to specify default resources and to leave this as a conscious
  # choice for the user. This also increases chances charts run on environments with little
  # resources, such as Minikube. If you do want to specify resources, uncomment the following
  # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 128Mi

# -- NodeSelector for the DaemonSet
nodeSelector: {}

# -- Tolerations for the DaemonSet
tolerations: []

# -- Affinity for the DaemonSet
affinity: {}

serviceMonitor:
  # -- Specifies whether a prometheus-operator ServiceMonitor should be created
  enabled: true
  # -- Additional Labels for the ServiceMonitor object
  additionalLabels: {}
  namespace: "monitoring"
  namespaceSelector:
  # Default: scrape .Release.Namespace only
  # To scrape all, use the following:
  #  matchNames:
  #    - monitoring
  #   any: true
  scrapeInterval: 30s
  # honorLabels: true

prometheusRule:
  # -- Specifies whether a prometheus-operator PrometheusRule should be created
  enabled: true
  # -- Additional Labels for the PrometheusRule object
  additionalLabels: {}
  # Default: .Release.Namespace
  # namespace: ""
  # prometheusRule.rules -- Checkout the `/contrib/prometheus-alerts/prometheus-alerts.yml` file for example alerts
  rules:
  - alert: DellHardwarePowerSupplyFailure
    annotations:
      message: "Power supply {{ $labels.id }} status has failed for node {{ $labels.instance }}"
    expr: |
      dell_hw_ps_status > 0
    for: 1m
    labels:
      severity: critical
  - alert: DellHardwareDiskFailure
    annotations:
      message: "Physical Disk {{ $labels.disk }} on controller {{ $labels.controller_name }} status has failed for node {{ $labels.instance }}"
    expr: |
      dell_hw_storage_pdisk_status > 0
    for: 1m
    labels:
      severity: critical
  - alert: DellHardwareMemoryFailure
    annotations:
      message: "Memory {{ $labels.memory }} status has failed for node {{ $labels.instance }}"
    expr: |
      dell_hw_chassis_memory_status > 0
    for: 1m
    labels:
      severity: critical
  - alert: DellHardwareFanFailure
    annotations:
      message: "Fan {{ $labels.fan }} status has failed for node {{ $labels.instance }}"
    expr: |
      dell_hw_chassis_fan_status > 0
    for: 1m
    labels:
      severity: critical
  - alert: DellHardwareProcessorFailure
    annotations:
      message: "Processor {{ $labels.processor }} status has failed for node {{ $labels.instance }}"
    expr: |
      dell_hw_chassis_processor_status > 0
    for: 1m
    labels:
      severity: critical
  - alert: DellHardwareCPUTempFailure
    annotations:
      message: "Temperature {{ $labels.component }} is above threshold of 102C for node {{ $labels.instance }}"
    expr: |
      dell_hw_chassis_temps_reading{component=~"CPU.*"} > 102
    for: 1m
    labels:
      severity: critical
  - alert: DellHardwareGPUTempFailure
    annotations:
      message: "Temperature {{ $labels.component }} is above threshold of 102C on node {{ $labels.instance }}"
    expr: |
      dell_hw_chassis_temps_reading{component=~"System_Board_GPU.*"} > 102
    for: 1m
    labels:
      severity: critical
  - alert: DellHardwareSystemBoardExhaustTempFailure
    annotations:
      message: "Temperature {{ $labels.component }} is above threshold of 70C on node {{ $labels.instance }}"
    expr: |
      dell_hw_chassis_temps_reading{component="System_Board_Exhaust_Temp"} > 70
    for: 1m
    labels:
      severity: critical
  - alert: DellHardwareSystemBoardInletTempFailure
    annotations:
      message: "Temperature {{ $labels.component }} is above threshold of 45C on node {{ $labels.instance }}"
    expr: |
      dell_hw_chassis_temps_reading{component="System_Board_Inlet_Temp"} > 45
    for: 1m
    labels:
      severity: critical

eugene-marchanka commented 2 months ago

https://github.com/galexrt/dellhw_exporter/pull/132

eugene-marchanka commented 2 months ago

fixed ^