dask / dask-kubernetes

Native Kubernetes integration for Dask
https://kubernetes.dask.org
BSD 3-Clause "New" or "Revised" License

Readiness/Liveness probes do not accept integer port #778

Open mmourafiq opened 1 year ago

mmourafiq commented 1 year ago

Describe the issue:

Although the cluster specification suggests that the probe port is an int-or-string type, using an integer port in a probe raises an error. Here is an example based on the documentation, using 8786 in place of the named port http-dashboard. Basically, this:

              readinessProbe:
                httpGet:
                  port: http-dashboard
                  path: /health
                initialDelaySeconds: 5
                periodSeconds: 10
              livenessProbe:
                httpGet:
                  port: http-dashboard
                  path: /health
                initialDelaySeconds: 15
                periodSeconds: 20

is replaced with this:

              readinessProbe:
                httpGet:
                  port: 8786
                  path: /health
                initialDelaySeconds: 5
                periodSeconds: 10
              livenessProbe:
                httpGet:
                  port: 8786
                  path: /health
                initialDelaySeconds: 15
                periodSeconds: 20

If you check the type definition of the probes, e.g. the Python client definition at https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/V1HTTPGetAction.md, you will notice that the port is of type object and accepts a string or an integer. See also the Kubernetes docs: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-http-request

Full example:

apiVersion: kubernetes.dask.org/v1
kind: DaskJob
metadata:
  name: simple-job
  namespace: default
spec:
  job:
    spec:
      containers:
        - name: job
          image: "ghcr.io/dask/dask:latest"
          imagePullPolicy: "IfNotPresent"
          args:
            - python
            - -c
            - "from dask.distributed import Client; client = Client(); # Do some work..."

  cluster:
    spec:
      worker:
        replicas: 2
        spec:
          containers:
            - name: worker
              image: "ghcr.io/dask/dask:latest"
              imagePullPolicy: "IfNotPresent"
              args:
                - dask-worker
                - --name
                - $(DASK_WORKER_NAME)
                - --dashboard
                - --dashboard-address
                - "8788"
              ports:
                - name: http-dashboard
                  containerPort: 8788
                  protocol: TCP
              env:
                - name: WORKER_ENV
                  value: hello-world # We dont test the value, just the name
      scheduler:
        spec:
          containers:
            - name: scheduler
              image: "ghcr.io/dask/dask:latest"
              imagePullPolicy: "IfNotPresent"
              args:
                - dask-scheduler
              ports:
                - name: tcp-comm
                  containerPort: 8786
                  protocol: TCP
                - name: http-dashboard
                  containerPort: 8787
                  protocol: TCP
              readinessProbe:
                httpGet:
                  port: 8786
                  path: /health
                initialDelaySeconds: 5
                periodSeconds: 10
              livenessProbe:
                httpGet:
                  port: 8786
                  path: /health
                initialDelaySeconds: 15
                periodSeconds: 20
              env:
                - name: SCHEDULER_ENV
                  value: hello-world
        service:
          type: ClusterIP
          selector:
            dask.org/cluster-name: simple-job
            dask.org/component: scheduler
          ports:
            - name: tcp-comm
              protocol: TCP
              port: 8786
              targetPort: "tcp-comm"
            - name: http-dashboard
              protocol: TCP
              port: 8787
              targetPort: "http-dashboard"

Anything else we need to know?:

The error during the submission:

spec.cluster.spec.scheduler.spec.containers[0].readinessProbe.httpGet.port: Invalid value: "integer": spec.cluster.spec.scheduler.spec.containers[0].readinessProbe.httpGet.port in body must be of type string: "integer"

Environment:

jacobtomlinson commented 1 year ago

Thanks @mmourafiq. I'm not 100% sure where to look to resolve this because DaskJob.spec.cluster.spec.scheduler.spec is just an io.k8s.api.core.v1.PodSpec and should be validated exactly the same as any other Pod spec. We use k8s-crd-resolver to generate the CRDs from our templates.

I note that our CRD templates are referencing the Kubernetes 1.21.1 spec so perhaps bumping those to a more recent version would help?

https://github.com/dask/dask-kubernetes/blob/2c48b6e288e1ba523522ff015c05445f5dc3733a/dask_kubernetes/operator/customresources/templates.yaml#L48

mmourafiq commented 1 year ago

I see. I think the issue is in k8s-crd-resolver: the choice of intOrString from the machinery is probably not the correct one. I already tried port: "8786", but the problem is that Kubernetes automatically tries to resolve a port name when the value is a string, i.e. with the quotes it complains that the value does not start with a character.
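To illustrate why the quoted value fails, here is a minimal sketch of how an int-or-string port value is interpreted. It approximates the IANA_SVC_NAME rules that Kubernetes applies to port names, as far as I understand them. The function names `port_name_errors` and `classify_port` are hypothetical and not part of Kubernetes or dask-kubernetes.

```python
import re


def port_name_errors(name):
    """Return the reasons `name` would be rejected as a named port.

    Approximation of the IANA_SVC_NAME rules (an assumption, not the
    actual Kubernetes implementation).
    """
    errors = []
    if not 1 <= len(name) <= 15:
        errors.append("must be 1 to 15 characters long")
    if not re.fullmatch(r"[a-z0-9-]*", name):
        errors.append("must contain only a-z, 0-9 and '-'")
    if not re.search(r"[a-z]", name):
        errors.append("must contain at least one letter")
    if "--" in name:
        errors.append("must not contain consecutive hyphens")
    if name.startswith("-") or name.endswith("-"):
        errors.append("must not begin or end with a hyphen")
    return errors


def classify_port(value):
    """Classify a probe port value as 'number', 'name' or 'invalid'."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return "number" if 1 <= value <= 65535 else "invalid"
    if isinstance(value, str):
        # A string is always treated as a port name, never as a number.
        return "invalid" if port_name_errors(value) else "name"
    return "invalid"
```

Under these assumptions, the integer 8786 is a port number and "http-dashboard" is a port name, while the string "8786" is rejected as a name because it contains no letter.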

jacobtomlinson commented 1 year ago

We can patch things, and we already do in a couple of places. Maybe we should do that here? Do you know what type it should be instead of intOrString?

https://github.com/dask/dask-kubernetes/blob/main/dask_kubernetes/operator/customresources/daskcluster.patch.yaml

mmourafiq commented 1 year ago

Sorry for the late reply. I just checked the CRD generated by kubebuilder again, and intOrString is indeed the correct one. But the type needs to change from string to:

anyOf:
  - type: integer
  - type: string

Not sure if this is supported, but here's the full generated spec:

port:
  anyOf:
  - type: integer
  - type: string
  description: Name or number of the port to access
    on the container. Number must be in the range
    1 to 65535. Name must be an IANA_SVC_NAME.
  x-kubernetes-int-or-string: true
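As a rough sketch of how such a change could be applied across a whole generated schema, the snippet below walks a schema loaded as nested dicts and lists and rewrites every int-or-string field into the kubebuilder-style form shown above. It assumes the generated schema marks these fields with `format: int-or-string` (as the upstream Kubernetes OpenAPI definition does) or with `x-kubernetes-int-or-string: true`. Whether the k8s-crd-resolver output actually looks like that has not been verified here. The function names are hypothetical.

```python
import copy

INT_OR_STRING_FLAG = "x-kubernetes-int-or-string"
INT_OR_STRING_FORMAT = "int-or-string"


def fix_int_or_string(schema):
    """Return a patched copy of `schema`; the input is left untouched."""
    patched = copy.deepcopy(schema)
    _fix_in_place(patched)
    return patched


def _fix_in_place(node):
    """Recursively rewrite int-or-string fields inside `node`."""
    if isinstance(node, dict):
        is_int_or_string = (
            node.get("format") == INT_OR_STRING_FORMAT
            or node.get(INT_OR_STRING_FLAG) is True
        )
        if is_int_or_string:
            # Drop the single-type constraint that rejects integers.
            node.pop("type", None)
            if node.get("format") == INT_OR_STRING_FORMAT:
                node.pop("format")
            node["anyOf"] = [{"type": "integer"}, {"type": "string"}]
            node[INT_OR_STRING_FLAG] = True
        for value in node.values():
            _fix_in_place(value)
    elif isinstance(node, list):
        for item in node:
            _fix_in_place(item)
```

Calling `fix_int_or_string` on a schema loaded from the generated CRD YAML would leave plain string fields such as `path` unchanged and only touch fields carrying one of the two markers.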

Hope this helps.

P.S. I reworked the converter in our application to use the port name (string) instead of the port number (int), but other users could hit the same issue, and it can easily cost about an hour of debugging :)