dask / dask-kubernetes

Native Kubernetes integration for Dask
https://kubernetes.dask.org
BSD 3-Clause "New" or "Revised" License

Operator fails if nodeport out of allowed range #536

Open jacobtomlinson opened 2 years ago

jacobtomlinson commented 2 years ago

If a DaskCluster is configured with a NodePort service but the ports are out of the allowed range, the DaskCluster will still be created, but the controller will log errors repeatedly.

HTTP response headers: <CIMultiDictProxy('Audit-Id': '8949180b-0ce2-45f7-bf41-8974bb2d6dbf', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '8f735b53-3ddf-42b8-9a01-100a585f89dd', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'a1296d7b-f6e8-49d6-86b1-398564e988c4', 'Date': 'Wed, 20 Jul 2022 14:34:11 GMT', 'Content-Length': '546')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Service \"rapids-dask-cluster-service\" is invalid: spec.ports[0].nodePort: Invalid value: 38786: provided port is not in the valid range. The range of valid ports is 30000-32767","reason":"Invalid","details":{"name":"rapids-dask-cluster-service","kind":"Service","causes":[{"reason":"FieldValueInvalid","message":"Invalid value: 38786: provided port is not in the valid range. The range of valid ports is 30000-32767","field":"spec.ports[0].nodePort"}]},"code":422}

[2022-07-20 14:34:11,443] kopf.objects         [DEBUG   ] [default/rapids-dask-cluster] Patching with: {'metadata': {'annotations': {'kopf.zalando.org/daskcluster_create': '{"started":"2022-07-20T14:34:11.423731","delayed":"2022-07-20T14:35:11.443753","purpose":"create","retries":1,"success":false,"failure":false,"message":"(422)\\nReason: Unprocessable Entity\\nHTTP response headers: <CIMultiDictProxy(\'Audit-Id\': \'8949180b-0ce2-45f7-bf41-8974bb2d6dbf\', \'Cache-Control\': \'no-cache, private\', \'Content-Type\': \'application/json\', \'X-Kubernetes-Pf-Flowschema-Uid\': \'8f735b53-3ddf-42b8-9a01-100a585f89dd\', \'X-Kubernetes-Pf-Prioritylevel-Uid\': \'a1296d7b-f6e8-49d6-86b1-398564e988c4\', \'Date\': \'Wed, 20 Jul 2022 14:34:11 GMT\', \'Content-Length\': \'546\')>\\nHTTP response body: {\\"kind\\":\\"Status\\",\\"apiVersion\\":\\"v1\\",\\"metadata\\":{},\\"status\\":\\"Failure\\",\\"message\\":\\"Service \\\\\\"rapids-dask-cluster-service\\\\\\" is invalid: spec.ports[0].nodePort: Invalid value: 38786: provided port is not in the valid range. The range of valid ports is 30000-32767\\",\\"reason\\":\\"Invalid\\",\\"details\\":{\\"name\\":\\"rapids-dask-cluster-service\\",\\"kind\\":\\"Service\\",\\"causes\\":[{\\"reason\\":\\"FieldValueInvalid\\",\\"message\\":\\"Invalid value: 38786: provided port is not in the valid range. The range of valid ports is 30000-32767\\",\\"field\\":\\"spec.ports[0].nodePort\\"}]},\\"code\\":422}\\n\\n"}'}}, 'status': {'kopf': {'progress': {'daskcluster_create': {'started': '2022-07-20T14:34:11.423731', 'stopped': None, 'delayed': '2022-07-20T14:35:11.443753', 'purpose': 'create', 'retries': 1, 'success': False, 'failure': False, 'message': '(422)\nReason: Unprocessable Entity\nHTTP response headers: <CIMultiDictProxy(\'Audit-Id\': \'8949180b-0ce2-45f7-bf41-8974bb2d6dbf\', \'Cache-Control\': \'no-cache, private\', \'Content-Type\': \'application/json\', \'X-Kubernetes-Pf-Flowschema-Uid\': \'8f735b53-3ddf-42b8-9a01-100a585f89dd\', \'X-Kubernetes-Pf-Prioritylevel-Uid\': \'a1296d7b-f6e8-49d6-86b1-398564e988c4\', \'Date\': \'Wed, 20 Jul 2022 14:34:11 GMT\', \'Content-Length\': \'546\')>\nHTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Service \\"rapids-dask-cluster-service\\" is invalid: spec.ports[0].nodePort: Invalid value: 38786: provided port is not in the valid range. The range of valid ports is 30000-32767","reason":"Invalid","details":{"name":"rapids-dask-cluster-service","kind":"Service","causes":[{"reason":"FieldValueInvalid","message":"Invalid value: 38786: provided port is not in the valid range. The range of valid ports is 30000-32767","field":"spec.ports[0].nodePort"}]},"code":422}\n\n', 'subrefs': None}}}}}
[2022-07-20 14:34:11,448] kopf.objects         [WARNING ] [default/rapids-dask-cluster] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskcluster_create': {'started': '2022-07-20T14:34:11.423731', 'stopped': None, 'delayed': '2022-07-20T14:35:11.443753', 'purpose': 'create', 'retries': 1, 'success': False, 'failure': False, 'message': '(422)\nReason: Unprocessable Entity\nHTTP response headers: <CIMultiDictProxy(\'Audit-Id\': \'8949180b-0ce2-45f7-bf41-8974bb2d6dbf\', \'Cache-Control\': \'no-cache, private\', \'Content-Type\': \'application/json\', \'X-Kubernetes-Pf-Flowschema-Uid\': \'8f735b53-3ddf-42b8-9a01-100a585f89dd\', \'X-Kubernetes-Pf-Prioritylevel-Uid\': \'a1296d7b-f6e8-49d6-86b1-398564e988c4\', \'Date\': \'Wed, 20 Jul 2022 14:34:11 GMT\', \'Content-Length\': \'546\')>\nHTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Service \\"rapids-dask-cluster-service\\" is invalid: spec.ports[0].nodePort: Invalid value: 38786: provided port is not in the valid range. The range of valid ports is 30000-32767","reason":"Invalid","details":{"name":"rapids-dask-cluster-service","kind":"Service","causes":[{"reason":"FieldValueInvalid","message":"Invalid value: 38786: provided port is not in the valid range. The range of valid ports is 30000-32767","field":"spec.ports[0].nodePort"}]},"code":422}\n\n', 'subrefs': None}}}, None),)

We should do a little more input checking to ensure this can't happen. We should also have the controller put the DaskCluster into some kind of failure status while this is going on.
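For illustration, here is a minimal sketch of what that check might look like (assuming it runs early in the kopf create handler and that the scheduler service spec is reachable at `spec["scheduler"]["service"]`; the names and field paths are illustrative, not the controller's actual code):

```python
import kopf

# Kubernetes' default --service-node-port-range; clusters can override this,
# so it could be made configurable rather than hard-coded.
NODE_PORT_RANGE = range(30000, 32768)


def validate_node_ports(cluster_spec: dict) -> None:
    """Fail fast if the scheduler service requests an out-of-range nodePort."""
    service = cluster_spec.get("scheduler", {}).get("service", {})
    if service.get("type") != "NodePort":
        return
    for port in service.get("ports", []):
        node_port = port.get("nodePort")
        if node_port is not None and node_port not in NODE_PORT_RANGE:
            # PermanentError stops kopf from retrying forever and records the
            # failure on the DaskCluster instead of looping in the logs.
            raise kopf.PermanentError(
                f"nodePort {node_port} is outside the allowed range "
                f"{NODE_PORT_RANGE.start}-{NODE_PORT_RANGE.stop - 1}"
            )
```

Raising `kopf.PermanentError` (rather than letting the 422 bubble up) would stop the retry loop, and the handler could additionally patch something like `status.phase` to a failed state so the problem is visible from `kubectl get daskclusters`.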

jacobtomlinson commented 1 year ago

If you try creating the example but set nodePort to something impossible, the controller gets stuck in a retry loop.

https://kubernetes.dask.org/en/latest/operator_resources.html

https://kubernetes.io/docs/concepts/services-networking/service/#type-nodeport

We should add some validation here:

https://github.com/dask/dask-kubernetes/blob/b9d6ca5ff4ca0a21a4c45ceb571e38db789ef2f5/dask_kubernetes/operator/controller/controller.py#L83
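Since the operator is built on kopf, another option would be to reject the resource up front with a validating admission handler, so an invalid DaskCluster is never admitted at all. A rough sketch only: kopf's admission support needs a webhook server configured in the operator settings (not shown), and the resource name and field paths below are assumptions.

```python
import kopf

NODE_PORT_RANGE = range(30000, 32768)


@kopf.on.validate("daskclusters.kubernetes.dask.org")
def reject_invalid_node_ports(spec, **_):
    """Reject DaskClusters whose scheduler service asks for an out-of-range nodePort."""
    service = spec.get("scheduler", {}).get("service", {})
    if service.get("type") != "NodePort":
        return
    for port in service.get("ports", []):
        node_port = port.get("nodePort")
        if node_port is not None and node_port not in NODE_PORT_RANGE:
            # AdmissionError surfaces directly to the user at `kubectl apply` time.
            raise kopf.AdmissionError(
                f"nodePort {node_port} is not in the valid range "
                f"{NODE_PORT_RANGE.start}-{NODE_PORT_RANGE.stop - 1}"
            )
```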

skirui-source commented 1 year ago

I've tried creating a DaskCluster object with an out-of-range nodePort and I see this:

HTTP response headers: <CIMultiDictProxy('Audit-Id': '4c8966e6-156a-4584-8640-39a7164e9949', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '9f5dce77-f6df-4593-8079-a7ece243e04d', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'f5d11d69-50fc-4b73-8ba4-6e255c69420e', 'Date': 'Wed, 03 May 2023 05:32:24 GMT', 'Content-Length': '210')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"simple-scheduler\" already exists","reason":"AlreadyExists","details":{"name":"simple-scheduler","kind":"pods"},"code":409}
cluster.yaml

```yaml
# cluster.yaml
apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: test-cluster
  namespace: dask-operator
spec:
  worker:
    replicas: 2
    spec:
      containers:
        - name: worker
          image: "ghcr.io/dask/dask:latest"
          imagePullPolicy: "IfNotPresent"
          args:
            - dask-worker
            - --name
            - $(DASK_WORKER_NAME)
            - --dashboard
            - --dashboard-address
            - "8788"
          ports:
            - name: http-dashboard
              containerPort: 8788
              protocol: TCP
  scheduler:
    spec:
      containers:
        - name: scheduler
          image: "ghcr.io/dask/dask:latest"
          imagePullPolicy: "IfNotPresent"
          args:
            - dask-scheduler
          ports:
            - name: tcp-comm
              containerPort: 8786
              protocol: TCP
            - name: http-dashboard
              containerPort: 8787
              protocol: TCP
          readinessProbe:
            httpGet:
              port: http-dashboard
              path: /health
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              port: http-dashboard
              path: /health
            initialDelaySeconds: 15
            periodSeconds: 20
    service:
      type: NodePort
      selector:
        dask.org/cluster-name: simple-node-port
        dask.org/component: scheduler
      ports:
        - name: tcp-comm
          protocol: TCP
          port: 8786
          targetPort: "tcp-comm"
          nodePort: 38967  # 30007
```
jacobtomlinson commented 1 year ago

Looks like you have a Dask scheduler pod with a conflicting name hanging around, perhaps left over from a failed test run?