Open jacobtomlinson opened 2 years ago
Hello I am trying to simulate this on docker compose before going to K8S yaml. We have a lot of restrictions at work so we can't install the operator. The Istio mesh we are using enforces an authorization policy with specific ports for zero trust networking so I am hoping I can make this work. I have created a simulated "docker compose equivalent" to look at the logs.
The below docker compose works but when I look at the logs, the scheduler is showing the following.
distributed.core - INFO - Starting established connection to tcp://192.168.0.6:35998
That random port I am not sure if the scheduler is connecting directly to or that is happening locally inside the worker. I am not sure if I can specify that port or maybe that is the nanny port. I also saw that Dask can use websockets but I couldnt see any reference. I wish it used a message bus like Redis or NATS to do the networking.
version: "3.1"
services:
scheduler:
image: ghcr.io/dask/dask:latest
hostname: scheduler
ports:
- "8786:8786"
- "8787:8787"
command: ["dask-scheduler", "--dashboard-prefix=dask"]
worker:
image: ghcr.io/dask/dask:latest
command:
- dask
- worker
- tcp://scheduler:8786
- --contact-address=tcp://worker:9001
- --listen-address=tcp://0.0.0.0:9001
worker2:
image: ghcr.io/dask/dask:latest
command:
- dask
- worker
- tcp://scheduler:8786
- --contact-address=tcp://worker2:9002
- --listen-address=tcp://0.0.0.0:9002
notebook:
image: ghcr.io/dask/dask-notebook:latest
ports:
- "8888:8888"
environment:
- DASK_SCHEDULER_ADDRESS="tcp://scheduler:8786"
We have a lot of restrictions at work so we can't install the operator.
I was worried about this when we went down the operator road. Do you mean you can't install the operator because you don't have enough permission, or that your k8s admin team wont install the operator? Is this something we can resolve with better documentation for admin teams? Or some specific features in the controller? Given the direction the k8s community is going with operators I expect all k8s clusters must have at least a few operators installed (like the Istio one for example). How can we help with operator installation in high-friction environments?
The below docker compose works but when I look at the logs, the scheduler is showing the following.
distributed.core - INFO - Starting established connection to tcp://192.168.0.6:35998
That random port I am not sure if the scheduler is connecting directly to or that is happening locally inside the worker. I am not sure if I can specify that port or maybe that is the nanny port.
When running your compose file I'm not seeing that line in the logs. Are you doing something in the notebook to make that happen? I played around with being very explicit with ports and pinned down the nanny and dashboard too. This configuration does assume one worker per nanny though, which may not always be the case depending on your pod size. I also made the ports consistent on each worker given that they will all have different IPs.
version: "3.1"
services:
scheduler:
image: ghcr.io/dask/dask:latest
hostname: scheduler
ports:
- "8786:8786"
- "8787:8787"
command: ["dask-scheduler", "--dashboard-prefix=dask"]
worker:
image: ghcr.io/dask/dask:latest
command:
- dask
- worker
- tcp://scheduler:8786
- --contact-address=tcp://worker:9001
- --listen-address=tcp://0.0.0.0:9001
- --nanny-port=9002
- --dashboard-address=0.0.0.0:9003
worker2:
image: ghcr.io/dask/dask:latest
command:
- dask
- worker
- tcp://scheduler:8786
- --contact-address=tcp://worker2:9001
- --listen-address=tcp://0.0.0.0:9001
- --nanny-port=9002
- --dashboard-address=0.0.0.0:9003
notebook:
image: ghcr.io/dask/dask-notebook:latest
ports:
- "8888:8888"
environment:
- DASK_SCHEDULER_ADDRESS="tcp://scheduler:8786"
Do you need anything beyond this?
I also saw that Dask can use websockets but I couldnt see any reference.
You can switch to websockets by changing the protocol.
version: "3.1"
services:
scheduler:
image: ghcr.io/dask/dask:latest
hostname: scheduler
ports:
- "8786:8786"
- "8787:8787"
command: ["dask-scheduler", "--dashboard-prefix=dask", "--protocol=ws"]
worker:
image: ghcr.io/dask/dask:latest
command:
- dask
- worker
- ws://scheduler:8786
- --contact-address=ws://worker:9001
- --listen-address=ws://0.0.0.0:9001
- --nanny-port=9002
- --dashboard-address=0.0.0.0:9003
worker2:
image: ghcr.io/dask/dask:latest
command:
- dask
- worker
- ws://scheduler:8786
- --contact-address=ws://worker2:9001
- --listen-address=ws://0.0.0.0:9001
- --nanny-port=9002
- --dashboard-address=0.0.0.0:9003
notebook:
image: ghcr.io/dask/dask-notebook:latest
ports:
- "8888:8888"
environment:
- DASK_SCHEDULER_ADDRESS="ws://scheduler:8786"
From an Istio perspective I expect this means you could name ports with http-
instead of tcp-
, but that's about it.
@jacobtomlinson - the issue with the operator - I work for a large bank and their private cloud (Kubernetes) is used by thousands of engineers. The risk is that they install one operator that weakens security or could potentially render performance issues across the cluster. They prefer not installing operators in general and its very hard to get it past compliance. I managed to get the above working in our Kubernetes environment and I think potentially a good idea is to use Stateful sets like NATS does to add nodes to the NATS cluster - here is an example: https://github.com/dataplane-app/dataplane/blob/main/quick-start/kubernetes/nats-deployment.yaml
Sure that all makes sense. It sounds like you want the Dask Helm Chart rather than the Dask Operator. There is definitely room for improvement in that chart, using StatefulSet
s would be a nice change. It probably also doesn't play nicely with Istio but that could easily be fixed.
A follow up to #482 which was solved by disabling Istio sidecar injection on our pods in #496. It would be nice to get actual support working.
Tasks
"sidecar.istio.io/inject": "false",
label488 makes a start but isn't working yet.
Some context from #482: