dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Add worker config options to config system #6375

Open jacobtomlinson opened 2 years ago

jacobtomlinson commented 2 years ago

It seems that many of the dask-worker CLI flags and Worker/Nanny kwargs are not configurable via the Dask config system.

In the work we are doing on the Dask Kubernetes Operator it would be really useful for more things to be configurable via environment variables. Setting things like the worker name, listen address and contact address would be particularly useful.

I intend to make a PR to add more of these options to the distributed.yaml config.
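For illustration only, the sort of additions I have in mind might look something like this in distributed.yaml (these key names are hypothetical, not a final proposal):

```yaml
distributed:
  worker:
    name: null             # hypothetical: worker name (--name)
    listen-address: null   # hypothetical: address the worker listens on (--listen-address)
    contact-address: null  # hypothetical: address advertised to the scheduler (--contact-address)
```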

fjetter commented 2 years ago

I see the distributed.yaml configuration as a global configuration. I don't think we should encourage per-instance configuration.

jacobtomlinson commented 2 years ago

If something is configurable by a CLI flag I feel it should also be configurable via an environment variable. Happy to use something separate to distributed.yaml if you would prefer.

Do you have any suggestions on a solution here?

fjetter commented 2 years ago

Click proposes dynamic defaults for something like this, see https://click.palletsprojects.com/en/7.x/options/#dynamic-defaults-for-prompts
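The essence of click's dynamic-default pattern is computing the default at invocation time rather than import time, so it can come from the environment or a config file. A stdlib-only sketch of the idea (the env var name and function names here are hypothetical, not dask's actual API):

```python
import os


def dashboard_default():
    # Resolved at call time rather than import time, so a value exported
    # just before launching the worker is picked up.
    # DASK_WORKER_DASHBOARD is a hypothetical variable for illustration.
    return os.environ.get("DASK_WORKER_DASHBOARD", "true").lower() == "true"


def worker_options(dashboard=None):
    # Mimics click's dynamic default: when no explicit value is given
    # (e.g. no CLI flag), fall back to the callable default.
    if dashboard is None:
        dashboard = dashboard_default()
    return {"dashboard": dashboard}
```

In real click code this would be `@click.option("--dashboard/--no-dashboard", default=dashboard_default)`; the point is only that the default is a callable evaluated late.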

I think some of the CLI arguments make good configuration options while others are very instance specific. E.g. I don't think we should configure a port using the config system but it is perfectly fine to toggle the dashboard.

I believe my issue is that the dask config system is designed to be global. It even searches a bunch of system-wide locations (~/.config/dask/, /etc/dask/, etc.) and I would use it for options I want to be true for the entire deployment (e.g. connection timeouts), not for individual instances within the deployment (e.g. ports). I think there are already some options in there which live in a grey area. If you feel differently about this topic, we can introduce more options; I don't want to block on this.

Side note: a secondary concern of mine is the inflation of options in distributed.yaml and how they are documented. I think this is not very user friendly, and I typically push back on introducing more options unless absolutely necessary. Your proposed change would introduce a lot more, which makes me a bit nervous (options are also hard to remove once added).

jacobtomlinson commented 2 years ago

To address your secondary concern first: I totally agree. Dask is insanely configurable, and that can be a hindrance as much as a strength.

I guess my intention here is that we have already made a bunch of things configurable by introducing them as CLI flags. But passing those requires a bunch of string manipulation. The dask-spec command helps somewhat when deployment is totally obscured from the user and handled in Python. However, in the Dask Operator users can also create clusters in YAML and I can't expect them to include string blobs of JSON in their config files.

Given that these configurable things already exist I would like to be able to set them via environment variables as merging a user provided dict with a default dict is much easier than string manipulation to merge two CLI commands.
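To make the comparison concrete, the dict merge I have in mind is just a small recursive function (a sketch, not dask's actual implementation):

```python
def merge_config(defaults, overrides):
    """Recursively merge ``overrides`` into ``defaults``, returning a new dict.

    Neither input is mutated; nested dicts are merged key by key, and any
    non-dict value in ``overrides`` wins outright.
    """
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged
```

Compare that with splicing a user-supplied flag into an existing `dask-worker ...` command string, where you have to parse, deduplicate, and re-quote arguments.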

Keeping distributed.yaml for cluster-wide options makes a lot of sense. The click approach of reading env vars within the click decorators would be fine, but it does mean that we need a different namespace for the environment variables I need.

Currently I can set a few things like the worker's maximum lifetime with DASK_DISTRIBUTED__WORKER__LIFETIME__DURATION, which gets loaded by dask.config and merged with distributed.yaml, so I don't want to collide with those. I can also set DASK_SCHEDULER_ADDRESS, a legacy env var that is special-cased and also maps onto distributed.yaml.
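Roughly speaking, that env-var convention strips the DASK_ prefix and splits on double underscores to build nested config keys. A simplified sketch of the mapping (the real dask.config does more, e.g. parsing values into Python types):

```python
import os


def collect_env_config(environ=None):
    """Sketch: map DASK_-prefixed env vars onto nested config keys.

    For example, DASK_DISTRIBUTED__WORKER__LIFETIME__DURATION=1h becomes
    {"distributed": {"worker": {"lifetime": {"duration": "1h"}}}}.
    """
    environ = os.environ if environ is None else environ
    config = {}
    for name, value in environ.items():
        if not name.startswith("DASK_"):
            continue
        keys = name[len("DASK_"):].lower().split("__")
        node = config
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value
    return config
```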

Perhaps the per-instance options I need to configure for the worker, like --name, --listen-address and --contact-address, should use a scheme similar to the scheduler address: DASK_WORKER_NAME, DASK_WORKER_LISTEN_ADDRESS and DASK_WORKER_CONTACT_ADDRESS.
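Under that scheme, mapping the env vars onto worker keyword arguments would be trivial. A hypothetical helper (the variable names are the proposal above; nothing here exists in distributed today):

```python
import os


def worker_kwargs_from_env(environ=None):
    # Hypothetical helper: translate proposed per-instance env vars into
    # dask-worker keyword arguments, skipping any that are unset so the
    # worker's own defaults still apply.
    environ = os.environ if environ is None else environ
    mapping = {
        "DASK_WORKER_NAME": "name",
        "DASK_WORKER_LISTEN_ADDRESS": "listen_address",
        "DASK_WORKER_CONTACT_ADDRESS": "contact_address",
    }
    return {kwarg: environ[var] for var, kwarg in mapping.items() if var in environ}
```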

My only worry is that introducing another config option scheme increases complexity and could confuse users, which is why I suggested reusing the existing config system initially.

Do you have thoughts on a way forward?

mrocklin commented 2 years ago

I'm totally fine with worker options being in config. We already do this a bunch sporadically throughout the worker. I'd love to see more of it personally.

> I see the distributed.yaml configuration as a global configuration. I don't think we should encourage per-instance configuration

Eh, often people bundle a config file along with every individual process that we run. "Global" here really just means "global to a filesystem" and in today's world of containers that's pretty much "global to a single instance"