jacobtomlinson opened 2 years ago
I see the `distributed.yaml` configuration as a global configuration. I don't think we should encourage per-instance configuration.
If something is configurable by a CLI flag, I feel it should also be configurable via an environment variable. Happy to use something separate from `distributed.yaml` if you would prefer.
Do you have any suggestions on a solution here?
Click supports dynamic defaults for something like this; see https://click.palletsprojects.com/en/7.x/options/#dynamic-defaults-for-prompts
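For concreteness, here is a minimal sketch of Click's dynamic-default pattern applied to this discussion. The variable name `DASK_WORKER_LISTEN_ADDRESS` is one proposed later in this thread, not an existing Dask environment variable, and the fallback address is illustrative:

```python
import os

import click


@click.command()
@click.option(
    "--listen-address",
    # The default is a callable, so it is evaluated at invocation time,
    # picking up whatever the environment variable holds right then.
    default=lambda: os.environ.get("DASK_WORKER_LISTEN_ADDRESS", "tcp://0.0.0.0:0"),
    show_default="$DASK_WORKER_LISTEN_ADDRESS or tcp://0.0.0.0:0",
)
def worker(listen_address):
    click.echo(f"listening on {listen_address}")
```

The appeal of this approach is that the CLI flag still wins when given explicitly, while deployment tooling can steer the default purely through the environment.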
I think some of the CLI arguments make good configuration options while others are very instance-specific. E.g. I don't think we should configure a port using the config system, but it is perfectly fine to toggle the dashboard.
I believe my issue is that the Dask config system is designed to be global. It even parses a bunch of prefixes (like `~/.config/dask`, `/etc/dask/`, etc.), and I would use it for options I want to be true for the entire deployment (e.g. connection timeouts), not just for individual instances of the deployment (e.g. ports).
I think there are already some options in there which live in a grey area. If you feel differently about this topic, we can introduce more options; I don't want to block on this.
Side note: a secondary concern of mine is the inflation of options in `distributed.yaml` and how they are documented. I think this is not very user-friendly, and I typically push back on introducing more options if they are not absolutely necessary. Your proposed change would introduce a lot more, which makes me a bit nervous (options are also hard to remove again).
To address your secondary concern first: I totally agree. Dask is insanely configurable, and that can be a hindrance as much as it can be powerful.
I guess my intention here is that we have already made a bunch of things configurable by introducing them as CLI flags, but passing those requires a bunch of string manipulation. The `dask-spec` command helps somewhat when deployment is totally hidden from the user and handled in Python. However, in the Dask Operator users can also create clusters in YAML, and I can't expect them to include string blobs of JSON in their config files.
Given that these configurable things already exist, I would like to be able to set them via environment variables, as merging a user-provided dict with a default dict is much easier than the string manipulation needed to merge two CLI commands.
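To illustrate the dict-merge point: with environment variables, composing operator defaults with user overrides is a single expression. The variable names below are the hypothetical ones floated in this thread, and the values are made up:

```python
# Defaults the operator would set on every worker pod (illustrative values).
default_env = {
    "DASK_WORKER_NAME": "worker-0",
    "DASK_WORKER_LISTEN_ADDRESS": "tcp://0.0.0.0:0",
}

# Extra variables a user supplies in their cluster YAML.
user_env = {"DASK_WORKER_NAME": "gpu-worker-3"}

# Later keys win, so user-provided values override the defaults, and no
# CLI string needs to be parsed, spliced, or re-quoted.
merged_env = {**default_env, **user_env}
```

Doing the equivalent merge on two `dask-worker ...` command strings would mean tokenising both commands and reconciling repeated flags by hand.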
Keeping `distributed.yaml` for cluster-wide options makes a lot of sense. The Click approach of reading env vars within the Click decorators would be fine, but it does mean that we need a different namespace for the environment variables I need.
Currently I can set a few things, like the worker's maximum lifespan, with `DASK_DISTRIBUTED__WORKER__LIFETIME__DURATION`, which gets loaded by `dask.config` and merged with `distributed.yaml`, so I don't want to collide with those. I can also set `DASK_SCHEDULER_ADDRESS`, which is a legacy env var that is special-cased and also maps onto `distributed.yaml`.
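As a rough sketch of that namespacing (this is not the real `dask.config` implementation, which also parses values and normalises dashes): a `DASK_`-prefixed variable is lowercased and split on double underscores to get the nested config path.

```python
def env_to_config_key(name, prefix="DASK_"):
    """Hypothetical helper: map an env var name onto a nested config path.

    e.g. DASK_DISTRIBUTED__WORKER__LIFETIME__DURATION corresponds to the
    distributed.worker.lifetime.duration key in distributed.yaml.
    """
    return name[len(prefix):].lower().split("__")
```

Any new single-underscore names like `DASK_WORKER_NAME` would have to stay clear of this scheme, which is the collision concern above.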
Perhaps the per-instance options I need to configure for the worker, like `--name`, `--listen-address` and `--contact-address`, should use a scheme similar to the scheduler address, like `DASK_WORKER_NAME`, `DASK_WORKER_LISTEN_ADDRESS` and `DASK_WORKER_CONTACT_ADDRESS`.
My only worry is that introducing another config option scheme increases complexity and could confuse users, which is why I suggested reusing the existing config system initially.
Do you have thoughts on a way forward?
I'm totally fine with worker options being in config. We already do this sporadically throughout the worker. I'd love to see more of it personally.
> I see the `distributed.yaml` configuration as a global configuration. I don't think we should encourage per-instance configuration
Eh, often people bundle a config file along with every individual process that we run. "Global" here really just means "global to a filesystem", and in today's world of containers that's pretty much "global to a single instance".
It seems that many of the `dask-worker` CLI flags and `Worker`/`Nanny` kwargs are not configurable via the Dask config system. In the work we are doing on the Dask Kubernetes Operator, it would be really useful for more things to be configurable via environment variables. Setting things like the worker name, listen address and contact address would be particularly useful.
I intend to make a PR to add more of these options to the `distributed.yaml` config.
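Purely as a hypothetical sketch of what such additions might look like in `distributed.yaml` (none of these keys exist today; the names and nesting are illustrative only, echoing the flags mentioned above):

```yaml
distributed:
  worker:
    # Hypothetical per-instance options; defaults deferred to the CLI.
    name: null
    listen-address: null
    contact-address: null
```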