dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Adaptive docs could do with a bit more details #1789

Open mattilyra opened 6 years ago

mattilyra commented 6 years ago

I've been piecing together an auto-scaling dask cluster on AWS, using adaptive and bits from dask_ec2. It would be really useful to know what the semantics of the scale_up and scale_down calls are - especially whether scale_up should return immediately, block, return a future, or something else. It isn't clear from the examples (so I'm currently reading the sources).

My main concern here is what happens when long blocking calls are made inside scale_up or scale_down - for instance, bringing up new AWS instances.

I'm guessing that the best way to approach this is to return a future by using tornado.gen.coroutine.
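For concreteness, here is a minimal sketch of the coroutine-based approach guessed at above - an assumption on my part, not a documented pattern. The MyCluster class is hypothetical; the point is only that decorating the scaling hooks with tornado.gen.coroutine makes calling them return a tornado Future rather than block.

```python
# Hypothetical sketch: scaling hooks written as tornado coroutines, so that
# calling scale_up returns a tornado Future instead of blocking the caller.
from tornado import gen


class MyCluster:
    @gen.coroutine
    def scale_up(self, n, **kwargs):
        # Placeholder for the real work (e.g. starting n EC2 instances with
        # an async client); yielding keeps the event loop unblocked.
        yield gen.sleep(0)

    @gen.coroutine
    def scale_down(self, workers):
        # `workers` would be the addresses of the workers to retire.
        yield gen.sleep(0)
```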

mrocklin commented 6 years ago

Yeah, that's a fair concern. The only docs we have now are here: http://distributed.readthedocs.io/en/latest/adaptive.html

Have you already seen these docs and found them lacking, or were you unaware that they existed?

mattilyra commented 6 years ago

Sorry, I should've mentioned. Yes, I did read through that, as well as the full blog post on using Marathon and a few blog posts from the Met Office Informatics Lab about daskernetes, but I couldn't find any details on the semantics of the return type from scale_up.

mrocklin commented 6 years ago

I don't think the semantics of scale_up are yet defined. Adaptive deployments have been around for a while, but I think that the community is still learning best practices around them.

mrocklin commented 6 years ago

When I said "the semantics" I meant the return values in particular, but the general theme holds true as well.

mattilyra commented 6 years ago

Right. So, for now, the best thing is to a) not block and b) return a future?

mrocklin commented 6 years ago

You shouldn't block. You can choose to return a future or not. You might want to look at the source code for Adaptive._adapt:

https://github.com/dask/distributed/blob/6f317039d1067fdcbf1d013dc2ea87aad700871e/distributed/deploy/adaptive.py#L210-L231
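To make that concrete, here is a minimal sketch of scaling hooks that don't block: the slow boto3 calls are pushed onto a thread pool and a concurrent.futures Future is returned immediately, which the caller is free to ignore. The EC2Cluster class, the AMI, the instance type, and the worker-address-to-instance-id mapping are all assumptions for illustration, not part of distributed's API.

```python
# Hypothetical sketch: keep scale_up/scale_down non-blocking by running the
# slow EC2 calls on a thread pool and returning a Future right away.
from concurrent.futures import ThreadPoolExecutor

import boto3


class EC2Cluster:
    """Hypothetical cluster object; not part of distributed's API."""

    def __init__(self):
        self._ec2 = boto3.client("ec2")
        self._executor = ThreadPoolExecutor(max_workers=4)
        # worker address -> EC2 instance id, filled in as instances register
        self.instance_ids = {}

    def scale_up(self, n, **kwargs):
        # Return immediately; the blocking run_instances call runs on a thread.
        return self._executor.submit(self._launch_instances, n)

    def scale_down(self, workers):
        # `workers` are Dask worker addresses; look up their instance ids and
        # terminate them off the event loop as well.
        ids = [self.instance_ids[w] for w in workers if w in self.instance_ids]
        return self._executor.submit(
            self._ec2.terminate_instances, InstanceIds=ids
        )

    def _launch_instances(self, n):
        self._ec2.run_instances(
            ImageId="ami-00000000",    # placeholder AMI
            InstanceType="m5.large",   # placeholder instance type
            MinCount=n,
            MaxCount=n,
        )
```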

mrocklin commented 6 years ago

Also, I'm curious: what are you planning to base this on within AWS?

We've recently started recommending Kubernetes everywhere: http://dask.pydata.org/en/latest/setup/cloud.html. I'm curious about the reasons people might choose other technologies on AWS. I'm also curious whether you've considered Fargate.

mattilyra commented 6 years ago

Yes, I'm basing this on AWS. I looked at Kubernetes, Mesos, and Fargate/ECS, but couldn't get Kubernetes installed (I tried conjure-up and a manual install).

I also don't need an always-on, 100%-available cluster, just a throwaway cluster that lives for a few hours per week. Fargate isn't available in my region (eu-west) and is based on ECS clusters. The ECS clusters are a total pain in the a** because everything is based on ECS tasks, so I can't just submit a bunch of work to dask and have it figure out how many and which kinds of nodes to launch; instead, the nodes need to be pre-configured in the ECS task definitions.

I also looked at docker-swarm, but that doesn't support auto-scaling.

So I've currently hacked together a --preload script with hooks into adaptive that then manages the nodes via dask_ec2/boto3.
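As a rough illustration (not the actual script), a scheduler preload file could look something like the sketch below. It assumes a cluster object exposing scale_up/scale_down, such as the hypothetical EC2Cluster sketched earlier in this thread, and note that the Adaptive constructor signature has changed across distributed versions.

```python
# Hypothetical sketch of a preload file, e.g.
#   dask-scheduler --preload adapt_ec2.py
# It wires Adaptive up to a cluster object exposing scale_up/scale_down.
from distributed.deploy.adaptive import Adaptive


def dask_setup(scheduler):
    # EC2Cluster: the hypothetical boto3-backed cluster object sketched above,
    # assumed to be defined in (or importable from) this module.
    cluster = EC2Cluster()
    # Older distributed releases took (scheduler, cluster); check your version.
    scheduler._custom_adaptive = Adaptive(scheduler, cluster)  # keep a reference
```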

Kubernetes, possibly together with daskernetes, looks really promising though, if only I already had a Kubernetes installation in place.

mrocklin commented 6 years ago

Amazon now supports a managed kubernetes service: https://aws.amazon.com/eks/

I haven't used it myself though and don't know if it is deployed in all regions.

Also, FYI, dask_ec2 seems to be unmaintained

mattilyra commented 6 years ago

Hmm, ok - yeah, none of the code really relies on dask_ec2; that project is basically just a thin wrapper on boto3 anyhow.

mrocklin commented 6 years ago

Grand

mrocklin commented 6 years ago

See also #1796