Open mattilyra opened 6 years ago
Yeah, that's a fair concern. The only docs we have now are here: http://distributed.readthedocs.io/en/latest/adaptive.html
Have you seen these docs yet and have issue with them, or were you unaware that they existed?
Sorry, should've mentioned. Yes, I did read through that, as well as the full blog post on using Marathon and a few blog posts from the Met Office Informatics Lab about daskernetes, but I couldn't find any details on the semantics of the return value from `scale_up`.
I don't think the semantics of scale_up are yet defined. Adaptive deployments have been around for a while, but I think that the community is still learning best practices around them.
When I said "the semantics" I meant the return values in particular, but the general theme holds true as well.
Right. So, for now, the best thing is to a) not block and b) return a future?
You shouldn't block. You can choose to return a future or not. You might want to look at the source code for `Adaptive._adapt`.
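For what it's worth, one way to satisfy "don't block, optionally return a future" is to hand the slow cloud call to a thread pool and return the resulting future immediately. A stdlib-only sketch of that pattern (the `_launch_instances` body is a placeholder standing in for a slow boto3 call, not real dask or AWS API):

```python
from concurrent.futures import ThreadPoolExecutor
import time

_executor = ThreadPoolExecutor(max_workers=4)

def _launch_instances(n):
    # placeholder for a slow cloud API call (e.g. ec2.run_instances via boto3)
    time.sleep(0.1)
    return [f"worker-{i}" for i in range(n)]

def scale_up(n):
    """Non-blocking: kick off the instance launch and return a future.

    The caller gets the future back immediately; the actual launch
    completes in the background on the thread pool.
    """
    return _executor.submit(_launch_instances, n)

future = scale_up(3)          # returns right away
workers = future.result()     # blocks only if you explicitly wait
```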
Also I'm curious, what are you planning to base this off of on AWS?
We've recently started recommending Kubernetes everywhere: http://dask.pydata.org/en/latest/setup/cloud.html. I'm curious why people might choose other technologies on AWS. I'm also curious whether you've considered Fargate.
Yes, I'm basing this on AWS. I looked at Kubernetes, Mesos, and Fargate/ECS, but couldn't get Kubernetes installed (tried conjure-up and a manual install).

I also don't need an always-on, 100%-available cluster, but basically a throwaway cluster that lives for a few hours per week. Fargate isn't available in my region (eu-west) and is based on ECS clusters. The ECS clusters are a total pain in the a** as everything is based on ECS tasks, so I can't just submit a bunch of work to dask and have it figure out how many and which kinds of nodes to launch; instead the nodes need to be pre-configured in the ECS task definitions. I also looked at docker-swarm, but that doesn't support auto-scaling.
So I've currently hacked together a `--preload` script with hooks into `adaptive` that then manages the nodes via `dask_ec2`/`boto3`.
Kubernetes, possibly together with daskernetes, looks really promising though, if only I already had a Kubernetes installation in place.
Amazon now supports a managed Kubernetes service: https://aws.amazon.com/eks/. I haven't used it myself, though, and don't know whether it's available in all regions.
Also, FYI, `dask_ec2` seems to be unmaintained.
Hmm, ok - yeah, none of the code really relies on `dask_ec2`; that project is basically just a thin wrapper on `boto3` anyhow.
Grand
See also #1796
I've been piecing together an auto-scaling dask cluster on AWS, using `adaptive` and bits from `dask_ec2`. It would be really useful to know what the semantics of the `scale_up` and `scale_down` calls are, especially whether `scale_up` should return immediately, block, return a future, or something else. It isn't clear from the examples (so I'm currently reading the sources).

My main concern here is what happens when long blocking calls are made inside `scale_up` or `scale_down` - for instance, bringing up new AWS instances? I'm guessing that the best way to approach this is to return a future by using `tornado.gen.coroutine`.
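The key point is keeping the scheduler's event loop free while the cloud call runs. A minimal sketch of that idea using `asyncio` (the modern equivalent of the `tornado.gen.coroutine` style); the function names here are hypothetical, not part of dask's API, and `_boto3_launch` merely stands in for a blocking boto3 call:

```python
import asyncio

def _boto3_launch(n):
    # stand-in for a blocking cloud call such as ec2.run_instances(...)
    return [f"i-{k:04d}" for k in range(n)]

async def scale_up(n):
    # Run the blocking launch on a worker thread so the event loop
    # (and therefore the scheduler) stays responsive while it runs.
    loop = asyncio.get_running_loop()
    instance_ids = await loop.run_in_executor(None, _boto3_launch, n)
    return instance_ids

ids = asyncio.run(scale_up(2))
```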