coiled / feedback

A place to provide Coiled feedback

Upload local directory #153

Closed: mrocklin closed this issue 3 years ago

mrocklin commented 3 years ago

I spent a lot of today playing around with a new feature in Dask, using Coiled as part of the prototyping. Part of my development cycle was to build a local docker image with my changes, push it up, and then create a new cluster pointing to that image.

It would probably have been nicer to capture my entire local directory and ship it up to all of the machines in my cluster. In my case that would have been around 10 MB (especially if we automatically omitted the .git folder). We would then have had to send that 10 MB to each of the VMs after they spun up, but before they started any Dask code.

I suspect that an option like this would be pragmatic, but I might be over-focusing on my recent experience.

mrocklin commented 3 years ago

I started playing with this locally, curious whether I could solve it purely on the Dask side. We could, but only for the workers; we don't currently have a mechanism to restart the scheduler process.

My plan was to build an UploadDirectory plugin (much like UploadFile) that used the zipfile module. I found a Stack Overflow post on zipping files and directories. I was going to make a zip file locally, use upload_file to push it up, and then unpack it and call restart. One challenge with plugins that restart on setup is that they keep getting called, so we would probably want to write a small file to local disk recording the ID of the plugin we had already applied, check for that file, and skip the restart if it's present.

It may make sense to do this on the Dask side anyway. I suspect that most folks don't care about updating the scheduler as much as I do.
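
As a rough sketch of that plan, using distributed's WorkerPlugin hook (the class name, the marker-file scheme, and the elided restart step are illustrative assumptions, not the actual implementation):

```python
import io
import os
import zipfile

from distributed.diagnostics.plugin import WorkerPlugin


class UploadDirectorySketch(WorkerPlugin):
    """Zip a local directory on the client and unpack it into each worker's
    local directory (which is already on sys.path), using a marker file so
    a restart triggered from setup() doesn't re-trigger itself."""

    def __init__(self, path, skip=(".git",)):
        path = os.path.abspath(path)
        self.name = "upload-" + os.path.basename(path)
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
            for root, dirs, files in os.walk(path):
                dirs[:] = [d for d in dirs if d not in skip]  # omit .git, etc.
                for fname in files:
                    full = os.path.join(root, fname)
                    arcname = os.path.relpath(full, os.path.dirname(path))
                    zf.write(full, arcname)
        self.data = buf.getvalue()

    def setup(self, worker):
        marker = os.path.join(worker.local_directory, self.name + ".unpacked")
        if os.path.exists(marker):
            return  # already unpacked before a previous restart; don't loop
        with zipfile.ZipFile(io.BytesIO(self.data)) as zf:
            zf.extractall(worker.local_directory)
        open(marker, "w").close()
        # A real version would restart the worker process here so the new
        # code gets imported; that step is left out of this sketch.


# client.register_worker_plugin(UploadDirectorySketch("path/to/dask"))
```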

necaris commented 3 years ago

Forgive my ignorance, but what scenarios are there where we'd need the scheduler to have the same dataset as the workers?

I'd love to hear @ian-r-rose's thoughts on this, too.

mrocklin commented 3 years ago

Not dataset. Code.

In my situation I'm changing the scheduler code rapidly and want that code updated for each new cluster. I'm probably not the typical use case though. I think that doing this on the worker side makes the most sense. I think that I can implement all of this on the Dask side.

ian-r-rose commented 3 years ago

I agree that this can probably be done all on the dask side.

I'm curious about your use-case @mrocklin -- not only would you need to upload your working directory, but you'd also want to do it in such a way that upon restarting the scheduler/workers that directory is in the PYTHONPATH or similar. Or do I misunderstand what you mean by restarting?

mrocklin commented 3 years ago

That's correct. upload_file puts files into a directory that is on the PYTHONPATH; we do this for single scripts today. I now want to expand this to directories so that I can start pushing up all of dask/.
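
For reference, a minimal example of today's single-file workflow (the address and filename are placeholders):

```python
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address
client.upload_file("my_script.py")  # copied into a directory on each worker's sys.path
# Tasks submitted afterwards can `import my_script` on the workers.
```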

necaris commented 3 years ago

The quickest way I can think of to update that code for every new cluster would be to use a post-build command when creating the software environment that installs distributed from your fork on GitHub. But then you'd have to keep rebuilding the software environment, which takes a few extra seconds :-(
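
For completeness, that workflow might look roughly like the following; the environment name, fork URL, and branch are placeholders, and the exact coiled arguments may differ:

```python
import coiled

# Rebuild whenever the fork's branch changes; each new cluster then picks it up.
coiled.create_software_environment(
    name="distributed-dev",  # placeholder name
    pip=["git+https://github.com/your-fork/distributed.git@your-branch"],
)

cluster = coiled.Cluster(software="distributed-dev")
```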

mrocklin commented 3 years ago

I think that we're good here. I'll just have this done on the Dask side.

mrocklin commented 3 years ago

This will solve everyone's problem except for mine :)

mrocklin commented 3 years ago

Reported upstream https://github.com/dask/distributed/issues/5117