Netflix / metaflow

Open Source Platform for developing, scaling and deploying serious ML, AI, and data science systems
https://metaflow.org
Apache License 2.0
8.23k stars 773 forks

Support for Slurm? #29

Open dgrahn opened 4 years ago

dgrahn commented 4 years ago

All,

I'm working on setting up a new DSS-8440 and am evaluating different management options. It appears that Slurm is best for job scheduling. Does metaflow support or have any integration with Slurm? Alternatively, are there any tips for handling machines like this?

Thanks!

savingoyal commented 4 years ago

@dgrahn We currently don't integrate with Slurm, but it's an interesting idea. @romain-intel Do you have any suggestions for Slurm alternatives for HPC workloads?

romain-intel commented 4 years ago

It's been a while, and I worked with Torque at the time, but Slurm is definitely something we can consider. I suppose that if you are mentioning Slurm, you intend to set up multiple DSS-8440s in a cluster and would like to evaluate whether Metaflow can help with running large workloads on such a cluster. It could work, but I am not sure it is quite ready for prime time. You would still need to set up quite a few things to make sure that it all works properly (Slurm, isolation of the accelerators if you wanted to share them among multiple flows, etc.). Metaflow would also not necessarily take advantage of some of Slurm's benefits (requesting multiple nodes at the same time, for example), since Metaflow "communicates" through a central location (S3 in the case of the AWS integration, but you could imagine a shared file system in a cluster-like setup).
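To make the "central location" idea concrete, here is a minimal sketch of steps exchanging results through a shared directory instead of talking to each other directly. It is not Metaflow's datastore: `save_artifact`, `load_artifact`, and the directory layout are hypothetical names chosen for illustration, and a temporary directory stands in for the shared file system or S3 bucket.

```python
import pickle
import tempfile
from pathlib import Path


def save_artifact(root, step, name, value):
    """Write one step's output under the shared root (hypothetical layout)."""
    path = Path(root) / step / f"{name}.pkl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(value, f)
    return path


def load_artifact(root, step, name):
    """Read an output that an earlier step left under the shared root."""
    with open(Path(root) / step / f"{name}.pkl", "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    # A temporary directory stands in for the shared file system or S3 bucket.
    with tempfile.TemporaryDirectory() as shared_root:
        save_artifact(shared_root, "train", "accuracy", 0.93)
        print(load_artifact(shared_root, "train", "accuracy"))
```

Because every step only reads from and writes to the shared root, steps never need to be scheduled together, which is why a multi-node allocation would add little in this model.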

On a single DSS-8440, you could use Metaflow without Slurm as well and make use of the multiple accelerators that way (since Metaflow would launch multiple processes on the same machine).
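As a rough illustration of running several processes on one machine with one accelerator each, the sketch below starts one worker per device and restricts it with `CUDA_VISIBLE_DEVICES`. This is plain Python, not Metaflow's own launcher; it assumes NVIDIA accelerators, and `launch_workers`, `WORKER_CODE`, and the device count of 4 are hypothetical choices for the example.

```python
import os
import subprocess
import sys

# Hypothetical worker: a real task would train a model; this one only reports
# which accelerator it has been allowed to see.
WORKER_CODE = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"


def launch_workers(num_devices):
    """Start one process per accelerator index and return what each one saw."""
    procs = []
    for device in range(num_devices):
        # Copy the parent environment and restrict the child to one device.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(device))
        procs.append(
            subprocess.Popen(
                [sys.executable, "-c", WORKER_CODE],
                env=env,
                stdout=subprocess.PIPE,
                text=True,
            )
        )
    # All workers are already running concurrently; collect output in order.
    return [proc.communicate()[0].strip() for proc in procs]


if __name__ == "__main__":
    print(launch_workers(4))  # assumed device count, adjust to the machine
```

Setting the variable per process is also one simple way to keep workers from contending for the same device, although it does not prevent other users on the machine from selecting it.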

I am not sure of your exact use case but am happy to discuss it a little more. It's been a while since I worked in HPC, but I can hopefully still have a somewhat coherent conversation :).

dgrahn commented 4 years ago

@romain-intel It'll be one DSS-8440 and a few lower-powered GPU machines that were previously available. It sounds like Metaflow might not be the right technology for that use case at this point in time.

romain-intel commented 4 years ago

@dgrahn: It could work, but it would definitely take some legwork :). If there is interest in working on a Slurm plugin (the first step toward making this compatible with Slurm), we can definitely help, but we would probably need some support from the community to test and whatnot.

Extremys commented 8 months ago

Any progress? :)

savingoyal commented 2 months ago

Support for SLURM is now available in Metaflow. Please reach out on chat.metaflow.org if you would like to give it a try.