Netflix / metaflow

:rocket: Build and manage real-life ML, AI, and data science projects with ease!
Apache License 2.0

Support HPC & GPU local clusters #52

Open oguitart opened 4 years ago

oguitart commented 4 years ago

Is it possible to use a local HPC or GPU cluster? I understand that it works perfectly with AWS, but what about when AWS is not an option and other resources are available? Can it be configured to use those resources? Thanks,


savingoyal commented 4 years ago

Please check #29 .

romain-intel commented 4 years ago

To add a bit of flavor on #29 which was very specific to slurm and accelerators, the batch plugin that is currently included is basically a thin way of launching a process on a remote machine (in this case a batch instance). You could write a similar plugin for your specific environment. You would basically need to provide:

Without knowing your exact setup, it's hard for me to help further but hopefully this helps a little bit.

oguitart commented 4 years ago

Thank you for the explanation. We have an HPC cluster with SGE and a GPU cluster with Slurm, so I think your suggestion of creating a plugin is the right way to go. I need to start checking the documentation and code to understand how to create this kind of plugin.
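For reference, the "thin way of launching a process on a remote machine" idea could be sketched for these two schedulers roughly as follows. This is only an illustration, not Metaflow's plugin interface: `build_submit_command`, its parameters, and the `train.py` task are hypothetical names, and the SGE GPU resource name is site-specific, so it is passed in rather than assumed. The only real pieces are the standard `sbatch` and `qsub` options.

```python
import shlex
from typing import List, Optional


def build_submit_command(
    scheduler: str,
    task_command: List[str],
    job_name: str = "metaflow-task",
    gpus: int = 0,
    sge_gpu_resource: Optional[str] = None,
) -> List[str]:
    """Build the argv that would submit `task_command` to a scheduler.

    Hypothetical helper for illustration; it only builds the command
    and does not submit anything.
    """
    if scheduler == "slurm":
        # --parsable makes sbatch print just the job id on success.
        argv = ["sbatch", "--parsable", f"--job-name={job_name}"]
        if gpus > 0:
            argv.append(f"--gres=gpu:{gpus}")
        # --wrap takes one shell string, so quote the task safely.
        argv.append(f"--wrap={shlex.join(task_command)}")
        return argv

    if scheduler == "sge":
        # -b y treats the command as a binary rather than a job script;
        # -cwd runs the job from the submission directory.
        argv = ["qsub", "-b", "y", "-cwd", "-N", job_name]
        # The GPU resource name depends on how the site configured SGE,
        # so it is only added when the caller supplies one.
        if gpus > 0 and sge_gpu_resource:
            argv += ["-l", f"{sge_gpu_resource}={gpus}"]
        return argv + task_command

    raise ValueError(f"unsupported scheduler: {scheduler!r}")


# Hypothetical task: any command line that runs one unit of work.
task = ["python", "train.py", "--epochs", "5"]
print(build_submit_command("slurm", task, job_name="demo", gpus=1))
print(build_submit_command("sge", task, job_name="demo"))
```

The returned list could then be handed to `subprocess.run(argv, check=True, capture_output=True, text=True)` on a submit node. A real plugin would also need to track job status and ship code and data to the workers, which this sketch leaves out.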

romain-intel commented 4 years ago

We currently have little documentation regarding the internals of Metaflow (our initial release is primarily targeted at users of metaflow rather than the developers of metaflow). Feel free, however, to ask any question here. To get you started, the batch plugin is in plugins/aws/batch. In there, a few things to keep in mind:

Let me know if you have more questions. As I mentioned, I'm happy to give you more information or help if needed. Please also see the caveats I posted in #29.

oguitart commented 4 years ago

Thank you very much for all the information. I'll let you know if I have any questions.

dgasmith commented 4 years ago

You may want to consider opening this up to arbitrary workload managers such as Parsl, Dask (job queue), RADICAL, etc. While these are typically full workflow/workload managers, their core workload capability can be used to execute arbitrary tasks on these machines. Some serious thought will need to go into this, since you manage your own environments as well, but there has been some pretty hefty work on getting these kinds of task management systems onto SLURM/PBS and general academic clusters/leadership platforms.

IanQS commented 2 years ago

Is it possible to run metaflow across a local cluster of machines? I've got a cluster of machines locally and I'd rather use that than prematurely deploy to AWS when I may not need it. I've tried googling "metaflow local cluster", but this issue was the first result and the rest didn't look particularly relevant (all of them advocate training and verifying locally before scaling to AWS).

savingoyal commented 2 years ago

@IanQS If you can deploy Kubernetes on top of this cluster, then our latest release, which adds support for Kubernetes, will get you going.