determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
https://determined.ai
Apache License 2.0
3.01k stars 350 forks

Is model parallelism supported on PyTorch? #2215

Closed liaopeiyuan closed 3 years ago

liaopeiyuan commented 3 years ago

For example, may I use multiple GPUs for a single worker and split the model between them? Thanks!

vishnu2kmohan commented 3 years ago

Hi @liaopeiyuan, we don't support model parallelism yet, but we are actively taking steps towards supporting distributed training engines beyond Horovod, which will help us get there.

liaopeiyuan commented 3 years ago

Are the features mentioned in the document already implemented in the main branch or are they in progress?

vishnu2kmohan commented 3 years ago

There's a fair bit of dependent work that needs to happen before we work on the feature outlined in the link I shared earlier.

I'll post an update once we commence.

@liaopeiyuan Would you please help us understand the kinds of models you're attempting to train via model parallelism, so we can test with them, as we develop the feature? Also, how do you do model-parallel training today?

liaopeiyuan commented 3 years ago

We are training a multi-task model (multi-input, multi-output) that has to be split across multiple GPUs inside a single node, with data parallelism between nodes.

Today we use DeepSpeed or a combination of GPipe + Horovod, depending on the situation. The problem for us right now is that neither is elastic.

vishnu2kmohan commented 3 years ago

Thanks for the context @liaopeiyuan, that helps.

We've recently been discussing how to integrate DeepSpeed, FairScale, and PipeDream into Determined, so knowing there is active interest and usage helps us plan and prioritize.

vishnu2kmohan commented 3 years ago

I'll close out this issue.

Please feel free to open another if you have additional questions, or, preferably, join our Slack channel.

shiyuann commented 1 year ago

@liaopeiyuan Hi Peiyuan, we are developing an autotuner for our DeepSpeed integration, which lets users train with a configuration that is automatically optimized for training speed. Would you be interested in learning more about it and getting early access? I came across this GitHub issue and wanted to check with you. Also, I'd be interested to hear any recent thoughts you have about Determined.