liaopeiyuan closed this issue 3 years ago.
Hi @liaopeiyuan we don't support model parallelism yet, but we are actively taking steps towards supporting distributed training engines beyond Horovod, which will help us get there.
Are the features mentioned in the document already implemented in the main branch or are they in progress?
There's a fair bit of dependent work that needs to happen before we work on the feature outlined in the link I shared earlier.
I'll post an update once we commence.
@liaopeiyuan Would you please help us understand the kinds of models you're attempting to train via model parallelism, so we can test with them, as we develop the feature? Also, how do you do model-parallel training today?
We are training a multi-task model (multi-input, multi-output) that requires splitting the model across multiple GPUs within a single node, with data parallelism between nodes.
Today we use either DeepSpeed or a combination of GPipe + Horovod, depending on the situation. The problem for us right now is that neither setup is elastic.
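For concreteness, the intra-node split looks roughly like the toy PyTorch sketch below: two stages of one model pinned to different devices, with activations moved across the device boundary in `forward`. The model, layer sizes, and class name are illustrative only, not Determined or DeepSpeed API; in practice a data-parallel wrapper (e.g. Horovod) would run one such model per node.

```python
# Hedged sketch: minimal two-stage model parallelism within one worker.
# Falls back to CPU when fewer than two GPUs are available, so the
# structure is the point, not the placement.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        n_gpus = torch.cuda.device_count()
        self.dev0 = torch.device("cuda:0") if n_gpus >= 1 else torch.device("cpu")
        self.dev1 = torch.device("cuda:1") if n_gpus >= 2 else self.dev0
        # Stage 1 lives on dev0, stage 2 on dev1.
        self.stage1 = nn.Linear(32, 64).to(self.dev0)
        self.stage2 = nn.Linear(64, 10).to(self.dev1)

    def forward(self, x):
        h = torch.relu(self.stage1(x.to(self.dev0)))
        # The activation crosses the device boundary here.
        return self.stage2(h.to(self.dev1))

model = TwoStageModel()
out = model(torch.randn(4, 32))
print(tuple(out.shape))  # (4, 10)
```

GPipe and DeepSpeed's pipeline engine automate this kind of partitioning and add micro-batch scheduling on top; the sketch only shows the basic split.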
Thanks for the context @liaopeiyuan, that helps.
We've recently been discussing how to integrate DeepSpeed, FairScale, and PipeDream into Determined, so knowing there is active interest and usage helps us plan and prioritize.
I'll close out this issue.
Please feel free to open another if you have additional questions; better yet, join our Slack channel.
@liaopeiyuan Hi Peiyuan, we are developing an autotuner for our DeepSpeed integration, which automatically finds a configuration optimized for training speed. Would you be interested in learning more about it and getting early access? I came across this GitHub issue and wanted to check with you. I'm also interested in any recent thoughts you have about Determined.
E.g., can I use multiple GPUs for a single worker and split the model between them? Thanks!