NVIDIA / modulus

Open-source deep-learning framework for building, training, and fine-tuning deep learning models using state-of-the-art Physics-ML methods
https://developer.nvidia.com/modulus
Apache License 2.0
937 stars 219 forks

🚀[FEA]: Distributed Training/Inference: handle scatter/gather better and more consistently #520

Open stadlmax opened 4 months ago

stadlmax commented 4 months ago

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request

Low (would be nice)

Please provide a clear description of problem you would like to solve.

The problem arises in model-parallel settings where not all ranks hold valid tensors, mainly in the gather and scatter routines.
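To make the situation concrete, here is a minimal pure-Python sketch of a scatter and a gather in which only one rank holds valid data and every other rank holds `None`. It is not Modulus code and does not use `torch.distributed`; the names `simulate_scatter` and `simulate_gather`, the list-based "tensors", and the even-split chunking rule are all assumptions made for illustration. The point it tries to show is that ranks without a valid tensor cannot infer sizes locally, so some metadata has to be shared before buffers can be allocated.

```python
from typing import List, Optional


def simulate_scatter(
    inputs_per_rank: List[Optional[List[float]]], src: int
) -> List[List[float]]:
    """Simulate a scatter where only rank `src` holds a valid input.

    Hypothetical helper: each list entry stands in for what one rank holds.
    """
    world_size = len(inputs_per_rank)
    full = inputs_per_rank[src]
    if full is None:
        raise ValueError("the source rank must hold a valid tensor")

    # Step 1: the source derives the chunk sizes. In a real distributed run
    # this metadata would have to be communicated, because the other ranks
    # hold None and cannot compute it themselves.
    base, rem = divmod(len(full), world_size)
    chunk_sizes = [base + (1 if rank < rem else 0) for rank in range(world_size)]

    # Step 2: every rank allocates a receive buffer from the shared metadata.
    buffers = [[0.0] * size for size in chunk_sizes]

    # Step 3: the source "sends" each rank its chunk.
    offset = 0
    for rank, size in enumerate(chunk_sizes):
        buffers[rank][:] = full[offset:offset + size]
        offset += size
    return buffers


def simulate_gather(
    chunks_per_rank: List[List[float]], dst: int
) -> List[Optional[List[float]]]:
    """Simulate a gather: only rank `dst` ends up with a valid result."""
    gathered = [value for chunk in chunks_per_rank for value in chunk]
    return [gathered if rank == dst else None for rank in range(len(chunks_per_rank))]


# Rank 1 is the only rank with valid data before the scatter.
scattered = simulate_scatter([None, [1.0, 2.0, 3.0, 4.0, 5.0], None], src=1)
# After the gather, only rank 0 holds a valid result.
collected = simulate_gather(scattered, dst=0)
```

In this sketch the asymmetry is explicit: before the scatter and after the gather, most ranks hold `None`, and any routine that touches those placeholders needs a consistent convention for them.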

Scatter

Gather

Potential Solution

Describe any alternatives you have considered

No response