NVIDIA / modulus

Open-source deep-learning framework for building, training, and fine-tuning deep learning models using state-of-the-art Physics-ML methods
https://developer.nvidia.com/modulus
Apache License 2.0

Safeguarding against usage of uninitialized DistributedManager #475

Closed: akshaysubr closed this 4 months ago

akshaysubr commented 5 months ago

Modulus Pull Request

Description

closes #474

Should be merged in after #469

Checklist

Dependencies

None

akshaysubr commented 5 months ago

/blossom-ci

mnabian commented 5 months ago

/blossom-ci

akshaysubr commented 5 months ago

/blossom-ci

akshaysubr commented 5 months ago

/blossom-ci

azrael417 commented 5 months ago

What it does makes sense. Is this fix supposed to safeguard against the case where the DM is initialized but an uninitialized distributed group is requested?

akshaysubr commented 4 months ago

@azrael417 Not quite. This PR safeguards against using the manager before DistributedManager.initialize() has been called. There was a bug in CorrDiff where this was happening silently, causing a multi-GPU job to behave like independent single-GPU jobs, since single-GPU is the default.
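For illustration, the kind of guard described above can be sketched roughly as follows. This is a minimal toy and not the Modulus implementation: `SketchDistributedManager` and `UninitializedManagerError` are hypothetical names, and reading `RANK` and `WORLD_SIZE` from the environment is an assumption about how the launcher passes them in.

```python
import os


class UninitializedManagerError(RuntimeError):
    """Hypothetical error type for this sketch; not a Modulus class."""


class SketchDistributedManager:
    """Toy stand-in for a shared-state distributed manager (not the real class)."""

    # State shared by every instance. The defaults describe a single-process
    # job, which is why silently falling back to them hides a missing
    # initialize() call in a multi-GPU run.
    _state = {"initialized": False, "rank": 0, "world_size": 1}

    def __init__(self):
        # The guard: refuse to hand out a manager that was never initialized,
        # instead of quietly returning single-process defaults.
        if not self._state["initialized"]:
            raise UninitializedManagerError(
                "Call SketchDistributedManager.initialize() before "
                "instantiating the manager."
            )

    @classmethod
    def initialize(cls):
        # Assumption: the launcher exports RANK and WORLD_SIZE; fall back to
        # single-process values when they are absent.
        cls._state["rank"] = int(os.environ.get("RANK", "0"))
        cls._state["world_size"] = int(os.environ.get("WORLD_SIZE", "1"))
        cls._state["initialized"] = True

    @classmethod
    def is_initialized(cls):
        return cls._state["initialized"]

    @property
    def rank(self):
        return self._state["rank"]

    @property
    def world_size(self):
        return self._state["world_size"]
```

With a guard like this, calling `SketchDistributedManager()` before `SketchDistributedManager.initialize()` fails loudly, so a multi-GPU job cannot silently run as independent single-GPU jobs.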

akshaysubr commented 4 months ago

/blossom-ci

akshaysubr commented 4 months ago

/blossom-ci