NVIDIA / modulus

Open-source deep-learning framework for building, training, and fine-tuning deep learning models using state-of-the-art Physics-ML methods
https://developer.nvidia.com/modulus
Apache License 2.0

🐛[BUG]: Dangerous treatment of None in DistributedManager #406

Closed azrael417 closed 3 months ago

azrael417 commented 3 months ago

Version

current main

Describe the issue

This behavior is dangerous: when the DistributedManager is queried for the size of a group whose name was never created, the group lookup will first return None here:

https://github.com/NVIDIA/modulus/blob/0e3da620efec3101fbda62b85b33dd862b945e09/modulus/distributed/manager.py#L131

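and pass that None on to dist.get_world_size(), which treats a None group as the default process group and therefore returns the world size. A typo in a group name thus silently yields the world size instead of failing. I think it would be better to raise an error (probably the cleanest option), or to return 1 for the size and 0 for the rank.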
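
A minimal sketch of the failure mode and of the error-out alternative. Everything here is a hypothetical stand-in so that it runs without initializing a process group: `fake_get_world_size` only imitates the "None means the default group" convention of `torch.distributed.get_world_size`, and `LenientManager` / `StrictManager` are illustrative classes, not the actual Modulus `DistributedManager` code.

```python
from typing import Dict, List, Optional

WORLD_SIZE = 8  # assumed world size, for illustration only


def fake_get_world_size(group: Optional[List[int]] = None) -> int:
    """Stand-in for torch.distributed.get_world_size.

    Mimics the convention that group=None refers to the default (world) group.
    """
    return WORLD_SIZE if group is None else len(group)


class LenientManager:
    """Hypothetical manager that returns None for unknown group names."""

    def __init__(self) -> None:
        # One registered group made of ranks 0 and 1.
        self._groups: Dict[str, List[int]] = {"model_parallel": [0, 1]}

    def group(self, name: Optional[str] = None) -> Optional[List[int]]:
        # Unknown names fall through to None, the same value that
        # means "default group" further down the call chain.
        return self._groups.get(name)

    def group_size(self, name: Optional[str] = None) -> int:
        return fake_get_world_size(self.group(name))


class StrictManager(LenientManager):
    """Hypothetical variant that errors out on unknown group names."""

    def group(self, name: Optional[str] = None) -> Optional[List[int]]:
        if name is None:
            return None  # explicit request for the default group
        if name not in self._groups:
            raise KeyError(f"process group {name!r} was never created")
        return self._groups[name]


lenient = LenientManager()
strict = StrictManager()

# A misspelled group name silently reports the world size.
print(lenient.group_size("model_paralel"))

# The strict variant still works for registered groups ...
print(strict.group_size("model_parallel"))

# ... but refuses to guess for unknown ones.
try:
    strict.group_size("model_paralel")
except KeyError as err:
    print("error:", err)
```

The other option mentioned above (returning 1 for the size and 0 for the rank) would also avoid the silent fallback, but raising keeps a misspelled name from going unnoticed.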

Please let me know what you think.
