NVIDIA / modulus

Open-source deep-learning framework for building, training, and fine-tuning deep learning models using state-of-the-art Physics-ML methods
https://developer.nvidia.com/modulus
Apache License 2.0

🐛[BUG]: DistributedManager gets silently initialized as a single process job if instantiated before initializing #474

Closed akshaysubr closed 1 month ago

akshaysubr commented 2 months ago

Version

main

On which installation method(s) does this occur?

Source

Describe the issue

This works as expected:

In [1]: from modulus.distributed import DistributedManager

In [2]: DistributedManager.is_initialized()
Out[2]: False

In [3]: DistributedManager.initialize()

In [4]: DistributedManager.is_initialized()
Out[4]: True

In [5]: manager = DistributedManager()

In [6]: manager._initialization_method
Out[6]: 'None'

but this does not:

In [1]: from modulus.distributed import DistributedManager

In [2]: manager = DistributedManager()

In [3]: manager._initialization_method
Out[3]: 'None'

In [4]: manager.is_initialized()
Out[4]: True

Minimum reproducible example

In [1]: from modulus.distributed import DistributedManager

In [2]: manager = DistributedManager()

In [3]: manager._initialization_method
Out[3]: 'None'

In [4]: manager.is_initialized()
Out[4]: True

In [5]: manager.initialize()
/code/modulus-core/modulus/distributed/manager.py:302: UserWarning: Distributed manager is already intialized
  warn("Distributed manager is already intialized")

akshaysubr commented 2 months ago

One reason this happens is that the initialization check in DistributedManager is based on the size of DistributedManager._shared_state: https://github.com/NVIDIA/modulus/blob/main/modulus/distributed/manager.py#L194-L197 Instantiating the manager before calling initialize() appears to populate that shared state, so the check then reports True.
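
To illustrate the failure mode, here is a minimal toy Borg-style class. It is not the actual Modulus code: it only assumes, as described above, that the check is based on the size of `_shared_state`, and that the constructor writes single-process defaults into that shared state. The names `ToyManager` and `_rank` are hypothetical.

```python
class ToyManager:
    """Toy Borg: all instances share one attribute dict (hypothetical stand-in)."""

    _shared_state = {}

    def __init__(self):
        # Borg pattern: every instance uses the same __dict__.
        self.__dict__ = self._shared_state
        # Assumed behaviour: the constructor fills in single-process defaults.
        if not hasattr(self, "_rank"):
            self._rank = 0

    @classmethod
    def is_initialized(cls):
        # Size-based check: passes as soon as anything is stored.
        return len(cls._shared_state) > 0


before = ToyManager.is_initialized()
ToyManager()  # constructed without any explicit initialization step
after = ToyManager.is_initialized()
print(before, after)
```

In this sketch the check flips from False to True purely as a side effect of constructing an instance, which mirrors the behaviour reported in the issue.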

This silent initialization can be caught by adding an explicit _is_initialized member to the Borg class and setting it to True only in the initialize method.
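
A rough sketch of that idea, again on a toy class rather than the real DistributedManager. The class name `FlaggedManager` and the `_rank` default are hypothetical; only the `_is_initialized` flag and the `initialize` method come from the suggestion above.

```python
class FlaggedManager:
    """Toy Borg with an explicit flag instead of a size-based check."""

    _shared_state = {}
    _is_initialized = False  # class-level flag, independent of _shared_state

    def __init__(self):
        self.__dict__ = self._shared_state
        # Defaults may still be written here without affecting the flag.
        if not hasattr(self, "_rank"):
            self._rank = 0

    @classmethod
    def is_initialized(cls):
        return cls._is_initialized

    @classmethod
    def initialize(cls):
        # Real distributed setup would happen here; the flag is set
        # only on this code path.
        cls._is_initialized = True


FlaggedManager()  # instantiation alone no longer counts as initialization
flag_after_construct = FlaggedManager.is_initialized()
FlaggedManager.initialize()
flag_after_init = FlaggedManager.is_initialized()
print(flag_after_construct, flag_after_init)
```

Because the flag lives outside `_shared_state`, populating defaults in the constructor cannot make `is_initialized()` return True by accident.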

akshaysubr commented 2 months ago

@tge25 @dallasfoster Would this be a better way to prevent accidental usage of the DistributedManager before it is initialized?

In [1]: from modulus.distributed import DistributedManager

In [2]: DistributedManager.is_initialized()
Out[2]: False

In [3]: manager = DistributedManager()
---------------------------------------------------------------------------
ModulusUninitializedDistributedManagerWarning  Traceback (most recent call last)
Cell In[3], line 1
----> 1 manager = DistributedManager()

File /code/modulus-core/modulus/distributed/manager.py:115, in DistributedManager.__init__(self)
    113 def __init__(self):
    114     if not self._is_initialized:
--> 115         raise ModulusUninitializedDistributedManagerWarning()
    116     super().__init__()

ModulusUninitializedDistributedManagerWarning: Instantiating DistributedManager before calling DistributedManager.initialize is not recommended
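
The guard shown in the traceback could look roughly like the following on a toy class. This is a sketch under assumptions, not the actual change: `UninitializedManagerError` is a hypothetical stand-in for the exception named above (whose base class is not shown in this thread), and `GuardedManager` is a hypothetical stand-in for DistributedManager.

```python
class UninitializedManagerError(Exception):
    """Hypothetical stand-in for the exception shown in the traceback."""

    def __init__(self):
        super().__init__(
            "Instantiating the manager before calling initialize() "
            "is not recommended"
        )


class GuardedManager:
    """Toy Borg that refuses to be constructed before initialize()."""

    _shared_state = {}
    _is_initialized = False

    def __init__(self):
        # Fail loudly instead of silently acting as a single-process job.
        if not self._is_initialized:
            raise UninitializedManagerError()
        self.__dict__ = self._shared_state

    @classmethod
    def initialize(cls):
        cls._is_initialized = True


try:
    GuardedManager()
    raised = False
except UninitializedManagerError:
    raised = True

GuardedManager.initialize()
manager = GuardedManager()  # succeeds once initialize() has run
```

One trade-off of raising rather than warning is that existing code which constructs the manager early would break outright, which may or may not be the desired behaviour.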