clessig / atmorep

AtmoRep model code
MIT License

Performance bug for multi-field configuration in train_continue and num_loader_workers >= 2 #4

Closed: iluise closed 2 weeks ago

iluise commented 9 months ago

Opening a performance bug for the following error, which occurs when training the multiformer on a large number of nodes (in this case 32) with num_loader_workers >= 2:

19:   warnings.warn(_create_warning_msg(
19: Traceback (most recent call last):
19:   File "/p/project/atmo-rep/ilaria/atmorep/atmorep_github/atmorep/atmorep/core/train_multi12h.py", line 239, in <module>
19:     train_continue( model_id, model_epoch, Trainer, model_epoch_continue)
19:   File "/p/project/atmo-rep/ilaria/atmorep/atmorep_github/atmorep/atmorep/core/train_multi12h.py", line 68, in train_continue
19:     trainer.run( model_epoch_continue)
19:   File "/p/project/atmo-rep/ilaria/atmorep/atmorep_github/atmorep/atmorep/core/trainer.py", line 206, in run
19:     self.train( epoch)
19:   File "/p/project/atmo-rep/ilaria/atmorep/atmorep_github/atmorep/atmorep/core/trainer.py", line 234, in train
19:     model.mode( NetMode.train)
19:   File "/p/project/atmo-rep/ilaria/atmorep/atmorep_github/atmorep/atmorep/core/atmorep_model.py", line 167, in mode
19:     self.data_loader_iter = iter(self.data_loader_train)
19:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 438, in __iter__
19:     return self._get_iterator()
19:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 386, in _get_iterator
19:     return _MultiProcessingDataLoaderIter(self)
19:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1039, in __init__
19:     w.start()
19:   File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
19:     self._popen = self._Popen(self)
19:   File "/usr/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
19:     return _default_context.get_context().Process._Popen(process_obj)
19:   File "/usr/lib/python3.10/multiprocessing/context.py", line 281, in _Popen
19:     return Popen(process_obj)
19:   File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
19:     self._launch(process_obj)
19:   File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 66, in _launch
19:     self.pid = os.fork()
19: OSError: [Errno 12] Cannot allocate memory
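
The trace shows `iter(self.data_loader_train)` starting DataLoader workers through `popen_fork`, and `os.fork()` failing with `ENOMEM`. One possible mitigation, sketched below and not tested against AtmoRep, is to start workers with a non-fork method and keep them alive across epochs, so that a large training process is not forked each time `iter()` is called. `make_loader_kwargs` is a hypothetical helper, not part of the repository; `multiprocessing_context` and `persistent_workers` are existing `torch.utils.data.DataLoader` arguments. Note that `spawn` and `forkserver` require the dataset to be picklable, which is an assumption here.

```python
import multiprocessing as mp


def make_loader_kwargs(num_workers, start_method="forkserver"):
    """Build keyword arguments for torch.utils.data.DataLoader.

    Hypothetical helper for illustration only, not part of AtmoRep.
    """
    if num_workers < 0:
        raise ValueError("num_workers must be >= 0")
    if num_workers == 0:
        # Single-process loading: DataLoader rejects multiprocessing_context
        # and persistent_workers when there are no worker processes.
        return {"num_workers": 0}
    if start_method not in mp.get_all_start_methods():
        raise ValueError(
            f"start method {start_method!r} is not available on this platform"
        )
    return {
        "num_workers": num_workers,
        # Start workers without os.fork() of the large training process.
        "multiprocessing_context": start_method,
        # Reuse the same workers when iter() is called again on the loader.
        "persistent_workers": True,
    }


# Usage sketch (requires torch; `dataset` is a placeholder):
# from torch.utils.data import DataLoader
# loader = DataLoader(dataset, **make_loader_kwargs(2))
```

Whether this avoids the error depends on where the memory pressure comes from; if the first worker start already fails, only the change of start method (or fewer workers per node) would be expected to help.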