eole-nlp / eole

Open language modeling toolkit based on PyTorch
https://eole-nlp.github.io/eole
MIT License

Error when training WMT with multi gpu #131

isanvicente opened this issue 1 day ago (status: Open)

isanvicente commented 1 day ago

Hi!

My setup:

Traceback (most recent call last):
  File "/mnt/nfs/NMT/eole/eole/utils/distributed_workers.py", line 48, in spawned_train
    process_fn(config, device_id=device_id)
  File "/mnt/nfs/NMT/eole/eole/train_single.py", line 242, in main
    trainer.train(
  File "/mnt/nfs/NMT/eole/eole/trainer.py", line 337, in train
    eole.utils.distributed.all_gather_list(normalization)
AttributeError: module 'eole.utils' has no attribute 'distributed'

    self._shutdown_workers()
  File "/mnt/nfs/NMT/venv_eole_gpu4/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1443, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.10/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 40, in wait
    if not wait([self.sentinel], timeout):
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt: 
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)

Solved it by adding the missing import (from eole.utils import distributed) to trainer.py.
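For anyone curious why the fix works: importing a package does not automatically import its submodules, so eole.utils.distributed is only available as an attribute once something has explicitly imported it. The sketch below reproduces the failure mode with a throwaway package (mypkg and helpers are hypothetical names, not part of eole):

```python
import os
import sys
import tempfile

# Build a tiny package "mypkg" with a submodule "helpers", mirroring the
# eole.utils / eole.utils.distributed layout (illustrative names only).
tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "mypkg")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "helpers.py"), "w") as f:
    f.write("def greet():\n    return 'hi'\n")

sys.path.insert(0, tmp)
import mypkg

# Accessing the submodule as an attribute fails: "import mypkg" does NOT
# load mypkg.helpers, just like eole.utils.distributed was unavailable.
err_msg = None
try:
    mypkg.helpers
except AttributeError as e:
    err_msg = str(e)

# The fix is the same as in trainer.py: an explicit submodule import,
# which also binds "helpers" as an attribute of mypkg as a side effect.
from mypkg import helpers  # noqa: F401

result = mypkg.helpers.greet()
```

This is why #116 removing "some explicit imports" elsewhere could break trainer.py: the attribute access only worked as long as another module had already imported the submodule.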

Hope it helps,

cheers!

francoishernandez commented 1 day ago

Thanks for the report! It seems this issue was introduced in #116 which removed some explicit imports that hid the wonky behaviour. Would you mind opening a quick PR with your fix? Thanks!

isanvicente commented 1 day ago

Will do! Thanks for the quick response!