asteroid-team / asteroid

The PyTorch-based audio source separation toolkit for researchers
https://asteroid-team.github.io/
MIT License

RuntimeError related to NCCL when training the librimix recipe #683

garcesote opened this issue 10 months ago

garcesote commented 10 months ago

Hi,

I'm trying to train the librimix recipe and I keep getting the following error whenever I try to use my GPU for training:

RuntimeError("Distributed package doesn't have NCCL " "built in")

torch.cuda.current_device() returns device 0 in my Python session, but when I launch training like this:

./run.sh --stage 2 --id 0

to train the model on the GPU, it fails with that runtime error.

Is it necessary to have NCCL on my system to train the example, or am I making a mistake somewhere in the training process?
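
For context, here is a quick standalone check of which distributed backends a PyTorch build actually exposes (a minimal sketch I put together, not part of the recipe):

# Standalone diagnostic: report which torch.distributed backends this PyTorch build provides.
# Not part of the Asteroid recipe.
import torch
import torch.distributed as dist

print("CUDA available:", torch.cuda.is_available())    # True on my machine
print("Current device:", torch.cuda.current_device())  # 0
print("NCCL available:", dist.is_nccl_available())     # Windows builds of PyTorch ship without NCCL
print("Gloo available:", dist.is_gloo_available())     # gloo is the usual fallback on Windows

If NCCL really isn't compiled into my build, the NCCL line should print False, which would match the error above.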

Here is my complete output, in case it helps anyone diagnose the problem:

Results from the following experiment will be stored in exp/train_convtasnet_4a19572d
Stage 2: Training
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used..
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [cie-dpt-71969.dyc.a.unavarra.es]:53168 (system error: 10049 - La dirección solicitada no es válida en este contexto.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [cie-dpt-71969.dyc.a.unavarra.es]:53168 (system error: 10049 - La dirección solicitada no es válida en este contexto.).
{'data': {'n_src': 2, 'sample_rate': 8000, 'segment': 3, 'task': 'sep_clean', 'train_dir': 'data/wav8k/min/metadata/train-360', 'valid_dir': 'data/wav8k/min/metadata/dev'},
 'filterbank': {'kernel_size': 16, 'n_filters': 512, 'stride': 8},
 'main_args': {'exp_dir': 'exp/train_convtasnet_4a19572d', 'help': None},
 'masknet': {'bn_chan': 128, 'hid_chan': 512, 'mask_act': 'relu', 'n_blocks': 8, 'n_repeats': 3, 'skip_chan': 128},
 'optim': {'lr': 0.001, 'optimizer': 'adam', 'weight_decay': 0.0},
 'positional arguments': {},
 'training': {'batch_size': 24, 'early_stop': True, 'epochs': 200, 'half_lr': True, 'num_workers': 4}}
Drop 0 utterances from 50800 (shorter than 3 seconds)
Drop 0 utterances from 3000 (shorter than 3 seconds)
Traceback (most recent call last):
  File "C:\Users\jaulab\Desktop\SourceSeparation\asteroid\egs\librimix\ConvTasNet\train.py", line 143, in <module>
    main(arg_dic)
  File "C:\Users\jaulab\Desktop\SourceSeparation\asteroid\egs\librimix\ConvTasNet\train.py", line 109, in main
    trainer.fit(system)
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\trainer\call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\strategies\launchers\subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 938, in _run
    self.strategy.setup_environment()
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\strategies\ddp.py", line 143, in setup_environment
    self.setup_distributed()
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\strategies\ddp.py", line 191, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\lightning_fabric\utilities\distributed.py", line 258, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in

Thank you in advance.

mpariente commented 10 months ago

Have you looked for this bug elsewhere? It doesn't seem to be related to Asteroid.

garcesote commented 10 months ago

https://discuss.pytorch.org/t/runtimeerror-distributed-package-doesnt-have-nccl-built-in/176744

Reading the link, it seems that when training with my GPU, the recipe tries to use NCCL. However, I'm training the model on Windows, where NCCL is not available. Any ideas how I can solve this? Do I have to switch to another OS, or is there a way to train without NCCL?
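
Would forcing a non-NCCL backend be the right direction? Here is a rough sketch of what I mean (assuming the Lightning Trainer built in egs/librimix/ConvTasNet/train.py can be given a strategy argument; I haven't checked the recipe's exact code), using gloo, which is the process-group backend PyTorch supports on Windows:

# Hypothetical change to how the Trainer is constructed in train.py:
# ask Lightning's DDP strategy for the "gloo" process-group backend instead of NCCL.
# Argument names are taken from the pytorch_lightning docs, not from the recipe itself.
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

trainer = Trainer(
    accelerator="gpu",
    devices=1,
    strategy=DDPStrategy(process_group_backend="gloo"),  # avoid NCCL on Windows
)

Or, if a single GPU is enough, dropping the distributed strategy altogether should avoid the process-group initialisation in the first place.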

mpariente commented 10 months ago

I'm sorry but I have no idea.