garcesote opened 10 months ago
Have you looked for this bug somewhere else? It doesn't seem to be related to Asteroid.
https://discuss.pytorch.org/t/runtimeerror-distributed-package-doesnt-have-nccl-built-in/176744
Reading the link, it seems that when training with my GPU, the recipe tries to use NCCL. However, I'm training the model on Windows, where NCCL is not available. Any ideas how I can solve this? Do I have to try another OS, or is there a way to train without NCCL?
I'm sorry but I have no idea.
Hi,
I'm trying to train the librimix recipe code and I'm getting the same error when I try to use my GPU for training:
RuntimeError("Distributed package doesn't have NCCL " "built in")
torch.cuda.current_device() returns GPU 0 in my Python session, but when I run:
./run.sh --stage 2 --id 0
to train the model on the GPU, it raises that runtime error.
Is it necessary to have NCCL on my system to train the example, or am I just making a mistake in the training process?
Here is my complete output, in case it helps anyone:
Results from the following experiment will be stored in exp/train_convtasnet_4a19572d
Stage 2: Training
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [cie-dpt-71969.dyc.a.unavarra.es]:53168 (system error: 10049 - La dirección solicitada no es válida en este contexto. [The requested address is not valid in this context.]).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [cie-dpt-71969.dyc.a.unavarra.es]:53168 (system error: 10049 - La dirección solicitada no es válida en este contexto.).
{'data': {'n_src': 2, 'sample_rate': 8000, 'segment': 3, 'task': 'sep_clean', 'train_dir': 'data/wav8k/min/metadata/train-360', 'valid_dir': 'data/wav8k/min/metadata/dev'}, 'filterbank': {'kernel_size': 16, 'n_filters': 512, 'stride': 8}, 'main_args': {'exp_dir': 'exp/train_convtasnet_4a19572d', 'help': None}, 'masknet': {'bn_chan': 128, 'hid_chan': 512, 'mask_act': 'relu', 'n_blocks': 8, 'n_repeats': 3, 'skip_chan': 128}, 'optim': {'lr': 0.001, 'optimizer': 'adam', 'weight_decay': 0.0}, 'positional arguments': {}, 'training': {'batch_size': 24, 'early_stop': True, 'epochs': 200, 'half_lr': True, 'num_workers': 4}}
Drop 0 utterances from 50800 (shorter than 3 seconds)
Drop 0 utterances from 3000 (shorter than 3 seconds)
Traceback (most recent call last):
  File "C:\Users\jaulab\Desktop\SourceSeparation\asteroid\egs\librimix\ConvTasNet\train.py", line 143, in <module>
    main(arg_dic)
  File "C:\Users\jaulab\Desktop\SourceSeparation\asteroid\egs\librimix\ConvTasNet\train.py", line 109, in main
    trainer.fit(system)
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\trainer\call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\strategies\launchers\subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 938, in _run
    self.strategy.setup_environment()
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\strategies\ddp.py", line 143, in setup_environment
    self.setup_distributed()
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\strategies\ddp.py", line 191, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\lightning_fabric\utilities\distributed.py", line 258, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in

Thank you in advance.
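For context on the error at the bottom of the traceback: torch raises it when the requested backend (NCCL here) was not compiled into the installed build. The same backend selection can be exercised directly with torch.distributed in a single-process group, a minimal sketch assuming a standard PyTorch install where gloo is available:

```python
import os
import torch.distributed as dist

# Windows (and CPU-only) wheels of PyTorch are built without NCCL,
# but the gloo backend is included. Initializing a one-process group
# with gloo succeeds where backend="nccl" would raise the
# "Distributed package doesn't have NCCL built in" RuntimeError.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="gloo", rank=0, world_size=1)
backend = dist.get_backend()
print(backend)  # gloo
dist.destroy_process_group()
```

`dist.is_nccl_available()` and `dist.is_gloo_available()` can also be queried up front to see which backends the installed build supports.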
was configured so 100% of the batches per epoch will be used.. Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1 [W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [cie-dpt-71969.dyc.a.unavarra.es]:53168 (system error: 10049 - La direcci▒n solicitada no es v▒lida en este contexto.). [W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [cie-dpt-71969.dyc.a.unavarra.es]:53168 (system error: 10049 - La direcci▒n solicitada no es v▒lida en este contexto.). {'data': {'n_src': 2, 'sample_rate': 8000, 'segment': 3, 'task': 'sep_clean', 'train_dir': 'data/wav8k/min/metadata/train-360', 'valid_dir': 'data/wav8k/min/metadata/dev'}, 'filterbank': {'kernel_size': 16, 'n_filters': 512, 'stride': 8}, 'main_args': {'exp_dir': 'exp/train_convtasnet_4a19572d', 'help': None}, 'masknet': {'bn_chan': 128, 'hid_chan': 512, 'mask_act': 'relu', 'n_blocks': 8, 'n_repeats': 3, 'skip_chan': 128}, 'optim': {'lr': 0.001, 'optimizer': 'adam', 'weight_decay': 0.0}, 'positional arguments': {}, 'training': {'batch_size': 24, 'early_stop': True, 'epochs': 200, 'half_lr': True, 'num_workers': 4}} Drop 0 utterances from 50800 (shorter than 3 seconds) Drop 0 utterances from 3000 (shorter than 3 seconds) Traceback (most recent call last): File "C:\Users\jaulab\Desktop\SourceSeparation\asteroid\egs\librimix\ConvTasNet\train.py", line 143, inThank you in advance.