facebookresearch / fastMRI

A large-scale dataset of both raw MRI measurements and clinical MRI images.
https://fastmri.org
MIT License

Something wrong with the NCCL backend. #177

Closed NayeeC closed 2 years ago

NayeeC commented 2 years ago

Hi,

I just tried to run varnet_brain_leaderboard_20201111.py to train on the dataset again, but when I ran the code, the following error occurred:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Missing logger folder: varnet\brain_leaderboard\lightning_logs
WARNING:lightning:Missing logger folder: varnet\brain_leaderboard\lightning_logs
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
INFO:lightning:initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
Traceback (most recent call last):
  File "F:\ch\fastMRI\fastmri_examples\varnet\varnet_brain_leaderboard_20201111.py", line 200, in <module>
    run_cli()
  File "F:\ch\fastMRI\fastmri_examples\varnet\varnet_brain_leaderboard_20201111.py", line 196, in run_cli
    cli_main(args)
  File "F:\ch\fastMRI\fastmri_examples\varnet\varnet_brain_leaderboard_20201111.py", line 74, in cli_main
    trainer.fit(model, datamodule=data_module)
  File "D:\Anaconda3\envs\chtorch2\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 445, in fit
    results = self.accelerator_backend.train()
  File "D:\Anaconda3\envs\chtorch2\lib\site-packages\pytorch_lightning\accelerators\ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "D:\Anaconda3\envs\chtorch2\lib\site-packages\pytorch_lightning\accelerators\ddp_accelerator.py", line 238, in ddp_train
    self.init_ddp_connection(
  File "D:\Anaconda3\envs\chtorch2\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 183, in init_ddp_connection
    torch_distrib.init_process_group(
  File "D:\Anaconda3\envs\chtorch2\lib\site-packages\torch\distributed\distributed_c10d.py", line 523, in init_process_group
    default_pg = _new_process_group_helper(
  File "D:\Anaconda3\envs\chtorch2\lib\site-packages\torch\distributed\distributed_c10d.py", line 625, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL "
RuntimeError: Distributed package doesn't have NCCL built in
[The same traceback is then printed by each of the remaining DDP ranks; the output of the last two processes is interleaved, but the error is identical.]

Since I run the code under Windows, I changed `backend='ddp'` in `build_args()` to `backend=None`, but the error still occurred.

How can I run the training process on Windows?
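For context, the failure comes from `torch.distributed.init_process_group` being asked for a backend that this PyTorch build does not include: Windows builds of PyTorch ship without NCCL, so only `gloo` can initialize. A minimal sketch of the difference (the rendezvous address, rank, and world size below are placeholder values, not taken from the fastMRI script):

```python
import torch.distributed as dist

# On a Windows build, requesting NCCL raises the error from the traceback:
# dist.init_process_group(backend="nccl", init_method="tcp://127.0.0.1:23456",
#                         rank=0, world_size=1)
# RuntimeError: Distributed package doesn't have NCCL built in

# gloo is the backend that ships with Windows builds of PyTorch:
dist.init_process_group(
    backend="gloo",                       # collective backend available on Windows
    init_method="tcp://127.0.0.1:23456",  # placeholder rendezvous address
    rank=0,                               # placeholder: this process's rank
    world_size=1,                         # placeholder: total number of processes
)
dist.destroy_process_group()
```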

mmuckley commented 2 years ago

Hello @NayeeC, I'm not sure this is possible. We only test the code on Linux, and issues with other libraries are beyond what we can help with. I would raise an issue on the PyTorch Lightning repository about multi-GPU training on Windows.
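Before filing that issue, a small diagnostic sketch (not part of the fastMRI code) can confirm which distributed backends the local PyTorch build actually supports:

```python
import torch
import torch.distributed as dist

# Report which distributed backends are compiled into this PyTorch build.
print("distributed available:", dist.is_available())
print("NCCL available:", dist.is_nccl_available())  # False on Windows builds
print("gloo available:", dist.is_gloo_available())  # True on most builds
print("MPI available:", dist.is_mpi_available())
print("CUDA devices:", torch.cuda.device_count())
```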

mmuckley commented 2 years ago

Closing as out of scope.