RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

[🐛BUG] torch.distributed.barrier() in recbole/data/dataset/dataset.py causes errors for single GPU #1989

Closed jimmy-academia closed 7 months ago

jimmy-academia commented 8 months ago

Describe the bug: torch.distributed.barrier() causes errors when running on a single GPU.

To Reproduce

# Minimal script that triggers the error on a single-GPU machine
from recbole.config import Config
from recbole.data import create_dataset

dataset = 'ml-100k'
config = Config(model='LightGCN', dataset=dataset)
config['data_path'] = 'cache_data/ml-100k/raw'
dataset = create_dataset(config)  # raises RuntimeError inside _download (traceback below)

Expected behavior: dataset creation should complete without errors. Instead, the following exception is raised:

    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

caused by

site-packages/recbole/data/dataset/dataset.py", line 251, in _download
    torch.distributed.barrier()

After commenting out line 251 in recbole/data/dataset/dataset.py, there is no error.
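
Another workaround that avoids editing the installed package is to initialize a one-process group before creating the dataset, so the barrier has a process group to synchronize. A sketch, assuming the gloo backend is available and that the address/port values below are free on your machine:

import os
import torch.distributed as dist

# Set the rendezvous address for a single-process group;
# the concrete values here are arbitrary placeholders.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

With a world size of 1, torch.distributed.barrier() returns immediately instead of raising.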

Desktop: single GPU

Yilu114 commented 7 months ago

Issue Description: The torch.distributed.barrier() function causes errors when running RecBole on a single GPU. After commenting out the line containing torch.distributed.barrier() in the recbole/data/dataset/dataset.py file, the error disappears.

Root Cause: The issue arises because torch.distributed.barrier() is intended for synchronizing multiple processes, commonly used in multi-GPU or distributed training scenarios. When running on a single GPU, calling this function results in an error because the default process group is not initialized.
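
A safer pattern than deleting the call outright is to guard it so that it only runs when a process group actually exists. A minimal sketch of that guard (not RecBole's current code, just the standard torch.distributed idiom):

import torch.distributed as dist

def barrier_if_distributed():
    # Synchronize workers only when a distributed process group is active.
    # On a single GPU, where init_process_group was never called, this is a no-op.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()

Wrapping the call at line 251 in a check like this would preserve multi-GPU synchronization while avoiding the single-GPU error.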

Solution: Since you are running on a single GPU, consider the following steps:

  1. Comment Out the Barrier Call:

    • In the recbole/data/dataset/dataset.py file, locate line 251 containing torch.distributed.barrier().
    • Comment out this line to avoid the error. However, be aware that this may impact other parts of the code that rely on barrier synchronization.
  2. Further Investigation:

    • To understand the root cause more thoroughly, you can explore the RecBole source code, specifically the relevant section in recbole/data/dataset/dataset.py.
    • Check if there are any other places where torch.distributed.barrier() is called and evaluate whether they are necessary for your specific use case; a quick scanning sketch follows this list.
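
For step 2, a short script can list every torch.distributed.barrier() call in your installed copy of RecBole. This is a sketch; the package location depends on your environment:

import pathlib
import recbole

# Walk the installed recbole package and print each line that
# mentions torch.distributed.barrier(), with file and line number.
pkg_root = pathlib.Path(recbole.__file__).parent
for py_file in sorted(pkg_root.rglob("*.py")):
    lines = py_file.read_text(encoding="utf-8").splitlines()
    for lineno, line in enumerate(lines, start=1):
        if "torch.distributed.barrier" in line:
            print(f"{py_file.relative_to(pkg_root)}:{lineno}: {line.strip()}")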

Feel free to explore the RecBole documentation at https://recbole.io/ and adapt the solution to your specific needs! 🚀