Issue Description:
The torch.distributed.barrier() function causes errors when running RecBole on a single GPU. After commenting out the line containing torch.distributed.barrier() in the recbole/data/dataset/dataset.py file, the error disappears.
Root Cause:
torch.distributed.barrier() is intended for synchronizing multiple processes and is commonly used in multi-GPU or distributed training scenarios. When running on a single GPU, calling this function raises an error because the default process group has not been initialized.
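The failure mode can be reproduced outside RecBole. The snippet below is a minimal standalone sketch (assuming only a working PyTorch installation, not RecBole itself) showing that a collective call made before any process group is initialized fails:

```python
# Minimal standalone sketch of the failure mode (not RecBole code):
# calling a collective op such as barrier() before init_process_group()
# fails because no default process group exists.
import torch.distributed as dist

try:
    dist.barrier()  # no process group has been initialized yet
except (RuntimeError, ValueError) as exc:
    # PyTorch reports that the default process group is not initialized
    print(f"barrier() failed: {exc}")
```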
Solution: Since you are running on a single GPU, consider the following steps:

1. Comment out the barrier call: In the recbole/data/dataset/dataset.py file, locate line 251 containing torch.distributed.barrier() and comment it out (a guarded alternative is sketched after this list).
2. Further investigation: Check for other places in recbole/data/dataset/dataset.py where torch.distributed.barrier() is called and evaluate whether they are necessary for your specific use case.
3. Documentation and references: The PyTorch distributed documentation explains process groups, init_process_group(), and collective operations such as barrier().

Feel free to explore the provided documentation and adapt the solution to your specific needs! 🚀
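Instead of deleting the line outright, one option is to guard it so it only runs when a process group actually exists. This is a sketch of a hypothetical patch, not RecBole's official fix; it only relies on standard torch.distributed query functions:

```python
# Hypothetical guarded replacement for the bare barrier() call (not the
# official RecBole fix): synchronize only when a process group exists,
# so the same code path works for single-GPU and distributed runs.
import torch.distributed as dist

if dist.is_available() and dist.is_initialized():
    dist.barrier()
```

With such a guard, single-GPU runs skip the synchronization entirely while multi-GPU runs keep the original behavior.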
Describe the bug
torch.distributed.barrier() causes errors for a single GPU.

To Reproduce
Run RecBole on a single GPU.

Expected behavior
Training runs on a single GPU without the error.

The error is caused by torch.distributed.barrier(). After commenting out line 251 in recbole/data/dataset/dataset.py, there is no error.

Desktop (please complete the following information):
(single GPU)