Multi-GPU training is getting stuck in testing phase or throwing EOFError: Ran out of input

RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library

https://recbole.io/

MIT License


Multi-GPU training is getting stuck in testing phase or throwing EOFError: Ran out of input #1811

Open diesel248 opened 1 year ago

diesel248 commented 1 year ago

Describe the bug

Distributed training either hangs in the testing phase after loading the saved model or throws EOFError: Ran out of input. To reproduce, run the following command from source:

python run_recbole.py --model=SASRec --loss_type=BPR --dataset=ml-100k --nproc=2 --gpu_id=0,1

Desktop (please complete the following information):

- OS: Linux
- RecBole: 1.1.1
- Python: 3.9.13
- PyTorch: 1.12.1
- cudatoolkit: 11.3.1

zhengbw0324 commented 1 year ago

Hello @diesel248! I ran the same command as you but could not reproduce the problem. We recommend downloading our latest code from GitHub and following our documentation to try again.

christopheralex commented 9 months ago

@zhengbw0324 I am facing the same issue. I have a single machine with 4 GPUs and am using the same command with --nproc=4 --gpu_id='0,1,2,3'. Is there something I am missing? If I use --nproc=1, training runs on only one GPU.
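
A note for anyone debugging this: EOFError: Ran out of input is the message Python's pickle module raises when it is asked to load a file with zero bytes, and torch.load uses pickle by default. One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that one process tries to read the saved model before another process has finished writing it. The sketch below uses only the standard library, not RecBole or PyTorch; the file name and both helper functions are hypothetical. It reproduces the error message on an empty file and shows a write-then-rename pattern that keeps readers from ever seeing a partially written file.

```python
import os
import pickle
import tempfile


def load_error_for_empty_file() -> str:
    """Return the message pickle raises when asked to load a zero-byte file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "checkpoint.pkl")  # hypothetical file name
        open(path, "wb").close()  # empty file, as if a writer had not written yet
        try:
            with open(path, "rb") as f:
                pickle.load(f)
        except EOFError as exc:
            return str(exc)
    return ""


def atomic_dump(obj, path: str) -> None:
    """Write to a temporary file in the same directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
        os.replace(tmp_path, path)  # readers see either the old file or the full new one
    except BaseException:
        os.remove(tmp_path)  # clean up the temporary file if anything failed
        raise


if __name__ == "__main__":
    print(load_error_for_empty_file())
```

If the error message printed here matches what you see, checking whether the saved model file is empty or still being written when the other processes load it may help narrow things down. This does not explain the hang in the testing phase, which may have a different cause.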