RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

[🐛BUG] full_sort_scores cuda error #1964

Open IFShirokikh opened 8 months ago

IFShirokikh commented 8 months ago

Describe the bug: I am able to train the model, but I cannot get predictions on the test set.

To Reproduce: I have uploaded the following files to https://drive.google.com/drive/folders/1YLS0R41sWbDvL3_CxEsSmc9n0UbNXwbH:

  1. "hh.yaml"
  2. Jupyter notebook "Recbole example.ipynb" showing the error (I stopped training after 1 epoch to reproduce the error faster)
  3. data for training: "hh_recbole"
  4. saved model: "saved"

Expected behavior: I wanted to reproduce the case study at https://recbole.io/docs/user_guide/usage/case_study.html
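For context, the linked case study boils down to roughly the following flow. This is a minimal sketch rather than the exact notebook code from this report: the wrapper name `run_case_study` and its arguments are hypothetical, and the checkpoint path and user tokens must come from your own run.

```python
def run_case_study(model_file, user_tokens, k=10):
    """Load a saved RecBole checkpoint and score all items for the given users.

    `model_file` is the path to a .pth file written during training and
    `user_tokens` is a list of raw user IDs as they appear in the dataset.
    Both are placeholders for illustration.
    """
    # Imported lazily so this sketch can be read or imported without RecBole.
    from recbole.quick_start import load_data_and_model
    from recbole.utils.case_study import full_sort_scores, full_sort_topk

    config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
        model_file=model_file
    )

    # Map external user tokens to the internal IDs the model expects.
    uid_series = dataset.token2id(dataset.uid_field, user_tokens)

    # Top-k items per user, then the full score matrix over all items.
    topk_score, topk_iid_list = full_sort_topk(
        uid_series, model, test_data, k=k, device=config["device"]
    )
    scores = full_sort_scores(uid_series, model, test_data, device=config["device"])

    # Convert internal item IDs back to the dataset's item tokens.
    topk_items = dataset.id2token(dataset.iid_field, topk_iid_list.cpu())
    return topk_score, topk_items, scores
```

The CUDA error reported here is raised at the `full_sort_scores` step of this flow.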

Screenshots: [two images]

IFShirokikh commented 8 months ago

Re-running the failing cell produces the following result:

[image]

IFShirokikh commented 8 months ago

A similar error occurred after several epochs of training, when the model tried to load its best saved checkpoint. The problem is therefore critical: it is impossible either to train or to test the model. Log attached: train log.txt

BoXiaohe commented 8 months ago

Thanks for your interest in RecBole! For this problem, you can try the advice below.

  1. CUDA compatibility: Make sure that your GPU supports CUDA, and check the official PyTorch documentation for the CUDA versions each release supports: https://pytorch.org/get-started/previous-versions/.
  2. PyTorch installation: Verify that the PyTorch version you installed was built for your CUDA version.

Hope this helps!
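Both checks can be run from Python. The snippet below is a minimal sketch: the helper name `cuda_report` is hypothetical, and it only reports what the installed PyTorch build sees. It does not diagnose the error itself.

```python
import importlib.util


def cuda_report():
    """Collect basic PyTorch/CUDA facts about the current environment."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    # CUDA version this PyTorch build was compiled against (None for CPU-only builds).
    report["built_with_cuda"] = torch.version.cuda
    # True only if a usable GPU and a compatible driver are both present.
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is `None` or `cuda_available` is `False`, the installed PyTorch build cannot use the GPU. Reinstalling a build that matches your CUDA setup, as described in the link above, is the usual next step.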