RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

[🐛BUG] full_sort_scores cuda error #1964

Open IFShirokikh opened 10 months ago

IFShirokikh commented 10 months ago

Describe the bug: I am able to train the model, but I cannot get predictions on the test sample.

To Reproduce: I have uploaded the following files to https://drive.google.com/drive/folders/1YLS0R41sWbDvL3_CxEsSmc9n0UbNXwbH:

  1. "hh.yaml"
  2. Jupyter notebook "Recbole example.ipynb" showing the error (I stopped training after 1 epoch to reproduce the error faster)
  3. data for training: "hh_recbole"
  4. saved model: "saved"

Expected behavior: I wanted to reproduce the case study at https://recbole.io/docs/user_guide/usage/case_study.html

Screenshots: [two images]

Desktop:

IFShirokikh commented 10 months ago

Rerunning the failing cell leads to the following result:

[image]

IFShirokikh commented 10 months ago

A similar error also occurred after several epochs of training, when the model tried to load the best checkpoint saved so far. The problem has therefore become critical: it is impossible to either train or test the model. Attached: train log.txt

BoXiaohe commented 10 months ago

Thanks for your attention to RecBole! As for your problem, you can try the advice below.

  1. CUDA compatibility: Ensure that your GPU is CUDA-compatible, and check the CUDA versions supported by each PyTorch release in the official PyTorch documentation at https://pytorch.org/get-started/previous-versions/.
  2. PyTorch installation: Verify that the PyTorch build you have installed corresponds to your CUDA version.

Hope this helps!
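The two checks above can be scripted. What follows is a minimal sketch, not part of RecBole: the helper name `cuda_report` is made up for illustration, and it only calls standard PyTorch attributes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.get_device_name()`, `torch.cuda.get_device_capability()`). It assumes the GPU of interest is device 0.

```python
import importlib.util


def cuda_report():
    """Collect basic PyTorch/CUDA facts into a dict (hypothetical helper)."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    # CUDA version PyTorch was built against; None for CPU-only builds.
    report["built_with_cuda"] = torch.version.cuda
    # True only if a usable GPU and a matching driver are present.
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # Assumption: the GPU in question is device index 0.
        report["device_name"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)
    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

If `built_with_cuda` prints `None`, the installed PyTorch is a CPU-only build. If it prints a version but `cuda_available` is `False`, the driver or GPU likely does not match that build, and the previous-versions page linked above lists install commands for each CUDA version.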