Predictions are not generated via CLI on RTX3080 and cudatoolkit 11.0

ruslankotl commented 2 years ago

UPD: reducing batch size from 64 to 32 helps

I was trying out the T5Chem model by following the tutorial. I managed to train the model with

t5chem train --data_dir data/sample/product/ --output_dir model/ --task_type product --pretrain models/pretrain/simple/ --num_epoch 30

However, the subsequent product prediction with

t5chem predict --data_dir data/sample/product/ --model_dir model/

left the progress bar stuck for 6 seconds and then returned no predictions at all.

The dependencies I had to install myself: python=3.8 and pytorch=1.7.1 with cudatoolkit=11.0, on an RTX 3080.

Installing CPU-only pytorch, however, does return the predictions, as does replicating the tutorial in a Python shell.
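For anyone trying to reproduce this, a small sketch like the one below can collect the version details mentioned above in one place. The helper name describe_env is hypothetical and not part of t5chem; it only uses standard PyTorch attributes and falls back gracefully when PyTorch is not installed.

```python
import platform


def describe_env():
    """Collect the version details that matter for a GPU-related report."""
    info = {"python": platform.python_version(), "torch": None}
    try:
        import torch  # optional: the sketch still runs if PyTorch is absent
    except ImportError:
        return info

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        # Device name and total memory of the first visible GPU.
        info["device"] = torch.cuda.get_device_name(0)
        props = torch.cuda.get_device_properties(0)
        info["total_memory_gib"] = round(props.total_memory / 2**30, 2)
    return info


if __name__ == "__main__":
    for key, value in describe_env().items():
        print(f"{key}: {value}")
```

Pasting this output into a report makes it easier to tell a CPU-only build from a CUDA build at a glance.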

Thank you for your help

HelloJocelynLu commented 2 years ago

Hi ruslankotl,

Thank you for reporting it. I built a brand new environment and installed t5chem from scratch, but still failed to reproduce the issue. May I know more details: did you see the progress bar, or did it not show at all? My progress bar is shown below:

Singularity> t5chem predict --data_dir data/sample/product/ --model_dir model/                              
prediction:  62%|████████████████████████████████████▉                      | 10/16 [03:30<02:04, 20.79s/it]

My final results on this sample dataset:

Singularity> t5chem predict --data_dir data/sample/product/ --model_dir model/                              
prediction: 100%|███████████████████████████████████████████████████████████| 16/16 [05:24<00:00, 20.29s/it]
Top-1: 68.0% || Invalid 5.90%
Top-2: 76.8% || Invalid 13.65%
Top-3: 79.7% || Invalid 18.33%
Top-4: 81.3% || Invalid 21.62%
Top-5: 82.3% || Invalid 24.54%

Note that prediction is expected to be slower than training, as it proceeds step by step.

ruslankotl commented 2 years ago

Hi, thank you for getting back to me. I saw the progress bar, but it was stuck at 0% and the script exited after 6 seconds without any further messages. No prediction file was generated.

t5chem predict --data_dir data/sample/product/ --model_dir model/ --num_preds 5
prediction: 0%| | 0/16 [00:06<?, ?it/s]

Running CPU-only pytorch 1.7.1 did generate predictions, but it took a long time. Attempts to use a newer version of pytorch with cudatoolkit>=11.1 resulted in a tokenization error. I suspect it may be a hardware issue; I will try running on a Turing GPU to confirm.

Update: Turing GPU did not help

ruslankotl commented 2 years ago

Hi Jocelyn,

I have run the code through the debugger, and the problem turned out to be a RuntimeError:

RuntimeError('CUDA out of memory. Tried to allocate 278.00 MiB (GPU 0; 9.78 GiB total capacity; 6.87 GiB already allocated; 220.56 MiB free; 8.03 GiB reserved in total by PyTorch)')
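To make the numbers in that message easier to read, here is a minimal sketch that pulls the sizes out of the error text and compares the requested allocation with the free memory. The parse_oom helper and its regular expressions are my own illustration, written against the exact wording of the message above; other PyTorch versions may phrase it differently.

```python
import re

MESSAGE = (
    "CUDA out of memory. Tried to allocate 278.00 MiB "
    "(GPU 0; 9.78 GiB total capacity; 6.87 GiB already allocated; "
    "220.56 MiB free; 8.03 GiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Return the sizes quoted in a CUDA OOM message, all converted to MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[name] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return sizes


sizes = parse_oom(MESSAGE)
# The allocation fails because the request is larger than what is still free.
shortfall = sizes["requested"] - sizes["free"]
print(f"requested {sizes['requested']:.2f} MiB, free {sizes['free']:.2f} MiB")
print(f"shortfall {shortfall:.2f} MiB")
```

The request (278.00 MiB) exceeds the free memory (220.56 MiB) by about 57 MiB, which is consistent with a smaller batch fitting on the same card.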

Setting --batch_size to 32 instead of default 64 seems to work.
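For scripted use from Python, the same workaround could be automated by retrying with a smaller batch whenever an out-of-memory error appears. This is only a sketch under assumptions: predict_with_backoff and fake_predict are hypothetical names, fake_predict stands in for a real prediction call, and nothing here is part of the t5chem API. It relies on the fact that the error above surfaced as a plain RuntimeError containing the text "out of memory".

```python
def predict_with_backoff(predict_fn, batch_size=64, min_batch_size=1):
    """Call predict_fn(batch_size), halving the batch size after each OOM error."""
    while True:
        try:
            return batch_size, predict_fn(batch_size)
        except RuntimeError as err:
            # Re-raise anything that is not an OOM, or if we cannot shrink further.
            if "out of memory" not in str(err) or batch_size <= min_batch_size:
                raise
            batch_size //= 2
            print(f"CUDA OOM, retrying with batch_size={batch_size}")


def fake_predict(batch_size):
    """Hypothetical stand-in for a real prediction run: fails above 32."""
    if batch_size > 32:
        raise RuntimeError("CUDA out of memory. Tried to allocate 278.00 MiB")
    return f"predictions for batch_size={batch_size}"


used, result = predict_with_backoff(fake_predict)
print(used, result)
```

With a real model it would probably also be worth calling torch.cuda.empty_cache() before each retry so that cached blocks from the failed attempt are released.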

Thank you for your help.
