Closed: ruslankotl closed this issue 2 years ago.
Hi ruslankotl,
Thank you for reporting it. I built a brand-new environment and installed t5chem from scratch, but I still could not reproduce the issue. Could you give me more details? Did you see the progress bar, or did it not show up at all? My progress bar looks like this:
Singularity> t5chem predict --data_dir data/sample/product/ --model_dir model/
prediction: 62%|████████████████████████████████████▉ | 10/16 [03:30<02:04, 20.79s/it]
My final results on this sample dataset:
Singularity> t5chem predict --data_dir data/sample/product/ --model_dir model/
prediction: 100%|███████████████████████████████████████████████████████████| 16/16 [05:24<00:00, 20.29s/it]
Top-1: 68.0% || Invalid 5.90%
Top-2: 76.8% || Invalid 13.65%
Top-3: 79.7% || Invalid 18.33%
Top-4: 81.3% || Invalid 21.62%
Top-5: 82.3% || Invalid 24.54%
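For reference, Top-k accuracy conventionally means that a test reaction counts as correct if the target product appears among the first k ranked predictions. Below is a minimal plain-Python sketch of such metrics. It assumes predictions are ranked lists of SMILES strings compared by exact string match (T5Chem may canonicalize SMILES before comparing), and it assumes "Invalid" is the share of invalid strings among the top-k candidates; that reading is a guess, not taken from the t5chem source. `is_valid` and `toy_is_valid` are hypothetical placeholders for a real SMILES parser such as RDKit's `Chem.MolFromSmiles`.

```python
from typing import Callable, Sequence, Tuple


def topk_metrics(
    predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
    is_valid: Callable[[str], bool],
) -> Tuple[float, float]:
    """Return (top-k accuracy, invalid fraction among the top-k candidates).

    Assumptions: predictions[i] is a ranked list of candidate SMILES for
    targets[i], and a hit is an exact string match (no canonicalization).
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    hits = 0
    invalid = 0
    total_candidates = 0
    for candidates, target in zip(predictions, targets):
        top = list(candidates[:k])  # only the k best-ranked candidates count
        if target in top:
            hits += 1
        invalid += sum(1 for smi in top if not is_valid(smi))
        total_candidates += len(top)
    accuracy = hits / len(targets) if len(targets) else 0.0
    invalid_rate = invalid / total_candidates if total_candidates else 0.0
    return accuracy, invalid_rate


def toy_is_valid(smiles: str) -> bool:
    # Hypothetical stand-in for a real parser: non-empty and no "?" means valid.
    return bool(smiles) and "?" not in smiles


preds = [["CCO", "CC=O"], ["C?C", "c1ccccc1"]]
gold = ["CCO", "c1ccccc1"]
for k in (1, 2):
    acc, inv = topk_metrics(preds, gold, k, toy_is_valid)
    print(f"Top-{k}: {acc:.1%} || Invalid {inv:.2%}")
# Top-1: 50.0% || Invalid 50.00%
# Top-2: 100.0% || Invalid 25.00%
```

Under this reading, accuracy can only grow with k, while the invalid share can grow too if lower-ranked candidates are more often malformed, which is the pattern in the numbers above.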
Note that prediction is expected to run slower than training, because it decodes step by step (one token at a time).
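To illustrate the step-by-step point: in training with teacher forcing, a Transformer decoder processes the whole known target in one parallel forward pass, whereas at prediction time it must generate one token per pass and feed each output back in. The toy sketch below only counts decoder calls for greedy decoding; `make_toy_decoder` is a hypothetical stand-in for a real model, and the character-level tokens are for illustration only. Beam search, commonly used to return several candidates (as with `--num_preds`), would add further work on top of this.

```python
from typing import Callable, List, Tuple

EOS = "</s>"  # end-of-sequence marker (assumed name, for illustration)


def greedy_decode(step: Callable[[List[str]], str], max_len: int) -> Tuple[List[str], int]:
    """Autoregressive greedy decoding: one decoder call per generated token."""
    output: List[str] = []
    calls = 0
    for _ in range(max_len):
        token = step(output)  # the decoder conditions on everything generated so far
        calls += 1
        if token == EOS:
            break
        output.append(token)
    return output, calls


def make_toy_decoder(target: List[str]) -> Callable[[List[str]], str]:
    """Hypothetical decoder that reproduces `target` one token at a time."""
    def step(prefix: List[str]) -> str:
        return target[len(prefix)] if len(prefix) < len(target) else EOS
    return step


product = list("CC(=O)O")  # 7 character-level tokens
tokens, calls = greedy_decode(make_toy_decoder(product), max_len=200)
print("".join(tokens), calls)  # CC(=O)O 8  (one call per token plus one for EOS)
```

A teacher-forced training step covers all 7 positions in a single call, while decoding needs 8 sequential calls that cannot be parallelized across positions.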
Hi,
Thank you for getting back to me. I saw the progress bar, but it stayed at 0%, and after 6 seconds the script exited without any further messages. No prediction file was generated.
t5chem predict --data_dir data/sample/product/ --model_dir model/ --num_preds 5
prediction: 0%| | 0/16 [00:06<?, ?it/s]
Running CPU-only PyTorch 1.7.1 did generate predictions, but it took a long time.
Attempts to use a newer version of PyTorch with cudatoolkit>=11.1 resulted in a tokenization error.
I suspect it may be a hardware issue; I will try running the predictions on a Turing GPU to confirm.
Update: running on a Turing GPU did not help.
Hi Jocelyn,
I ran the code through the debugger and found that the problem was the following RuntimeError, which was never printed to the console:
RuntimeError('CUDA out of memory. Tried to allocate 278.00 MiB (GPU 0; 9.78 GiB total capacity; 6.87 GiB already allocated; 220.56 MiB free; 8.03 GiB reserved in total by PyTorch)')
Setting --batch_size to 32 instead of the default 64 seems to work.
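If a safe batch size is not known in advance, one option is to halve it automatically whenever CUDA runs out of memory and retry the same items. The sketch below is not part of t5chem's CLI: `run_batch` and `fake_run_batch` are hypothetical placeholders for whatever performs prediction on one batch. It detects OOM by message text, since PyTorch 1.7.1 reports it as a plain `RuntimeError`; newer PyTorch releases raise `torch.cuda.OutOfMemoryError`, a subclass of `RuntimeError` with the same message, so the check still applies.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def is_cuda_oom(err: RuntimeError) -> bool:
    # PyTorch reports GPU OOM with a message containing this phrase.
    return "CUDA out of memory" in str(err)


def predict_with_backoff(
    items: Sequence[T],
    run_batch: Callable[[Sequence[T]], List[R]],
    batch_size: int = 64,
    min_batch_size: int = 1,
) -> List[R]:
    """Run run_batch over items, halving batch_size after each CUDA OOM."""
    results: List[R] = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(run_batch(batch))
        except RuntimeError as err:
            if not is_cuda_oom(err) or batch_size <= min_batch_size:
                raise  # unrelated error, or nothing left to shrink: surface it
            batch_size = max(min_batch_size, batch_size // 2)
            print(f"CUDA OOM, retrying with batch_size={batch_size}")
            continue  # retry the same items with the smaller batch
        start += len(batch)
    return results


def fake_run_batch(batch: Sequence[int]) -> List[int]:
    # Hypothetical model that only fits 32 items in GPU memory at once.
    if len(batch) > 32:
        raise RuntimeError("CUDA out of memory. Tried to allocate 278.00 MiB")
    return [x * 2 for x in batch]


out = predict_with_backoff(list(range(100)), fake_run_batch, batch_size=64)
```

Re-raising anything that is not an OOM keeps real errors visible instead of letting the run end silently. In actual PyTorch code one would typically also call `torch.cuda.empty_cache()` before retrying, so that cached blocks are released.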
Thank you for your help.
Update: reducing the batch size from 64 to 32 helps.
I was trying out the T5Chem model by following the provided tutorial. While I managed to train the model via
t5chem train --data_dir data/sample/product/ --output_dir model/ --task_type product --pretrain models/pretrain/simple/ --num_epoch 30
, subsequent prediction of products via
t5chem predict --data_dir data/sample/product/ --model_dir model/
left the prediction progress bar stuck for 6 seconds, after which it exited without returning any predictions. The dependencies I had to install myself: python=3.8, pytorch=1.7.1 with cudatoolkit=11.0, on an RTX 3080.
Installing CPU-only PyTorch, however, does return predictions, as does replicating this tutorial in a Python shell.
Thank you for your help