agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.
Academic Free License v3.0
1.13k stars · 153 forks

Fine-tuning on multiple GPUs in parallel mode #63

Closed denysm closed 1 year ago

denysm commented 3 years ago

Hello, thanks a lot for your great work! My question is about ProtTrans/Fine-Tuning/ProtBert-BFD-FineTuning-PyTorchLightning-Localization.ipynb. It took ~8 hours of compute time to rerun this Jupyter notebook on a 24 GB P40 node with the not-so-large training dataset you provided (~3k examples). Fine-tuning on larger datasets with 100k examples would require considerably more time.

Is there a way to run the fine-tuning process on parallel GPUs, for example 4 or 8 GPUs? 1) What does one need to change in the code to run it on multiple GPUs? 2) What is the largest number of categories that you have fine-tuned the model for?

Thanks, Denys

denysm commented 2 years ago

There is a line in the code that specifies the number of GPUs:

```python
# gpu/tpu args
parser.add_argument("--gpus", type=int, default=1, help="How many gpus")
```

When changing this parameter to, for example, 4, all 4 GPUs allocated via the following are visible:

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '4,5,6,7'
```
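One subtlety worth noting (an aside, not from the notebook itself): `CUDA_VISIBLE_DEVICES` renumbers the devices the process can see, so after restricting it to `'4,5,6,7'` the framework addresses those cards as 0-3, and the trainer should be given the *count* of visible GPUs rather than their physical IDs. A minimal sketch:

```python
import os

# Must be set before the CUDA runtime is initialized (i.e. before importing torch).
# Physical GPUs 4-7 are then renumbered 0-3 inside this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"

# Pass the count of visible devices to the trainer, not the physical IDs.
num_gpus = len(os.environ["CUDA_VISIBLE_DEVICES"].split(","))
print(num_gpus)  # 4
```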

However, only one GPU is used for training once it reaches:

```python
# ------------------------
# 6 START TRAINING
# ------------------------
trainer.fit(model)
```
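For what it's worth, setting `--gpus` alone is typically not enough in PyTorch Lightning: the `Trainer` also needs a distributed backend selected, otherwise it falls back to a single device. A hedged sketch (the exact argument name depends on the Lightning version: `distributed_backend` in older releases, `accelerator`/`strategy` later):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--gpus", type=int, default=1, help="How many gpus")
args = parser.parse_args(["--gpus", "4"])

# Hypothetical trainer configuration; with pytorch_lightning installed this
# would be passed as pl.Trainer(**trainer_kwargs) before trainer.fit(model).
trainer_kwargs = {
    "gpus": args.gpus,
    "distributed_backend": "ddp",  # argument name varies across PL versions
}
```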

Could you also clarify the batch size variable?

```python
# Batching
parser.add_argument("--batch_size", default=1, type=int, help="Batch size to be used.")
```

I get an error when changing the batch size to larger numbers.
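If larger batch sizes fail with memory errors, gradient accumulation is a common workaround: keep the per-step batch small and accumulate gradients over several steps before each optimizer update. The effective batch size is just the product of the three factors (a sketch; in PyTorch Lightning the accumulation count is the `accumulate_grad_batches` `Trainer` argument):

```python
def effective_batch_size(per_gpu_batch, accumulate_steps, num_gpus):
    """Number of samples contributing to each optimizer update."""
    return per_gpu_batch * accumulate_steps * num_gpus

# batch_size=1 per GPU, accumulating over 16 steps, on 4 GPUs:
print(effective_batch_size(1, 16, 4))  # 64
```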

ratthachat commented 2 years ago

Hi, you can also try apex mixed precision with an RTX GPU. FYI, when I use 1 GPU of an RTX 3090 with apex mixed precision, I get around a 5-6x speed-up compared to the P100 (full precision) of a Colab Pro/Kaggle notebook with batch_size=1, so I could finish one epoch in a few minutes.
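The memory side of this speed-up is easy to quantify: fp16 stores each activation element in 2 bytes instead of fp32's 4, which is what frees room for larger batches or longer sequences on the same card. A rough back-of-envelope sketch (ignoring optimizer state and attention buffers; the 1536 length is from the notebook, the 1024 hidden size is an assumed illustrative value):

```python
def activation_bytes(seq_len, hidden_size, bytes_per_elem):
    # Bytes for one layer's hidden states at batch size 1.
    return seq_len * hidden_size * bytes_per_elem

fp32 = activation_bytes(1536, 1024, 4)  # full precision
fp16 = activation_bytes(1536, 1024, 2)  # mixed/half precision
print(fp32 // fp16)  # 2
```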

The other option is to reduce the protein sequence length. In the example notebook the default length is 1536; you can try limiting it to 800, which still covers around 80% of the data, for example.
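Truncation itself is a one-liner on the raw sequence; with a HuggingFace tokenizer the same effect comes from passing `truncation=True, max_length=800`. A minimal sketch (the 800 cutoff and the ~80% coverage figure are from the comment above, not recomputed here):

```python
def truncate_sequence(seq, max_len=800):
    """Keep at most max_len residues; shorter sequences pass through unchanged."""
    return seq[:max_len]

long_seq = "M" * 1536
print(len(truncate_sequence(long_seq)))  # 800
print(len(truncate_sequence("MKV")))     # 3
```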