agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs, using Transformer models.

How was ProtT5 trained? #148

Closed: Sspandau closed this 2 months ago

Sspandau commented 5 months ago

Dear Rostlab,

Do you have a notebook or code that shows how protT5_xl_uniref50 was trained?

mheinzinger commented 5 months ago

No, unfortunately I no longer have a working version of the old code, sorry. It was based on this old T5 training code, which is deprecated by now: https://github.com/google-research/text-to-text-transfer-transformer/tree/main

However, for my ProstT5 fine-tuning (which involved continued span-based pre-training with ProtT5 as the starting point), I successfully used this script: https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_t5_mlm_flax.py

Our original ProtT5 pre-training objective was closer to BERT pre-training: we only ever corrupted spans of length 1 and reconstructed the full, unmasked sequence in the output, rather than generating only those tokens that were replaced by spans. For the continued pre-training of ProtT5, I instead simply followed the original T5 span-corruption strategy implemented in the run_t5_mlm_flax.py linked above, which worked fine; the sketch below contrasts the two objectives.
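To make that difference concrete, here is a minimal, dependency-free Python sketch of the two denoising objectives on a toy sequence. The mask token name, the masking rate and the span lengths are illustrative assumptions, not the exact settings used for ProtT5 or ProstT5 training.

```python
# Toy comparison of the two denoising objectives discussed above.
# Masked positions/spans are hard-coded for clarity; real training samples them randomly.

SEQ = list("MKTAYIAKQR")  # toy protein sequence, one character per residue


def bert_style(seq, masked_positions):
    """ProtT5-style objective: corrupt single residues and let the
    decoder reconstruct the FULL, unmasked sequence."""
    inp = ["<mask>" if i in masked_positions else aa for i, aa in enumerate(seq)]
    return " ".join(inp), " ".join(seq)


def t5_span_corruption(seq, spans):
    """Original T5 objective (what run_t5_mlm_flax.py implements): replace
    contiguous spans with sentinel tokens and generate ONLY the masked spans."""
    inp, target = [], []
    i, sid = 0, 0
    while i < len(seq):
        span = next(((s, e) for (s, e) in spans if s == i), None)
        if span is not None:
            s, e = span
            sentinel = f"<extra_id_{sid}>"
            inp.append(sentinel)
            target.append(sentinel)
            target.extend(seq[s:e])
            sid += 1
            i = e
        else:
            inp.append(seq[i])
            i += 1
    target.append("</s>")
    return " ".join(inp), " ".join(target)


if __name__ == "__main__":
    print(bert_style(SEQ, masked_positions={2, 7}))
    # ('M K <mask> A Y I A <mask> Q R', 'M K T A Y I A K Q R')
    print(t5_span_corruption(SEQ, spans=[(2, 4), (7, 8)]))
    # ('M K <extra_id_0> Y I A <extra_id_1> Q R', '<extra_id_0> T A <extra_id_1> K </s>')
```

In the BERT-style variant the decoder has to emit every residue, while the T5-style target only contains the sentinels plus the residues they replaced, so the decoder side stays much shorter.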

If you want to start training from scratch, maybe this repo helps (I have not tried it, though): https://github.com/PiotrNawrot/nanoT5
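If you do go the continued pre-training route rather than from scratch, you can start from the published checkpoint. Below is a rough PyTorch sketch (not the Flax script above) of loading Rostlab/prot_t5_xl_uniref50 and computing one span-denoising loss. The hard-coded corruption, and the assumption that the ProtT5 tokenizer resolves the <extra_id_*> sentinels as shown, are mine to keep the example short, so double-check them against the tokenizer's special tokens before relying on this.

```python
import re
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the public ProtT5 checkpoint (~3B parameters) as the starting point
# for continued denoising pre-training.
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5ForConditionalGeneration.from_pretrained("Rostlab/prot_t5_xl_uniref50")

# ProtT5 expects residues separated by spaces, with rare amino acids mapped to X.
sequence = "M K T A Y I A K Q R Q I S F V K S H F S R Q L E E R"
sequence = re.sub(r"[UZOB]", "X", sequence)

# Hand-crafted span corruption, for illustration only; a real run would sample spans
# randomly (e.g. like the data collator inside run_t5_mlm_flax.py does). Verify that
# the sentinel tokens below appear in tokenizer.additional_special_tokens.
corrupted = "M K <extra_id_0> Y I A K Q R Q I S F V K <extra_id_1> S R Q L E E R"
targets = "<extra_id_0> T A <extra_id_1> S H F"  # tokenizer appends the EOS token itself

enc = tokenizer(corrupted, return_tensors="pt")
labels = tokenizer(targets, return_tensors="pt").input_ids

with torch.no_grad():  # drop no_grad and add an optimizer step for actual training
    out = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)
print(out.loss)
```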