agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.
Academic Free License v3.0

T5 fine-tuning special tokens #158

Open tombosc opened 3 weeks ago

tombosc commented 3 weeks ago

Hello,

First of all, thank you all for your work.

I am struggling to understand how to fine-tune T5.

In #113, it is mentioned that there are two eos tokens (one for the encoder, one for the decoder). However, I can only see one eos token:

```
(Pdb) tokenizer
T5Tokenizer(name_or_path='Rostlab/prot_t5_xl_uniref50', vocab_size=28, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': [...]
```
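
For reference, here is a quick way to check what the tokenizer actually appends (a minimal sketch, assuming the default `add_special_tokens=True` behaviour of the slow transformers `T5Tokenizer`; the printed tokens are only what I would expect, not verified output):

```python
# Minimal check of what the slow T5Tokenizer appends by default:
# only a single </s> should show up at the end of the encoded sequence.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50")
ids = tokenizer("E V Q L V E S G A E").input_ids
print(tokenizer.convert_ids_to_tokens(ids))
# expected something like:
# ['▁E', '▁V', '▁Q', '▁L', '▁V', '▁E', '▁S', '▁G', '▁A', '▁E', '</s>']
```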

#113 also references another answer from #137, which is strange:

There are many other T5 fine-tuning questions in the GitHub issues, I think because the instructions are not clear.

Combining these two contradictory sources, I think the correct way to do it would be (using the example "E V Q L V E S G A E"):

Is that how the model was trained? If yes, it would be very helpful to put this on the Hugging Face Hub page.

Edit: another question: does the tokenizer include a post-processor? It seems not:

```
(Pdb) tokenizer.post_processor
*** AttributeError: 'T5Tokenizer' object has no attribute 'post_processor'
```

Does this mean that all those extra tokens need to be added manually, before calling tokenizer()?
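
My current guess (please correct me if I am wrong) is that the slow, sentencepiece-based tokenizer adds the eos inside `__call__`/`encode` via `build_inputs_with_special_tokens`, so nothing needs to be added by hand. A minimal sketch of what I mean:

```python
# Sketch (assuming the slow transformers T5Tokenizer): special tokens are
# added inside __call__/encode via build_inputs_with_special_tokens, so no
# manual post-processing of the eos token should be needed.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50")
plain_ids = tokenizer("E V Q L V E S G A E", add_special_tokens=False).input_ids
with_specials = tokenizer.build_inputs_with_special_tokens(plain_ids)
print(with_specials == tokenizer("E V Q L V E S G A E").input_ids)  # expected: True
```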

mheinzinger commented 3 weeks ago

I have to apologize for the inconsistent/missing documentation on pre-training ProtT5. We will try to improve the documentation in the future to avoid wasting people's time. Let me clarify:

However, depending on the amount of data you have available, you can also pre-train on the original T5 pre-training task (span corruption with span lengths > 1). I did this successfully when fine-tuning ProstT5 (see the link above). One thing I would recommend if you do not aim to use your fine-tuned model for generation: simply take the ProtT5 encoder-only part and use the Hugging Face MLM example with MASK-token=
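
To make the span-corruption route concrete, here is a rough sketch of one hand-built training pair using the generic Hugging Face T5 recipe (this assumes the checkpoint's `<extra_id_*>` sentinels, which appear under additional_special_tokens; it is an illustration of the recipe, not a verbatim copy of the ProtT5 pre-training code):

```python
# Rough sketch of one hand-built span-corruption pair (generic HF T5 recipe,
# not the exact ProtT5 pre-training pipeline). The span "L V" of
# "E V Q L V E S G A E" is replaced by a sentinel in the encoder input and
# reproduced in the decoder target.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50")
model = T5ForConditionalGeneration.from_pretrained("Rostlab/prot_t5_xl_uniref50")

enc_text = "E V Q <extra_id_0> E S G A E"   # encoder input; tokenizer appends </s>
tgt_text = "<extra_id_0> L V <extra_id_1>"  # decoder target; tokenizer appends </s>

enc = tokenizer(enc_text, return_tensors="pt")
labels = tokenizer(tgt_text, return_tensors="pt").input_ids

# labels are shifted right internally to build the decoder_input_ids
loss = model(input_ids=enc.input_ids,
             attention_mask=enc.attention_mask,
             labels=labels).loss
loss.backward()
```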

tombosc commented 3 weeks ago

Thank you for the quick and complete answer! Could you also confirm whether I am correct about the post-processor (cf. the last paragraph of my message)?