Open tombosc opened 3 weeks ago
I have to apologize for inconsistent/missing documentation on pre-training ProtT5. We will try to improve documentation in the future to avoid wasting people's time. Let me clarify:
There is a single EOS token, </s>, which I used recently successfully for finetuning by appending it to both the input sequence and the output sequence. I added the <pad> token as a very first token to the decoder input when doing ProstT5 finetuning. So irrespective of the original ProtT5 pre-training, I would add it if you need it for your use-case, i.e., if you aim for generative capability.
Re postprocessor: I simply took the huggingface T5 pre-training example and used the dataloader from there: https://github.com/mheinzinger/ProstT5/blob/main/scripts/pretraining_scripts/pretraining_stage1_MLM.py
So in summary, if you want to stick closely to the original pre-training, you can use:
Input: E V <extra_id_0> L <extra_id_0> E S G <extra_id_0> E </s>
Label: <pad> E V Q L V E S G A E </s>
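To make this concrete, here is a minimal sketch (not the actual pre-training code) of how such a pair could be fed to the huggingface model; the checkpoint name Rostlab/prot_t5_xl_uniref50 and the use of T5ForConditionalGeneration are assumptions for illustration only:

```python
# Minimal sketch (not the original training script): one input/label pair in the
# format above, fed to T5ForConditionalGeneration to get a denoising loss.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5ForConditionalGeneration.from_pretrained("Rostlab/prot_t5_xl_uniref50")

masked = "E V <extra_id_0> L <extra_id_0> E S G <extra_id_0> E"  # corrupted input
target = "E V Q L V E S G A E"                                   # original sequence

enc = tokenizer(masked, return_tensors="pt")               # tokenizer appends </s>
labels = tokenizer(target, return_tensors="pt").input_ids  # full sequence + </s>

# The decoder input (<pad> + right-shifted labels) is built internally by the model
# from `labels`, so the leading <pad> does not have to be added by hand here.
out = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)
print(float(out.loss))
```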
However, depending on the amount of data you have available, you can also pre-train on the original T5 pre-training task (span corruption with span lengths > 1). I did this successfully in ProstT5 finetuning (see the link above).
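For reference, a rough sketch of what that span-corruption format looks like; the real dataloader is in the script linked above, and the spans here are hand-picked purely for illustration:

```python
# Rough sketch of the standard T5 span-corruption format (span lengths > 1).
# Each removed span is replaced by one sentinel in the input, and the label
# contains the sentinels followed by the removed residues.
def span_corrupt(tokens, spans):
    """spans: list of (start, length) pairs; returns (input_tokens, label_tokens)."""
    inp, lab, i, sentinel = [], [], 0, 0
    masked = {start: length for start, length in spans}
    while i < len(tokens):
        if i in masked:                          # replace the whole span by one sentinel
            inp.append(f"<extra_id_{sentinel}>")
            lab.append(f"<extra_id_{sentinel}>")
            lab.extend(tokens[i:i + masked[i]])  # label keeps the removed residues
            i += masked[i]
            sentinel += 1
        else:
            inp.append(tokens[i])
            i += 1
    lab.append(f"<extra_id_{sentinel}>")         # closing sentinel, as in T5 pre-training
    return inp + ["</s>"], lab + ["</s>"]

src = "E V Q L V E S G A E".split()
inp, lab = span_corrupt(src, spans=[(2, 3), (8, 1)])   # mask "Q L V" and "A"
print(" ".join(inp))   # E V <extra_id_0> E S G <extra_id_1> E </s>
print(" ".join(lab))   # <extra_id_0> Q L V <extra_id_1> A <extra_id_2> </s>
```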
One thing I would recommend if you do not aim to use your finetuned model for generation: simply take the ProtT5 encoder-only part and use the huggingface MLM example with MASK-token=<extra_id_0>.
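Something along these lines (only a sketch under my assumptions, e.g. a freshly initialized linear head on top of the encoder and <extra_id_0> as the mask token; the huggingface MLM example script is more complete):

```python
# Sketch of the encoder-only MLM route: ProtT5 encoder + a small LM head,
# with the loss computed on masked positions only. This is not the huggingface
# example script itself, and the linear head here still needs to be trained.
import torch
from torch import nn
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
encoder = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")
lm_head = nn.Linear(encoder.config.d_model, tokenizer.vocab_size)

mask_id = tokenizer.convert_tokens_to_ids("<extra_id_0>")   # assumed MASK token

batch = tokenizer(["E V Q L V E S G A E"], return_tensors="pt")
labels = batch.input_ids.clone()

# For this toy example, mask the residue at position 2 ("Q"); in practice you
# would randomly mask ~15% of the residues per sequence.
masked = torch.zeros_like(labels, dtype=torch.bool)
masked[0, 2] = True
batch.input_ids[masked] = mask_id
labels[~masked] = -100                                      # -100 is ignored by the loss

hidden = encoder(input_ids=batch.input_ids,
                 attention_mask=batch.attention_mask).last_hidden_state
logits = lm_head(hidden)
loss = nn.CrossEntropyLoss()(logits.view(-1, logits.size(-1)), labels.view(-1))
print(float(loss))
```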
Thank you for the quick and complete answer! And could you please confirm that I am correct about the postprocessor (cf. the last paragraph of my message)?
Hello,
Firstly, thank you all for your work.
I am struggling to understand how to fine-tune T5.
In #113, it is mentioned that there are two EOS tokens (one for the encoder, one for the decoder). However, I can only see one EOS token.
#113 also references another answer from #137, which is strange.
There are many other T5 fine-tuning questions in the GitHub issues, I think because the instructions are not clear.
Combining these two contradictory sources, I think the correct way to do it would be (using the example "E V Q L V E S G A E"):
E V <extra_id_0> L <extra_id_1> E S G <extra_id_2> E </s>
<pad> E V Q L V E S G A E </s>
Is that how the model was trained? If yes, it would be very helpful to put this on the huggingface hub page.
edit: Another question: does the tokenizer include a postprocessor? It seems not:
(Pdb) tokenizer.post_processor
*** AttributeError: 'T5Tokenizer' object has no attribute 'post_processor'
Does it mean all those extra tokens need to be added manually before calling tokenizer()?
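edit 2: a small sketch of what I mean (assuming the Rostlab/prot_t5_xl_uniref50 checkpoint), i.e. checking which special tokens the tokenizer inserts on its own:

```python
# Inspect what the (slow) T5Tokenizer inserts by itself: it has no post_processor
# attribute, but it should still append </s>; the <extra_id_*> sentinels only end
# up in the output because they are written into the input text here.
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
ids = tok("E V <extra_id_0> L").input_ids
print(tok.convert_ids_to_tokens(ids))
```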