Open tombosc opened 3 weeks ago
I have to apologize for inconsistent/missing documentation on pre-training ProtT5. We will try to improve documentation in the future to avoid wasting people's time. Let me clarify:
There is a single EOS token, </s>, which I used recently successfully for finetuning by appending it to both the input sequence and the output sequence. I added the <pad> token as a very first token to the decoder input when doing ProstT5 finetuning. So irrespective of the original ProtT5 pre-training, I would add it if you need it for your use-case, i.e., if you aim for generative capability.
Re postprocessor: I simply took the huggingface T5 pre-training example and used the dataloader from there: https://github.com/mheinzinger/ProstT5/blob/main/scripts/pretraining_scripts/pretraining_stage1_MLM.py
So in summary, if you want to stick closely to the original pre-training, you can use:
Input: E V <extra_id_0> L <extra_id_0> E S G <extra_id_0> E </s>
Label: <pad> E V Q L V E S G A E </s>
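To make this concrete, here is a minimal sketch (not the actual pre-training code) of how such a pair could be fed to the huggingface model; the checkpoint name Rostlab/prot_t5_xl_uniref50 and the use of T5ForConditionalGeneration are assumptions for illustration only:

```python
# Minimal sketch (not the original training script): one input/label pair in the
# format above, fed to T5ForConditionalGeneration to get a denoising loss.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5ForConditionalGeneration.from_pretrained("Rostlab/prot_t5_xl_uniref50")

masked = "E V <extra_id_0> L <extra_id_0> E S G <extra_id_0> E"  # corrupted input
target = "E V Q L V E S G A E"                                   # original sequence

enc = tokenizer(masked, return_tensors="pt")               # tokenizer appends </s>
labels = tokenizer(target, return_tensors="pt").input_ids  # full sequence + </s>

# The decoder input (<pad> + right-shifted labels) is built internally by the model
# from `labels`, so the leading <pad> does not have to be added by hand here.
out = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)
print(float(out.loss))
```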
However, depending on the amount of data you have available, you can also pre-train on the original T5 pre-training task (span corruption with span lengths > 1). I did this successfully in ProstT5 finetuning (see the link above).
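For reference, a rough sketch of what that span-corruption format looks like; the real dataloader is in the script linked above, and the spans here are hand-picked purely for illustration:

```python
# Rough sketch of the standard T5 span-corruption format (span lengths > 1).
# Each removed span is replaced by one sentinel in the input, and the label
# contains the sentinels followed by the removed residues.
def span_corrupt(tokens, spans):
    """spans: list of (start, length) pairs; returns (input_tokens, label_tokens)."""
    inp, lab, i, sentinel = [], [], 0, 0
    masked = {start: length for start, length in spans}
    while i < len(tokens):
        if i in masked:                          # replace the whole span by one sentinel
            inp.append(f"<extra_id_{sentinel}>")
            lab.append(f"<extra_id_{sentinel}>")
            lab.extend(tokens[i:i + masked[i]])  # label keeps the removed residues
            i += masked[i]
            sentinel += 1
        else:
            inp.append(tokens[i])
            i += 1
    lab.append(f"<extra_id_{sentinel}>")         # closing sentinel, as in T5 pre-training
    return inp + ["</s>"], lab + ["</s>"]

src = "E V Q L V E S G A E".split()
inp, lab = span_corrupt(src, spans=[(2, 3), (8, 1)])   # mask "Q L V" and "A"
print(" ".join(inp))   # E V <extra_id_0> E S G <extra_id_1> E </s>
print(" ".join(lab))   # <extra_id_0> Q L V <extra_id_1> A <extra_id_2> </s>
```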
One thing I would recommend if you do not aim to use your finetuned model for generation: simply take the ProtT5 encoder-only part and use the huggingface MLM example with MASK-token=<extra_id_0>.
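Something along these lines (only a sketch under my assumptions, e.g. a freshly initialized linear head on top of the encoder and <extra_id_0> as the mask token; the huggingface MLM example script is more complete):

```python
# Sketch of the encoder-only MLM route: ProtT5 encoder + a small LM head,
# with the loss computed on masked positions only. This is not the huggingface
# example script itself, and the linear head here still needs to be trained.
import torch
from torch import nn
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
encoder = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")
lm_head = nn.Linear(encoder.config.d_model, tokenizer.vocab_size)

mask_id = tokenizer.convert_tokens_to_ids("<extra_id_0>")   # assumed MASK token

batch = tokenizer(["E V Q L V E S G A E"], return_tensors="pt")
labels = batch.input_ids.clone()

# For this toy example, mask the residue at position 2 ("Q"); in practice you
# would randomly mask ~15% of the residues per sequence.
masked = torch.zeros_like(labels, dtype=torch.bool)
masked[0, 2] = True
batch.input_ids[masked] = mask_id
labels[~masked] = -100                                      # -100 is ignored by the loss

hidden = encoder(input_ids=batch.input_ids,
                 attention_mask=batch.attention_mask).last_hidden_state
logits = lm_head(hidden)
loss = nn.CrossEntropyLoss()(logits.view(-1, logits.size(-1)), labels.view(-1))
print(float(loss))
```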
Thank you for the quick and complete answer! And could you please confirm that I am correct about the postprocessor (cf. the last paragraph of my message)?
Hello,
Firstly, thank you all for your work.
I am struggling to understand how to fine-tune T5.
In #113, it is mentioned that there are two EOS tokens (one for the encoder, one for the decoder). However, I can only see one EOS token.
#113 also references another answer from #137, which is strange.
There are many other T5 fine-tuning questions in the GitHub issues, I think because the instructions are not clear.
Combining these two contradictory sources, I think the correct way to do it would be (using the example "E V Q L V E S G A E"):
E V <extra_id_0> L <extra_id_1> E S G <extra_id_2> E </s>
<pad> E V Q L V E S G A E </s>
Is that how the model was trained? If yes, it would be very helpful to put this on the huggingface hub page.
edit: Another question: does the tokenizer include a postprocessor? It seems not:
(Pdb) tokenizer.post_processor
*** AttributeError: 'T5Tokenizer' object has no attribute 'post_processor'
Does it mean all those extra tokens need to be added manually before calling tokenizer()?
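edit 2: a small sketch of what I mean (assuming the Rostlab/prot_t5_xl_uniref50 checkpoint), i.e. checking which special tokens the tokenizer inserts on its own:

```python
# Inspect what the (slow) T5Tokenizer inserts by itself: it has no post_processor
# attribute, but it should still append </s>; the <extra_id_*> sentinels only end
# up in the output because they are written into the input text here.
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
ids = tok("E V <extra_id_0> L").input_ids
print(tok.convert_ids_to_tokens(ids))
```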