LazarusNLP / IndoT5

T5 Language Models for the Indonesian Language!
Apache License 2.0

Does training a new tokenizer affect the pre-trained model? #1

Open dinhngoc267 opened 2 months ago

dinhngoc267 commented 2 months ago

Hi,

It's nice to see your repository. I can see that you train a new tokenizer on your new corpus, but I wonder: does that change the token IDs of the original base model? If it does, it might affect the weights of the original pre-trained model, since you continue pre-training from the pre-trained checkpoint, right?
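To illustrate the concern with a toy example (this is not the repository's code, just a hypothetical word-level tokenizer): if IDs are assigned in order of first appearance, two corpora produce different ID mappings for the same token, so the rows of a pre-trained embedding matrix would no longer line up.

```python
# Toy sketch (hypothetical, not IndoT5's actual tokenizer): show that
# training a tokenizer on a different corpus reassigns token IDs.

def build_vocab(corpus):
    """Assign token IDs in order of first appearance (toy word-level vocab)."""
    vocab = {}
    for sentence in corpus:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

# Hypothetical corpora: an original mixed corpus vs. a new Indonesian-first one.
old_vocab = build_vocab(["the quick brown fox", "saya suka kopi"])
new_vocab = build_vocab(["saya suka kopi", "the quick brown fox"])

# The same token gets a different ID under the new tokenizer, so the
# embedding row the old checkpoint learned for "saya" would be looked up
# by the wrong ID if one continued pre-training with the new vocabulary.
print(old_vocab["saya"], new_vocab["saya"])  # → 4 0
```

This is exactly why continuing from a pre-trained checkpoint with a freshly trained tokenizer is problematic unless the embeddings are re-initialized or remapped.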

w11wo commented 1 month ago

Hi @dinhngoc267, very sorry for the late reply. I might have missed the notification for some reason.

To clarify, we did not continue training from a pre-trained model. IndoT5 was trained entirely from scratch with a new vocabulary/tokenizer. We only used our own model as a starting point when fine-tuning on downstream tasks like QA and summarization, and we kept the same vocabulary for that step.

Hope that clears it up.