Thanks for your question.
You can prepend a language tag to the input to continue pre-training on your languages.
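A minimal sketch of what that could look like with Hugging Face Transformers (my own, not taken from the official pre-training script; the checkpoint name and the `<2xx>`-style tag format are assumptions):

```python
# Sketch only: prepending a language tag for continued MLM pre-training.
# The checkpoint name and the "<2xx>" tag format below are assumptions,
# not the official IndicBERT convention; adapt them to whatever the
# pre-training script actually expects.
from transformers import AutoTokenizer, AutoModelForMaskedLM

ckpt = "ai4bharat/IndicBERTv2-MLM-only"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForMaskedLM.from_pretrained(ckpt)

# Register tags for the new languages so each one becomes a single token.
new_tags = ["<2xyz>", "<2abc>"]  # placeholder codes for your languages
tokenizer.add_special_tokens({"additional_special_tokens": new_tags})
model.resize_token_embeddings(len(tokenizer))

# Prepend the tag to every training sentence before tokenisation,
# then continue masked-language-model training as usual.
example = "<2xyz> " + "a sentence in the new language"
inputs = tokenizer(example, return_tensors="pt")
```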
Regarding the question on translation, I'm not entirely sure how you plan to use this model for translation. This is an encoder-only model and will not be able to do any generation (unless you are referring to initialising the decoder weights with the encoder weights as well).
For MT, I have something like this in mind:
https://jlibovicky.github.io/2020/03/05/MT-Weekly-BERT-for-MT.html
https://aclanthology.org/D19-5611.pdf
It might not be as good as using, say, mBART for MT, but it might be a good baseline. What is your opinion?
Thanks for the info about how to use language code to continue training.
This looks interesting, I have not tried this before so I don't have an opinion on this.
But if you do this experiment, please do share the results with us.
Happy to help if you have any further questions with the model.
You can actually load encoder-only checkpoints into encoder-decoder models in Hugging Face Transformers like this. But note that the cross-attention weights are initialized randomly, so you need a decent amount of data to train them.
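A minimal sketch of that warm-starting, assuming a BERT-style IndicBERT v2 checkpoint (the checkpoint name is an assumption; substitute the checkpoint you actually continue pre-training from):

```python
# Sketch only: warm-starting a seq2seq model from an encoder-only checkpoint.
# The checkpoint name is an assumption; use your own continued-pre-training output.
from transformers import AutoTokenizer, EncoderDecoderModel

ckpt = "ai4bharat/IndicBERTv2-MLM-only"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# Encoder and decoder are both initialised from the encoder-only checkpoint;
# the cross-attention weights do not exist in that checkpoint, so they start
# random and need a reasonable amount of parallel data to train.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(ckpt, ckpt)

# Standard seq2seq settings for a BERT-style tokenizer.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```

From there you can fine-tune the model on your parallel data, for example with `Seq2SeqTrainer`.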
Yes, it makes sense. Thanks for the link.
I was busy travelling for the last few days.
Sure, I will share the results with you once I have tried this out.
I have some data for three low-resource languages. Two of them are not in the list of 24 languages covered by IndicBERT v2, and for one of them I may have some more data. I want to continue training from the v2 checkpoint on this data. There is a pre-training script for IndicBERT. Could you please help me with how to do this? In particular, I am not clear about how to use the language codes to continue pre-training. My purpose is to use the model for MT as well as for some basic NLP tasks like POS tagging, NER, etc.
Should the pre-training be different for multilingual parallel corpora and multilingual monolingual corpora? Or will only the fine-tuning be different? In either case, how should I proceed? I don't have prior experience working with BERT.