AI4Bharat / IndicBERT

Pretraining, fine-tuning and evaluation scripts for IndicBERT-v2 and IndicXTREME
https://ai4bharat.iitm.ac.in/language-understanding
MIT License

Extending IndicBert V2 #1

Closed · singhakr closed this issue 1 year ago

singhakr commented 1 year ago

I have some data for three low-resource languages; two of them are not in the list of 24 languages covered by IndicBERT v2, and for one I may have some additional data. I want to continue training on this data from the v2 checkpoint. There is a pre-training script for IndicBERT. Could you please help me with how to do this? In particular, I am not clear about how to use the language codes to continue pre-training. My purpose is to use it for MT as well as for some basic NLP tasks such as POS tagging, NER, etc.

Should the pre-training be different for multilingual parallel corpora and multilingual monolingual corpora? Or will only the fine-tuning be different? In either case, how should I proceed? I don't have prior experience working with BERT.

sumanthd17 commented 1 year ago

Thanks for your question.

You can prepend a language tag to the sentences you have and continue training. We have not released the original TF ckpts, only the HF ckpts. I think you should be able to use this code to try further fine-tuning.
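For illustration, here is a minimal sketch of continued MLM training from an HF checkpoint with language tags prepended. The checkpoint name `ai4bharat/IndicBERTv2-MLM-only` and the `<lang>` tag format are assumptions; check the repo's pretraining script for the exact conventions IndicBERT v2 uses.

```python
# Minimal sketch: continue masked-language-model training from the HF checkpoint.
# The checkpoint id and the "<lang> sentence" tag format are ASSUMPTIONS.
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "ai4bharat/IndicBERTv2-MLM-only"  # assumed HF checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Prepend a language tag to each monolingual sentence (hypothetical tags).
raw_sentences = [
    ("xyz", "A sentence in the first low-resource language."),
    ("abc", "A sentence in the second low-resource language."),
]
dataset = Dataset.from_dict(
    {"text": [f"<{lang}> {sent}" for lang, sent in raw_sentences]}
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM objective with 15% random masking.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="indicbert-continued", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```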

WRT the question on translation, I'm not entirely sure how you plan to use this model for translation. This is an encoder-only model and will not be able to do any generation (unless you are referring to initialising the decoder weights with the encoder weights as well).

singhakr commented 1 year ago

For MT, I have something like this in mind:

https://jlibovicky.github.io/2020/03/05/MT-Weekly-BERT-for-MT.html

https://aclanthology.org/D19-5611.pdf

It might not be as good as using, say, mBART for MT, but it might be a good baseline. What is your opinion?

Thanks for the info about how to use language codes to continue training.

sumanthd17 commented 1 year ago

This looks interesting, I have not tried this before so I don't have an opinion on this.

But if you do this experiment, please do share the results with us.

Happy to help if you have any further questions with the model.

rahular commented 1 year ago

You can actually load encoder-only checkpoints into encoder-decoder models on HF like this. But note that the cross-attentions are initialized randomly and you need a decent amount of data to train them.
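As a rough sketch of that warm-starting approach via `EncoderDecoderModel` (the checkpoint id is an assumption; note the cross-attention layers start from random weights and need fine-tuning data before the model translates anything useful):

```python
# Sketch: warm-start an encoder-decoder model from an encoder-only checkpoint.
# The checkpoint id is an ASSUMPTION; cross-attention weights are randomly
# initialised and must be trained on parallel data.
from transformers import AutoTokenizer, EncoderDecoderModel

checkpoint = "ai4bharat/IndicBERTv2-MLM-only"  # assumed HF checkpoint id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Initialise both encoder and decoder from the same BERT-style weights.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(checkpoint, checkpoint)

# The decoder needs explicit start/pad token ids for training and generation.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("A source sentence to translate.", return_tensors="pt")
labels = tokenizer("A target sentence.", return_tensors="pt").input_ids

# A forward pass with labels returns the cross-entropy loss to fine-tune on.
loss = model(**inputs, labels=labels).loss
```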

singhakr commented 1 year ago

Yes, it makes sense. Thanks for the link.

singhakr commented 1 year ago

I was busy with travel for the last few days.

Sure, I will share the results with you once I have tried this out.
