AI4Bharat / Indic-BERT-v1

Indic-BERT-v1: BERT-based Multilingual Model for 11 Indic Languages and Indian-English. For the latest Indic-BERT v2, check: https://github.com/AI4Bharat/IndicBERT
https://indicnlp.ai4bharat.org
MIT License

Documentation to implement model on an external dataset #8

Closed: sciencepal closed this issue 3 years ago

sciencepal commented 3 years ago

Hi All,

Is there any documentation for training the model on our own dataset? I wanted to know what dataset format the model expects and how we can fine-tune the model on that dataset.

divkakwani commented 3 years ago

Hey, what kind of dataset and task do you want to fine-tune the model on? Currently, the CLI interface and Colab notebook only support fine-tuning on pre-defined tasks with fixed datasets. I'm planning to add support for specifying custom datasets too. Can you let me know your specific requirement?

Let's keep this issue open until the feature is implemented.

sciencepal commented 3 years ago

Hi Divyanshu, I was planning on using the BERT model to build a context-sensitive spelling error correction method for Indic languages. I have not created or looked for a dataset yet, but I was wondering if there is any pre-defined task for this purpose. If not, I would like to train BERT on a custom dataset along these lines: `[CLS] sentence with incorrect word [SEP]` -> output is the incorrect word and a list of possible replacement words based on context.
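For reference, this masked-LM framing can be prototyped today with the Hugging Face `fill-mask` pipeline. A minimal sketch, assuming the `ai4bharat/indic-bert` checkpoint on the Hub and a known suspect word (the example sentence and word are made up for illustration):

```python
from transformers import pipeline

# Sketch only: mask the suspected incorrect word and let the pretrained
# masked LM propose context-sensitive replacements.
fill_mask = pipeline("fill-mask", model="ai4bharat/indic-bert")

sentence = "I want to corect this sentence."  # hypothetical input; 'corect' is the suspect word
masked = sentence.replace("corect", fill_mask.tokenizer.mask_token, 1)

# Each prediction carries a proposed token and the model's score for it.
for prediction in fill_mask(masked, top_k=5):
    print(prediction["token_str"], prediction["score"])
```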

anoopkunchukuttan commented 3 years ago

To the best of my knowledge, there is no such dataset. If you find one, please do add it to the Indic NLP Catalog.

BTW, check this: http://jkhighereducation.nic.in/jkrjmcs/issue1/15.pdf

divkakwani commented 3 years ago

Hey, I am working on adding support for specifying custom datasets. Will get back to you in a couple of days.

gowtham1997 commented 3 years ago

For spelling correction, maybe you can try https://github.com/R1j1t/contextualSpellCheck. It does exactly what you want: it suggests contextual spellings based on a BERT model (see the model suggestions below).

[Screenshot: contextualSpellCheck model suggestions]
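For context, basic usage is roughly as the library's README shows: it registers itself as a spaCy pipeline component and exposes its results through custom `Doc` extensions (the example sentence is illustrative):

```python
import spacy
import contextualSpellCheck

# contextualSpellCheck plugs into an existing spaCy pipeline.
nlp = spacy.load("en_core_web_sm")
contextualSpellCheck.add_to_pipe(nlp)

doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")

print(doc._.performed_spellCheck)  # True if any correction was attempted
print(doc._.outcome_spellCheck)    # the corrected sentence
```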

The library works by finding misspelled words using heuristics: words not in the vocabulary, words that are not named entities, etc. (see the picture below; a rough sketch of the vocabulary check follows it). It might need tweaking to identify named entities, since spaCy does not have NER models for Indian languages.

[Screenshot: misspelled-word detection heuristics]
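A sketch of the vocabulary part of that heuristic, assuming the `ai4bharat/indic-bert` SentencePiece tokenizer (this whole-word check is a simplification; real detection would also filter out entities, numbers, and punctuation):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
vocab = set(tokenizer.get_vocab())

def suspect_words(sentence):
    # SentencePiece marks word starts with '▁'; a word the model knows as a
    # whole unit appears in the vocab as a single '▁word' piece. Words that
    # fail this check become candidates for spell checking.
    return [w for w in sentence.split() if "▁" + w.lower() not in vocab]

print(suspect_words("The weather is beutiful today."))  # illustrative input
```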

Once the misspelled words are identified, they are masked and the BERT model (this can be any pretrained Hugging Face model for the language, such as Indic-BERT) is used to generate candidates. From these candidates, the ones with the smallest edit distance to the original word are selected and displayed to the user.
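Putting the two steps together, a self-contained sketch of that mask-then-rank idea (again assuming a Hub checkpoint; the edit-distance function is a standard dynamic-programming implementation, not the library's own):

```python
from transformers import pipeline

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

fill_mask = pipeline("fill-mask", model="ai4bharat/indic-bert")

def correct(sentence, misspelled, top_k=20):
    # Mask the flagged word, generate candidates with the masked LM,
    # then rank them by edit distance to the original spelling.
    masked = sentence.replace(misspelled, fill_mask.tokenizer.mask_token, 1)
    candidates = (c["token_str"].strip() for c in fill_mask(masked, top_k=top_k))
    return sorted(candidates, key=lambda w: levenshtein(w, misspelled))[:5]

print(correct("The weather is beutiful today.", "beutiful"))  # illustrative input
```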

sciencepal commented 3 years ago

@gowtham1997 Thanks a lot. This is exactly what I had in mind. I will definitely look into this!