Closed — sciencepal closed this issue 3 years ago
Hey, what kind of dataset and task do you want to fine-tune the model on? Currently, the CLI interface and Colab notebook only support fine-tuning on pre-defined tasks with fixed datasets. I'm planning to add support for specifying custom datasets too. Can you let me know your specific requirements?
Let's keep this issue open until the feature is implemented.
Hi Divyanshu, I was planning to use a BERT model to build a context-sensitive spelling error correction method for Indic languages. I have not created or looked for a dataset yet, but I was wondering if there is any pre-defined task for this purpose. If not, I would like to fine-tune BERT on a custom dataset along these lines: [CLS] sentence with incorrect word [SEP] -> output is the incorrect word and a list of possible replacement words based on context.
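For concreteness, a custom dataset in that format could be built by masking the known-incorrect word so a masked-LM fine-tuning objective can be applied. This is just a minimal sketch of the idea (whitespace tokenisation and `make_masked_pair` are my own assumptions, not part of any library):

```python
def make_masked_pair(sentence, incorrect_word, mask_token="[MASK]"):
    # Build one training pair: the sentence with the incorrect word
    # replaced by the mask token, plus the word itself as the target.
    # Naive whitespace tokenisation for illustration only; a real
    # pipeline would use the model's own tokenizer.
    tokens = sentence.split()
    masked = [mask_token if t == incorrect_word else t for t in tokens]
    text = "[CLS] " + " ".join(masked) + " [SEP]"
    return text, incorrect_word

pair = make_masked_pair("I recieved the letter", "recieved")
```

A real fine-tuning setup would also need negative examples (sentences with no error) and subword-aware masking, but the pair structure above matches the input/output scheme described.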
To the best of my knowledge, there is no such dataset. If you find one, please do add it to the Indic NLP Catalog.
BTW, check this: http://jkhighereducation.nic.in/jkrjmcs/issue1/15.pdf
Hey, I am working on adding support for specifying custom datasets. Will revert in a couple of days.
For spelling correction, maybe you can try https://github.com/R1j1t/contextualSpellCheck. It does exactly what you want, suggesting contextual spellings based on a BERT model (see the model suggestions below).
The library works by finding misspelled words using heuristics: words not in the vocabulary, not a named entity, etc. (see the picture below). It might need tweaking to get named entities, as spaCy does not have NER models for Indian languages.
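The vocabulary-lookup heuristic can be sketched in a few lines of plain Python. This is a hypothetical illustration of the idea, not contextualSpellCheck's actual API; the vocabulary and named-entity sets here would in practice come from the model's tokenizer and spaCy's NER:

```python
def find_misspellings(tokens, vocab, named_entities=frozenset()):
    # Flag tokens that are neither in the vocabulary nor recognised
    # as named entities -- the same heuristic the library describes.
    return [t for t in tokens
            if t.lower() not in vocab and t not in named_entities]

vocab = {"the", "cat", "sat", "on", "mat"}
flagged = find_misspellings(["The", "cta", "sat", "on", "the", "mat"], vocab)
```

Without an NER model for the target language, proper nouns would get flagged as misspellings, which is why the tweaking mentioned above matters for Indian languages.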
Once the misspelled words are identified, they are masked and the BERT model (this can be any pretrained Hugging Face model for the particular language, like IndicBERT) is used to generate candidates. Out of the candidates, the ones with the least edit distance are selected for display to the user.
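The candidate-ranking step can be sketched as below. The `candidates` list stands in for the fill-mask output a BERT model would produce, and `rank_candidates` is a hypothetical helper, not part of contextualSpellCheck:

```python
def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming,
    # keeping only the previous row to save memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def rank_candidates(misspelled, candidates, top_k=3):
    # Keep the model's suggestions closest in spelling to the
    # original token, as the library's last step describes.
    return sorted(candidates, key=lambda c: edit_distance(misspelled, c))[:top_k]

ranked = rank_candidates("recieved", ["received", "obtained", "got", "receives"])
```

Here "received" ranks first because it is only two edits away from "recieved"; combining the language model's contextual score with edit distance is what makes the correction context-sensitive rather than purely orthographic.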
@gowtham1997 Thanks a lot. This is exactly what I had in mind. I will definitely look into this!
Hi All,
Is there any documentation on training the model on our own dataset? I wanted to know what dataset format the model expects and how we can fine-tune the model on that dataset.