Error encountered while training IndicBert on cvit-mkb dataset

AI4Bharat / Indic-BERT-v1

Indic-BERT-v1: BERT-based Multilingual Model for 11 Indic Languages and Indian-English. For latest Indic-BERT v2, check: https://github.com/AI4Bharat/IndicBERT

https://indicnlp.ai4bharat.org

MIT License


Error encountered while training IndicBert on cvit-mkb dataset #3

Closed: tushar117 closed this issue 4 years ago

tushar117 commented 4 years ago

Hi authors, first of all, it's great to have GLUE benchmarks for Indian languages, and thanks for the great effort.

Problem faced: While executing the code for the cvit-mkb (mann-ki-baat) dataset, I ran into an issue. It seems that the ManKiBaat dataset processor module doesn't have a method named get_labels, which causes the code to terminate.

To get the inference results, I used the following command:

    argvec = [
        '--lang', 'hi',
        '--dataset', 'cvit-mkb',
        '--model', 'ai4bharat/indic-bert',
        '--iglue_dir', '../indic-glue',
        '--output_dir', '../outputs',
        '--max_seq_length', '128',
        '--learning_rate', '2e-5',
        '--num_train_epochs', '3',
        '--train_batch_size', '32'
    ]

    finetune_main(argvec)

I have another concern regarding the cvit-mkb dataset for cross-lingual sentence retrieval. The ManKiBaat dataset processor module supports only the 'en' and 'in' modes, but 'in' does not match the language code of any of the languages mentioned in the IndicNLPSuite paper.

I am also attaching the error log I encountered during inference: cvit-mkb-error
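For readers who have not seen this failure mode, the missing-method error described above can be reproduced in miniature. This is only a sketch and not the Indic-BERT code: `LabelledProcessor`, `BrokenMkbProcessor`, `resolve_labels`, and `call_unguarded` are hypothetical names, and the guard shown is just one possible way to handle a processor that has no `get_labels`.

```python
class LabelledProcessor:
    """Hypothetical processor for a classification task with labels."""

    def get_labels(self):
        return ["0", "1"]


class BrokenMkbProcessor:
    """Hypothetical stand-in for a processor that lacks get_labels."""

    def get_examples(self, mode):
        return []


def call_unguarded(processor):
    """Call get_labels directly and return the error message, if any."""
    try:
        processor.get_labels()
    except AttributeError as err:
        # This is the kind of error that terminates an unguarded run.
        return str(err)
    return None


def resolve_labels(processor):
    """Return the processor's labels, or an empty list if it defines none.

    Assumption: a retrieval-style task needs no label set, so an empty
    list is an acceptable fallback.
    """
    get_labels = getattr(processor, "get_labels", None)
    if not callable(get_labels):
        return []
    return get_labels()


if __name__ == "__main__":
    print(call_unguarded(BrokenMkbProcessor()))
    print(resolve_labels(BrokenMkbProcessor()))
    print(resolve_labels(LabelledProcessor()))
```

Looking the method up with `getattr` keeps the calling code working for both labelled and unlabelled datasets, at the cost of hiding a genuinely missing method, so defining `get_labels` on every processor may still be the cleaner choice.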

divkakwani commented 4 years ago

Hey Tushar, thanks for pointing it out. It seems there was a bug that recently cropped up in the code while we were doing some refactoring.

We have fixed the issue you mentioned. Regarding your concern about mode, it is basically a parameter that we use to specify which part of the dataset to use. For example, modes can be train, dev or test, and in the case of CVIT-MKB, which only has a test set, we overloaded the concept of mode to specify which side of the parallel corpus to use, English or Indic. As for the language you want to use, it is specified through the lang parameter that is inserted in argvec in the Colab notebook.
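The overloaded mode described above can be pictured with a small sketch. It is an illustration under stated assumptions, not the repository's implementation: `ParallelPair` and `ToyMkbProcessor` are hypothetical names, and only the mode values 'en' and 'in' and the separate lang setting come from this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParallelPair:
    """One aligned sentence pair (hypothetical container)."""
    en: str
    indic: str


class ToyMkbProcessor:
    """Toy processor: mode selects a side of the corpus, not a split."""

    MODES = ("en", "in")  # 'en' = English side, 'in' = Indic side

    def __init__(self, lang: str, pairs: List[ParallelPair]):
        # lang picks the Indic language (e.g. 'hi'); it is independent of mode.
        self.lang = lang
        self.pairs = pairs

    def get_examples(self, mode: str) -> List[str]:
        if mode not in self.MODES:
            raise ValueError(f"mode must be one of {self.MODES}, got {mode!r}")
        if mode == "en":
            return [pair.en for pair in self.pairs]
        return [pair.indic for pair in self.pairs]


if __name__ == "__main__":
    processor = ToyMkbProcessor("hi", [ParallelPair("Hello", "नमस्ते")])
    print(processor.get_examples("en"))
    print(processor.get_examples("in"))
```

In this reading, 'in' is a label for the Indic side of the test set rather than a language code, which is why it does not appear among the languages in the paper.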

Feel free to close this issue if the latest commits have resolved your concern.

tushar117 commented 4 years ago

Thanks, Divyanshu, for fixing this issue and clarifying the doubt related to model parameters. Are you planning to create documentation for the Indic-BERT code? It would be great to have for special cases like this.

divkakwani commented 4 years ago

Yup, we are planning to add the documentation. Feel free to mention your use cases in issue #4 so that we can tailor the documentation to your needs. It might take some time on our end since I'm currently a bit occupied with some other work.

Thanks again for pointing out the issue. I'd be glad to hear about more bugs from you :).