AI4Bharat / IndicLID

Language Identification for Indian languages

IndicLID

Website | Downloads | Demo

IndicLID is a language identifier for all 22 Indian languages listed in the Indian constitution, covering both native-script and romanized text. It is the first LID for romanized text in Indian languages. It is a two-stage classifier: an ensemble of a fast linear classifier and a slower classifier fine-tuned from a pre-trained LM. It can predict 47 classes (24 native-script classes and 21 roman-script classes, plus English and Others). All the classes are listed below.

## Languages Supported

| Language | IndicLID Code |
| --- | --- |
| Assamese (Bengali script) | asm_Beng |
| Assamese (Latin script) | asm_Latn |
| Bangla (Bengali script) | ben_Beng |
| Bangla (Latin script) | ben_Latn |
| Bodo (Devanagari script) | brx_Deva |
| Bodo (Latin script) | brx_Latn |
| Dogri (Devanagari script) | doi_Deva |
| Dogri (Latin script) | doi_Latn |
| English (Latin script) | eng_Latn |
| Gujarati (Gujarati script) | guj_Gujr |
| Gujarati (Latin script) | guj_Latn |
| Hindi (Devanagari script) | hin_Deva |
| Hindi (Latin script) | hin_Latn |
| Kannada (Kannada script) | kan_Knda |
| Kannada (Latin script) | kan_Latn |
| Kashmiri (Perso_Arabic script) | kas_Arab |
| Kashmiri (Devanagari script) | kas_Deva |
| Kashmiri (Latin script) | kas_Latn |
| Konkani (Devanagari script) | kok_Deva |
| Konkani (Latin script) | kok_Latn |
| Maithili (Devanagari script) | mai_Deva |
| Maithili (Latin script) | mai_Latn |
| Malayalam (Malayalam script) | mal_Mlym |
| Malayalam (Latin script) | mal_Latn |
| Manipuri (Bengali script) | mni_Beng |
| Manipuri (Meetei_Mayek script) | mni_Meti |
| Manipuri (Latin script) | mni_Latn |
| Marathi (Devanagari script) | mar_Deva |
| Marathi (Latin script) | mar_Latn |
| Nepali (Devanagari script) | nep_Deva |
| Nepali (Latin script) | nep_Latn |
| Oriya (Oriya script) | ori_Orya |
| Oriya (Latin script) | ori_Latn |
| Punjabi (Gurmukhi script) | pan_Guru |
| Punjabi (Latin script) | pan_Latn |
| Sanskrit (Devanagari script) | san_Deva |
| Sanskrit (Latin script) | san_Latn |
| Santali (Ol_Chiki script) | sat_Olch |
| Sindhi (Perso_Arabic script) | snd_Arab |
| Sindhi (Latin script) | snd_Latn |
| Tamil (Tamil script) | tam_Tamil |
| Tamil (Latin script) | tam_Latn |
| Telugu (Telugu script) | tel_Telu |
| Telugu (Latin script) | tel_Latn |
| Urdu (Perso_Arabic script) | urd_Arab |
| Urdu (Latin script) | urd_Latn |
| Other | other |
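
Each code in the table above joins a three-letter language tag and a script tag with an underscore (`other` is the one exception). A minimal sketch of splitting a code and tallying a few classes by script follows; the helper name `split_code` is hypothetical and is not part of the IndicLID package.

```python
from collections import Counter

def split_code(code):
    """Split an IndicLID class code into (language, script).

    The catch-all class "other" has no script part, so it maps to ("other", None).
    """
    if code == "other":
        return ("other", None)
    lang, script = code.split("_", 1)
    return (lang, script)

# A small sample of codes copied from the table above.
sample = ["hin_Deva", "hin_Latn", "kas_Arab", "kas_Deva", "kas_Latn", "eng_Latn", "other"]

# Count how many of the sampled classes use each script.
script_counts = Counter(split_code(c)[1] for c in sample)

print(split_code("kas_Arab"))   # ('kas', 'Arab')
print(script_counts["Latn"])    # 3
```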

## Evaluation Results

IndicLID is evaluated on the Bhasha-Abhijnaanam benchmark, which is released along with this work. For native-script text, IndicLID has better language coverage than existing LIDs and is competitive with or better than them. The IndicLID model is 10 times faster and 4 times smaller than the NLLB model. It also establishes strong baseline results on roman-script text. For more details, refer to our paper.

### Native LID Results

The following table compares the IndicLID-FTN model with the NLLB and CLD3 models. We restrict each comparison to the languages that the other model has in common with IndicLID (the count of common languages is given in brackets). Throughput is the number of sentences per second.

| Model | Precision | Recall | F1-score | Accuracy | Throughput | Model Size |
| --- | --- | --- | --- | --- | --- | --- |
| IndicLID-FTN-8-dim (24) | 0.98 | 0.99 | 0.98 | 0.98 | 30,303 | 318M |
| IndicLID-FTN-4-dim (12) | 0.99 | 0.98 | 0.99 | 0.98 | 47,619 | 208M |
| IndicLID-FTN-8-dim (12) | 1.00 | 0.99 | 0.99 | 0.98 | 33,333 | 318M |
| CLD3 (12) | 0.99 | 0.98 | 0.98 | 0.98 | 4,861 | - |
| IndicLID-FTN-4-dim (20) | 0.98 | 0.98 | 0.98 | 0.98 | 41,666 | 208M |
| IndicLID-FTN-8-dim (20) | 0.98 | 0.99 | 0.98 | 0.98 | 29,411 | 318M |
| NLLB (20) | 0.99 | 0.99 | 0.99 | 0.98 | 4,970 | 1.1G |
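
Throughput in the table is the number of sentences processed per second. A minimal sketch of how such a figure could be measured for any prediction function follows. `measure_throughput` and the stand-in `dummy_predict` are hypothetical, and the timing setup behind the reported numbers (hardware, batch size) is not described here, so results from this sketch are not comparable to the table.

```python
import time

def measure_throughput(predict_fn, sentences, repeats=3):
    """Return sentences per second, taking the best of `repeats` timed runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for sentence in sentences:
            predict_fn(sentence)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading on very coarse clocks.
    return len(sentences) / max(best, 1e-9)

def dummy_predict(sentence):
    # Stand-in for a real LID model: labels everything as "other".
    return "other"

sentences = ["namaste duniya"] * 1000
rate = measure_throughput(dummy_predict, sentences)
print(f"{rate:,.0f} sentences/second")
```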

### Roman LID Results

The following table presents the results of different model variants on the romanized test set. Throughput is the number of sentences per second.

| Model | Precision | Recall | F1-score | Accuracy | Throughput | Model Size |
| --- | --- | --- | --- | --- | --- | --- |
| IndicLID-FTR (dim-8) | 0.63 | 0.78 | 0.63 | 0.71 | 37,037 | 357 M |
| IndicLID-BERT (unfreeze-layer-1) | 0.73 | 0.84 | 0.75 | 0.80 | 3 | 1.1 GB |
| IndicLID (threshold-0.6) | 0.73 | 0.85 | 0.75 | 0.80 | 10 | 1.4 GB |
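
The `threshold-0.6` variant in the last row appears to correspond to the two-stage design: the fast linear classifier answers first, and the slower fine-tuned LM is consulted only when the fast model is not confident enough. A minimal sketch of that routing follows, assuming each classifier is a callable that returns a `(label, confidence)` pair. The function names, the toy classifiers, and the exact comparison rule are illustrative assumptions rather than the repository's code.

```python
def two_stage_predict(text, fast_clf, slow_clf, threshold=0.6):
    """Route text through a fast classifier, falling back to a slow one.

    fast_clf and slow_clf are hypothetical callables: text -> (label, confidence).
    """
    label, confidence = fast_clf(text)
    if confidence > threshold:
        return label, confidence, "fast"
    # Low confidence: defer to the slower model.
    label, confidence = slow_clf(text)
    return label, confidence, "slow"

# Toy stand-ins so the sketch runs without any model files.
def toy_fast(text):
    return ("hin_Latn", 0.9) if "namaste" in text else ("mar_Latn", 0.4)

def toy_slow(text):
    return ("mar_Latn", 0.8)

print(two_stage_predict("namaste dosto", toy_fast, toy_slow))  # ('hin_Latn', 0.9, 'fast')
print(two_stage_predict("kasa kay", toy_fast, toy_slow))       # ('mar_Latn', 0.8, 'slow')
```

Under this reading, raising the threshold sends more inputs to the slow model, which trades throughput for accuracy.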

## Resources

Download Bhasha-Abhijnaanam Test Set

Huggingface

Github

Download Training Data

Native Script Training Data

Roman Script Training Data

Parallel native-roman train pairs

Download IndicLID model

IndicLID-FTN v1.0

IndicLID-FTR v1.0

IndicLID-BERT v1.0

## Running Inference

### Interface

Inference Notebook --> Open In Colab

## Training model

### Setting up your environment

```bash
pip3 install fasttext
pip3 install transformers
```

### Training procedure and code

We train three models separately; together they form the components of the IndicLID model. Please refer to the paper for more architectural details.

We use fastText to train the IndicLID-FTR and IndicLID-FTN components. The following snippet shows how we train our fastText models.

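    import fasttext

    # Train a supervised fastText classifier. Autotune evaluates candidate
    # hyperparameters on the validation file for the given duration (seconds).
    model = fasttext.train_supervised(
        input='../corpus/train_combine.txt',
        loss='hs',                     # hierarchical softmax
        verbose=1,
        dim=8,                         # embedding dimension
        autotuneValidationFile='../corpus/valid_combine.txt',
        autotuneDuration=14400 * 3,    # 43,200 s = 12 hours
    )
    model.save_model("../result/model_baseline_roman.bin")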
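
`fasttext.train_supervised` reads a plain-text file with one example per line, where each label carries the `__label__` prefix (fastText's default). A minimal sketch of writing such a file from `(label, text)` pairs follows. The helper names and the toy rows are hypothetical, and the corpus preparation scripts in this repository may do this differently.

```python
import os
import tempfile

def to_fasttext_line(label, text):
    """Format one training example as '__label__<label> <text>' on a single line."""
    # fastText treats each line as one example, so collapse newlines and extra spaces.
    clean = " ".join(text.split())
    return f"__label__{label} {clean}"

def write_corpus(rows, path):
    """Write (label, text) rows to `path` in fastText supervised format."""
    with open(path, "w", encoding="utf-8") as f:
        for label, text in rows:
            f.write(to_fasttext_line(label, text) + "\n")

# Toy rows for illustration only; real training data comes from the downloads above.
rows = [("hin_Latn", "aap kaise hain"), ("eng_Latn", "how are\nyou")]
path = os.path.join(tempfile.mkdtemp(), "train_combine.txt")
write_corpus(rows, path)

with open(path, encoding="utf-8") as f:
    print(f.read())
```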


For our IndicLID-BERT model, we fine-tune the [IndicBERT](https://github.com/AI4Bharat/IndicBERT) model on our romanized training data. The script for training the IndicLID-BERT model can be found [here](https://github.com/AI4Bharat/IndicLID/blob/master/final_runs_train/roman_model/finetuning/IndicBERT/unfreeze_layers/train.py).
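
The `unfreeze_layers` directory name suggests that only part of the encoder is fine-tuned. A minimal sketch of that idea follows, under two assumptions: that parameter names follow the `encoder.layer.<i>.` pattern used by Hugging Face BERT-style models, and that the classification head has `classifier` in its name. The helper and the toy parameter objects are hypothetical; the linked `train.py` is the authoritative version.

```python
import re
from types import SimpleNamespace

def unfreeze_last_layers(named_params, num_layers, n_unfrozen):
    """Freeze everything except the last `n_unfrozen` encoder layers and the head.

    named_params: iterable of (name, param) pairs where param has a
    `requires_grad` attribute, e.g. what `model.named_parameters()` yields in PyTorch.
    Returns the names of the parameters left trainable.
    """
    first_trainable = num_layers - n_unfrozen
    trainable = []
    for name, param in named_params:
        match = re.search(r"encoder\.layer\.(\d+)\.", name)
        in_top_layer = match is not None and int(match.group(1)) >= first_trainable
        is_head = "classifier" in name
        param.requires_grad = in_top_layer or is_head
        if param.requires_grad:
            trainable.append(name)
    return trainable

# Toy parameters standing in for a 12-layer encoder plus a classifier head.
names = ["bert.embeddings.word_embeddings.weight"]
names += [f"bert.encoder.layer.{i}.output.dense.weight" for i in range(12)]
names += ["classifier.weight"]
params = [(n, SimpleNamespace(requires_grad=True)) for n in names]

print(unfreeze_last_layers(params, num_layers=12, n_unfrozen=1))
# ['bert.encoder.layer.11.output.dense.weight', 'classifier.weight']
```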

### Evaluating trained model
Script to generate the output can be found [here](https://github.com/AI4Bharat/IndicLID/blob/master/Inference/ai4bharat/IndicLID.py).

## Directory structure

    IndicLID/
    ├── Benchmark
    │   ├── compile_final_pilot_1.py
    │   ├── create_benchmark_extra.py
    │   └── create_benchmark.py
    ├── deployement
    │   ├── test_script.py
    │   └── working
    │       └── IndicLID.py
    ├── filter_Dakshina
    │   ├── merge_final_native.py
    │   ├── merge_final.py
    │   └── separate_validation_set.py
    ├── final_runs_ACL_inference
    │   ├── analysis
    │   │   ├── language_wise
    │   │   │   ├── inference.py
    │   │   │   ├── prepare_corpus.py
    │   │   │   ├── train_fasttext.py
    │   │   │   ├── train_IndicBERT.py
    │   │   │   └── word_overlap_confustion_matrix.py
    │   │   ├── length_wise
    │   │   │   ├── acc_len_wise_analysis.py
    │   │   │   └── save_prediction_dict.py
    │   │   └── word_embeddings
    │   │       ├── PCA_cluster_embeddings.py
    │   │       ├── TSNE_cluster_embeddings.py
    │   │       └── word_neighbours.py
    │   ├── native_model
    │   │   ├── cld3_comparison
    │   │   │   ├── cld3
    │   │   │   │   ├── inference.py
    │   │   │   │   ├── inference_time_1.py
    │   │   │   │   ├── inference_time.py
    │   │   │   │   └── prepare_corpus.py
    │   │   │   ├── fasttext_4
    │   │   │   │   ├── inference.py
    │   │   │   │   ├── inference_time_1.py
    │   │   │   │   ├── inference_time.py
    │   │   │   │   └── prepare_corpus.py
    │   │   │   └── fasttext_8
    │   │   │       ├── inference.py
    │   │   │       ├── inference_time_1.py
    │   │   │       ├── inference_time.py
    │   │   │       └── prepare_corpus.py
    │   │   ├── corpus_inf_native
    │   │   │   ├── lang_stat_test_native.csv
    │   │   │   └── prepare_corpus.py
    │   │   ├── fasttext
    │   │   │   └── tune_run
    │   │   │       ├── inference.py
    │   │   │       ├── inference_time_1.py
    │   │   │       ├── inference_time.py
    │   │   │       ├── post_error_analysis.py
    │   │   │       ├── prepare_corpus.py
    │   │   │       ├── temp.sh
    │   │   │       └── train.py
    │   │   ├── finetuning
    │   │   │   ├── IndicBERT
    │   │   │   │   ├── freezed_bert_all_layer
    │   │   │   │   │   ├── inference.py
    │   │   │   │   │   ├── len_wise_analysis.py
    │   │   │   │   │   └── train.py
    │   │   │   │   └── unfreeze_layers
    │   │   │   │       ├── inference.py
    │   │   │   │       ├── len_wise_analysis.py
    │   │   │   │       ├── temp.sh
    │   │   │   │       └── train.py
    │   │   │   ├── MuRIL
    │   │   │   │   ├── freezed_bert_all_layer
    │   │   │   │   │   ├── inference.py
    │   │   │   │   │   └── train.py
    │   │   │   │   └── unfreeze_layers
    │   │   │   │       ├── inference.py
    │   │   │   │       ├── len_wise_analysis.py
    │   │   │   │       ├── temp.sh
    │   │   │   │       └── train.py
    │   │   │   └── XMLR
    │   │   │       └── freezed_bert_all_layer
    │   │   │           ├── inference.py
    │   │   │           └── train.py
    │   │   └── nllb_comparison
    │   │       ├── fasttext_4
    │   │       │   ├── inference.py
    │   │       │   ├── inference_time_1.py
    │   │       │   ├── inference_time.py
    │   │       │   └── prepare_corpus.py
    │   │       ├── fasttext_8
    │   │       │   ├── inference.py
    │   │       │   ├── inference_time_1.py
    │   │       │   ├── inference_time.py
    │   │       │   └── prepare_corpus.py
    │   │       ├── indicbert
    │   │       │   ├── inference.py
    │   │       │   └── prepare_corpus.py
    │   │       ├── indiclid_fast_4
    │   │       │   ├── 2_stage_inference.py
    │   │       │   └── prepare_corpus.py
    │   │       ├── indiclid_fast_8
    │   │       │   ├── 2_stage_inference.py
    │   │       │   └── prepare_corpus.py
    │   │       └── nllb
    │   │           ├── inference.py
    │   │           ├── inference_time_1.py
    │   │           ├── inference_time.py
    │   │           └── prepare_corpus.py
    │   ├── roman_model
    │   │   ├── corpus_inf_roman
    │   │   │   ├── lang_stat_test_roman.csv
    │   │   │   ├── lang_stat_test_romanized_indicxlit.csv
    │   │   │   └── prepapre_test_set.py
    │   │   ├── fasttext
    │   │   │   └── tune_run
    │   │   │       ├── inference.py
    │   │   │       ├── inference_time_1.py
    │   │   │       └── inference_time.py
    │   │   └── finetuning
    │   │       ├── IndicBERT
    │   │       │   ├── freezed_bert_all_layer
    │   │       │   │   ├── inference.py
    │   │       │   │   ├── len_wise_analysis.py
    │   │       │   │   └── train.py
    │   │       │   └── unfreeze_layers
    │   │       │       ├── inference.py
    │   │       │       ├── inference_time_1.py
    │   │       │       ├── inference_time.py
    │   │       │       ├── len_wise_analysis.py
    │   │       │       ├── temp.sh
    │   │       │       └── train.py
    │   │       ├── MuRIL
    │   │       │   ├── freezed_bert_all_layer
    │   │       │   │   ├── inference.py
    │   │       │   │   ├── slurm-120303.out
    │   │       │   │   └── train.py
    │   │       │   └── unfreeze_layers
    │   │       │       ├── inference.py
    │   │       │       ├── len_wise_analysis.py
    │   │       │       ├── temp.sh
    │   │       │       └── train.py
    │   │       └── XMLR
    │   │           └── freezed_bert_all_layer
    │   │               ├── inference.py
    │   │               └── train.py
    │   ├── two_stage
    │   │   ├── IndicBERT
    │   │   │   ├── 2_stage_inference.py
    │   │   │   ├── display_confusion_matrix.py
    │   │   │   ├── inference_time_1.py
    │   │   │   ├── inference_time.py
    │   │   │   └── prepare_scored_dakshina_romanized.py
    │   │   └── MuRIL
    │   │       └── 2_stage_inference.py
    │   └── two_stage_native
    │       └── IndicBERT_fasttext_8
    │           ├── 2_stage_inference.py
    │           ├── display_confusion_matrix.py
    │           └── prepare_scored_dakshina_romanized.py
    ├── nueral_net
    │   ├── experiments
    │   │   ├── skeleton
    │   │   │   ├── create_sen_embed.py
    │   │   │   ├── inference.py
    │   │   │   ├── prepare_corpus.py
    │   │   │   └── train.py
    │   │   └── skeleton_transform
    │   │       ├── inference.py
    │   │       ├── prepare_corpus.py
    │   │       └── train.py
    │   └── experiments_tune
    │       └── skeleton_tuning
    │           ├── inference.py
    │           ├── prepare_corpus.py
    │           └── train.py
    ├── preprocess_indiccorp
    │   ├── IndicCorp_data
    │   │   ├── extract_and_combine.log
    │   │   ├── extract_and_combine.py
    │   │   ├── indiccorp_combine_stats.csv
    │   │   └── readme.md
    │   ├── IndicCorp_data_subset
    │   │   ├── combine_sources.py
    │   │   ├── indiccorp_combine_stats.csv
    │   │   └── readme.md
    │   ├── IndicCorp_data_subset_tok_norm
    │   │   ├── readme.md
    │   │   └── tokenize_and_normalize.py
    │   ├── IndicCorp_data_subset_tok_norm_romanized
    │   │   ├── create_ip_word_list.py
    │   │   ├── fairseq_postprocess.py
    │   │   ├── indic_tok.py
    │   │   ├── interactive.sh
    │   │   ├── lang_list.txt
    │   │   ├── preprocess_en.py
    │   │   ├── readme.md
    │   │   └── romanize_corpus.py
    │   └── IndicCorp_data_subset_tok_norm_romanized_100k_sample_cleaned
    │       ├── create_ip_word_list.py
    │       ├── fairseq_postprocess.py
    │       ├── indic_tok.py
    │       ├── interactive.sh
    │       ├── lang_list.txt
    │       ├── preprocess_en.py
    │       └── romanize_corpus.py
    └── README.md



<!-- Citing information -->

We would like to hear from you if:
- You are using our resources. Please let us know how you are putting these resources to use.
- You have any feedback on these resources.

<!-- License -->
### License
The IndicLID code (and models) are released under the MIT License.

<!-- Contributors -->
### Contributors
 - Yash Madhani <sub> ([AI4Bharat](https://ai4bharat.iitm.ac.in/), [IITM](https://www.iitm.ac.in)) </sub>
 - Mitesh M. Khapra <sub> ([AI4Bharat](https://ai4bharat.iitm.ac.in/), [IITM](https://www.iitm.ac.in)) </sub>
 - Anoop Kunchukuttan <sub> ([AI4Bharat](https://ai4bharat.iitm.ac.in/), [Microsoft](https://www.microsoft.com/en-in/), [IITM](https://www.iitm.ac.in)) </sub>

<!-- Contact -->
### Contact
- Yash Madhani <sub> ([AI4Bharat](https://ai4bharat.iitm.ac.in/), [IITM](https://www.iitm.ac.in)) </sub>    
- Anoop Kunchukuttan ([anoop.kunchukuttan@gmail.com](mailto:anoop.kunchukuttan@gmail.com))
- Mitesh Khapra ([miteshk@cse.iitm.ac.in](mailto:miteshk@cse.iitm.ac.in))

## Acknowledgements

We would like to thank the Ministry of Electronics and Information Technology of the Government of India for their generous grant through the Digital India Bhashini project. We also thank the Centre for Development of Advanced Computing for providing compute time on the Param Siddhi Supercomputer. We also thank Nilekani Philanthropies for their generous grant towards building datasets, models, tools and resources for Indic languages. We also thank Microsoft for their grant to support research on Indic languages. We would like to thank Jay Gala and Ishvinder Sethi for their help in coordinating the annotation work. Most importantly we would like to thank all the annotators who helped create the Bhasha-Abhijnaanam benchmark.