This repository contains the code and other resources for the paper published at ACL 2023.
Benchmark | Corpus | Models | Pretraining | Fine-tuning | Paper
The IndicXTREME benchmark includes 9 tasks that can be broadly grouped into sentence classification (5), structure prediction (2), question answering (1), and sentence retrieval (1). The list of tasks is as follows:
- Sentence classification: IndicSentiment, IndicXNLI, IndicCOPA, IndicXParaphrase, and intent classification from MASSIVE
- Structure prediction: Naamapadam (NER) and slot filling from MASSIVE
- Question answering: IndicQA
- Sentence retrieval: FLORES

The language-wise splits of IndicCorp v2, the monolingual corpus used for pretraining, can be downloaded from the links below:
Language | Download Link | Language | Download Link |
---|---|---|---|
Assamese | Download | Malayalam | Download |
Bodo | Download | Manipuri | Download |
Bengali | Download | Marathi | Download |
Dogri | Download | Nepali | Download |
English | Download | Odia | Download |
Konkani | Download | Punjabi | Download |
Gujarati | Download | Sanskrit | Download |
Hindi | Download | Santali | Download |
Khasi | Download | Sindhi | Download |
Kannada | Download | Tamil | Download |
Kashmiri | Download | Telugu | Download |
Maithili | Download | Urdu | Download |
IndicBERT v2 is a multilingual language model trained on IndicCorp v2 and evaluated on the IndicXTREME benchmark. The model has 278M parameters and supports 23 Indic languages and English. The models are trained with various objectives and datasets. The list of models is as follows:
The current BERT preprocessing code needs to run on TensorFlow v2. Create a new conda environment and set it up as follows:
conda create -n tpu_data_prep python=3.7
conda activate tpu_data_prep
pip install tokenizers transformers tqdm joblib indic-nlp-library
conda install tensorflow==2.3.0
Train a WordPiece tokenizer to preprocess the data. The following command trains a tokenizer and saves it to the specified output directory:
python IndicBERT/tokenization/build_tokenizer.py \
--input_file=$INPUT \
--output_dir=$OUTPUT \
--vocab_size=$VOCAB_SIZE

Arguments:
- $INPUT: path to the text corpus used to train the tokenizer
- $OUTPUT: directory where the trained tokenizer is saved
- $VOCAB_SIZE: size of the WordPiece vocabulary
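For illustration, a minimal invocation might look like the sketch below; the corpus path, output directory, and the 250k vocabulary size are placeholder values, not prescribed settings.

INPUT=data/indiccorp_v2.txt         # placeholder: raw text corpus
OUTPUT=wordpiece_tokenizer/         # placeholder: where the tokenizer is written
VOCAB_SIZE=250000                   # placeholder vocabulary size

python IndicBERT/tokenization/build_tokenizer.py \
    --input_file=$INPUT \
    --output_dir=$OUTPUT \
    --vocab_size=$VOCAB_SIZE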
Run the following command after updating the required paths in the script:
python IndicBERT/process_data/create_mlm_data.py \
--input_file=$INPUT \
--output_file=$OUTPUT \
--input_file_type=$DATA_TYPE \
--tokenizer=$TOKENIZER_PATH \
--max_seq_length=$MAX_SEQ_LEN \
--max_predictions_per_seq=$MAX_PRED \
--do_whole_word_mask=$WHOLE_WORD_MASK \
--masked_lm_prob=$MASK_PROB \
--random_seed=$SEED \
--dupe_factor=$DUPE_FACTOR
Arguments:
- $DATA_TYPE: monolingual or parallel
  - monolingual: if the input file is a monolingual corpus
  - parallel: if the input file is a parallel corpus, supplied as the file pair input.en and input.lang
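As an illustration, a monolingual run might look like the sketch below; all paths and numeric values are placeholders (the masking settings simply mirror common BERT defaults), not the configuration used for the released models.

# Placeholder paths: data/as.txt is a raw monolingual corpus and
# wordpiece_tokenizer/ is the tokenizer trained in the previous step.
python IndicBERT/process_data/create_mlm_data.py \
    --input_file=data/as.txt \
    --output_file=tfrecords/as.tfrecord \
    --input_file_type=monolingual \
    --tokenizer=wordpiece_tokenizer/ \
    --max_seq_length=512 \
    --max_predictions_per_seq=77 \
    --do_whole_word_mask=True \
    --masked_lm_prob=0.15 \
    --random_seed=42 \
    --dupe_factor=5

For parallel data, switch --input_file_type to parallel and provide the files named input.en and input.lang as described above.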
The BERT pretraining code is a modified version of the Google BERT repository, with NSP removed and customisations added to support parallel data. The training code needs to run on TensorFlow v1. Create a new conda environment and set it up as follows:
conda create -n bert_pretraining
conda activate bert_pretraining
conda install -c conda-forge tensorflow==1.14
Run the following command for pretraining:
python IndicBERT/train/run_pretraining.py \
--input_file=$INPUTS \
--output_dir=$OUTPUTS \
--do_train=True \
--bert_config_file=$BERT_CONFIG \
--train_batch_size=$BS \
--max_seq_length=$MAX_SEQ_LEN \
--max_predictions_per_seq=$MAX_PRED \
--num_train_steps=$TRAIN_STEPS \
--num_warmup_steps=$WARMUP \
--learning_rate=$LR \
--save_checkpoints_steps=$SAVE_EVERY \
--use_tpu=True \
--tpu_name=$TPU_NAME \
--tpu_zone=$TPU_ZONE \
--num_tpu_cores=$TPU_CORES
Note that to run the pretraining on TPUs, the input data and output directory should be on Google Cloud Storage.
Arguments:
- $BS (train_batch_size): 4096
- $MAX_SEQ_LEN (max_seq_length): should be the same as in the preprocessing step
- $MAX_PRED (max_predictions_per_seq): should be the same as in the preprocessing step
- $TRAIN_STEPS (num_train_steps): 1000000
- $WARMUP (num_warmup_steps): 10000
- $LR (learning_rate): 5e-4
- $SAVE_EVERY (save_checkpoints_steps): save a checkpoint every n steps
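Putting the recommended values together, a TPU run might look like the sketch below; the gs:// paths, BERT config file, sequence-length settings, checkpoint interval, and TPU name/zone/core count are placeholders to adapt to your own setup.

# All gs:// paths, the config file, and the TPU details below are placeholders.
python IndicBERT/train/run_pretraining.py \
    --input_file="gs://my-bucket/tfrecords/*.tfrecord" \
    --output_dir=gs://my-bucket/indicbert_pretraining \
    --do_train=True \
    --bert_config_file=configs/bert_config.json \
    --train_batch_size=4096 \
    --max_seq_length=512 \
    --max_predictions_per_seq=77 \
    --num_train_steps=1000000 \
    --num_warmup_steps=10000 \
    --learning_rate=5e-4 \
    --save_checkpoints_steps=50000 \
    --use_tpu=True \
    --tpu_name=my-tpu \
    --tpu_zone=us-central1-a \
    --num_tpu_cores=128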
Fine-tuning scripts are based on the transformers library. Create a new conda environment and set it up as follows:
conda create -n finetuning python=3.9
conda activate finetuning
pip install -r requirements.txt
All the tasks follow the same structure; please check the individual files for detailed hyper-parameter choices. The following command runs the fine-tuning for a task:
python IndicBERT/fine-tuning/$TASK_NAME/$TASK_NAME.py \
--model_name_or_path=$MODEL_NAME \
--do_train
Arguments:
- $TASK_NAME: one of [ner, paraphrase, qa, sentiment, xcopa, xnli, flores]

For the MASSIVE task, please use the instructions provided in the official repository.
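For example, a hypothetical NER run could look like this; the checkpoint name below is only an illustrative placeholder for whichever pretrained IndicBERT v2 model you want to fine-tune.

# Placeholder checkpoint name; substitute the model you want to fine-tune.
python IndicBERT/fine-tuning/ner/ner.py \
    --model_name_or_path=ai4bharat/IndicBERTv2-MLM-only \
    --do_train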
All the datasets created as part of this work will be released under a CC-0 license, and all models and code will be released under an MIT license.
@inproceedings{doddapaneni-etal-2023-towards,
title = "Towards Leaving No {I}ndic Language Behind: Building Monolingual Corpora, Benchmark and Models for {I}ndic Languages",
author = "Doddapaneni, Sumanth and
Aralikatte, Rahul and
Ramesh, Gowtham and
Goyal, Shreya and
Khapra, Mitesh M. and
Kunchukuttan, Anoop and
Kumar, Pratyush",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.693",
doi = "10.18653/v1/2023.acl-long.693",
pages = "12402--12426",
abstract = "Building Natural Language Understanding (NLU) capabilities for Indic languages, which have a collective speaker base of more than one billion speakers is absolutely crucial. In this work, we aim to improve the NLU capabilities of Indic languages by making contributions along 3 important axes (i) monolingual corpora (ii) NLU testsets (iii) multilingual LLMs focusing on Indic languages. Specifically, we curate the largest monolingual corpora, IndicCorp, with 20.9B tokens covering 24 languages from 4 language families - a 2.3x increase over prior work, while supporting 12 additional languages. Next, we create a human-supervised benchmark, IndicXTREME, consisting of nine diverse NLU tasks covering 20 languages. Across languages and tasks, IndicXTREME contains a total of 105 evaluation sets, of which 52 are new contributions to the literature. To the best of our knowledge, this is the first effort towards creating a standard benchmark for Indic languages that aims to test the multilingual zero-shot capabilities of pretrained language models. Finally, we train IndicBERT v2, a state-of-the-art model supporting all the languages. Averaged across languages and tasks, the model achieves an absolute improvement of 2 points over a strong baseline. The data and models are available at \url{https://github.com/AI4Bharat/IndicBERT}.",
}