AI4Bharat / IndicWav2Vec

Pretraining, fine-tuning and evaluation scripts for Indic-Wav2Vec2
https://indicnlp.ai4bharat.org/indicwav2vec
MIT License

IndicWav2Vec

IndicWav2Vec is a multilingual speech model pretrained on 40 Indian languages. This model covers the largest diversity of Indian languages in the pool of multilingual speech models. We fine-tune this model for downstream ASR in 9 languages and obtain state-of-the-art results on 3 public benchmarks, namely MUCS, MSR and OpenSLR.

As part of IndicWav2Vec, we create the largest publicly available corpora for 40 languages from 4 different language families. We also train state-of-the-art ASR models for 9 Indian languages.

IndicW2V

Benchmarks

We evaluate our models on 3 publicly available benchmarks: MSR (gu, ta, te), MUCS (gu, hi, mr, or, ta, te) and OpenSLR (bn, ne, si). Our results are shown below, with columns grouped in that order.

| Model | gu | ta | te | gu | hi | mr | or | ta | te | bn | ne | si |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IndicW2V | 20.5 | 22.1 | 22.9 | 26.2 | 16.0 | 19.3 | 25.6 | 27.3 | 29.3 | 16.6 | 11.9 | 24.8 |
| IndicW2V + LM | 11.7 | 13.6 | 11.0 | 17.2 | 14.7 | 13.8 | 17.2 | 25.0 | 20.5 | 13.6 | 13.6 | - |

Updates

21 June 2022

Added more documentation

Table of contents

Download Models

| Language | Acoustic Model | Dictionary | Language Model | Lexicon | Wandb |
| --- | --- | --- | --- | --- | --- |
| Bengali | fairseq / [hf]() | link | KenLM | link | [link]() |
| Gujarati | fairseq / [hf]() | link | KenLM | link | [link]() |
| Hindi | fairseq / [hf]() | link | KenLM | link | [link]() |
| Marathi | fairseq / [hf]() | link | KenLM | link | [link]() |
| Nepali | fairseq / [hf]() | link | KenLM | link | [link]() |
| Odia | fairseq / [hf]() | link | KenLM | link | [link]() |
| Tamil | fairseq / [hf]() | link | KenLM | link | [link]() |
| Telugu | fairseq / [hf]() | link | KenLM | link | [link]() |
| Sinhala | fairseq / [hf]() | link | [KenLM]() | [link]() | [link]() |
| Kannada (KB) | fairseq / [hf]() | link | KenLM | link | [link]() |
| Malayalam (KB) | fairseq / [hf]() | link | KenLM | link | [link]() |

| Pretrained Model(*) Name | Model Checkpoint |
| --- | --- |
| IndicWav2Vec Large | fairseq |
| IndicWav2Vec Base | fairseq |

(* Trained on 40 Indian Languages, more details can be found here)

Hosted API Usage

Our models are hosted at the following API endpoints.

| Language | Language Code | API Endpoint |
| --- | --- | --- |
| Bengali | bn | [coming soon - will be back shortly]() |
| Gujarati | gu | [coming soon - will be back shortly]() |
| Hindi | hi | https://ai4b-dev-asr.ulcacontrib.org/asr/v1/recognize/hi |
| Marathi | mr | https://ai4b-dev-asr.ulcacontrib.org/asr/v1/recognize/mr |
| Nepali | ne | [coming soon - will be back shortly]() |
| Odia | or | [coming soon - will be back shortly]() |
| Tamil | ta | https://ai4b-dev-asr.ulcacontrib.org/asr/v1/recognize/ta |
| Telugu | te | https://ai4b-dev-asr.ulcacontrib.org/asr/v1/recognize/te |
| Sinhala | si | [coming soon - will be back shortly]() |

Input API data format

{
    "config": {
        "language":{
          "sourceLanguage": "#Language Code"
        },
        "transcriptionFormat": {"value":"transcript"},
        "audioFormat": "wav"
    },
    "audio": [{
        "audioContent": "#BASE64 Encoded String"
    }]
}

OR

{
    "config": {
        "language":{
          "sourceLanguage": "#Language Code"
        },
        "transcriptionFormat": {"value":"transcript"},
        "audioFormat": "wav"
    },
    "audio": [{
        "audioUri": "#HTTP/GS path to file"
    }]
}

Output

{
    "output": [
        {
            "source": "सेकेंड स्टेप इस देसी है स्पेसिफाइड फॉरेस्ट राइट"
        }
    ],
    "status": "SUCCESS"
}
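As a sketch, the request and response formats above can be exercised from Python with only the standard library. The endpoint URL and JSON field names follow the formats documented above; the `build_payload` and `transcribe` helpers themselves are illustrative, not part of the API.

```python
import base64
import json
import urllib.request

def build_payload(audio_b64, lang_code):
    # Request body matching the "Input API data format" above
    return {
        "config": {
            "language": {"sourceLanguage": lang_code},
            "transcriptionFormat": {"value": "transcript"},
            "audioFormat": "wav",
        },
        "audio": [{"audioContent": audio_b64}],
    }

def transcribe(wav_path, lang_code):
    """POST a local WAV file to the hosted endpoint and return the transcript."""
    with open(wav_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    url = f"https://ai4b-dev-asr.ulcacontrib.org/asr/v1/recognize/{lang_code}"
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(audio_b64, lang_code)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # Per the "Output" format above, the transcript is output[0].source
    return body["output"][0]["source"]
```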

Accessing on ULCA

Our models can be directly accessed on ULCA by going into the ASR section and filtering models by IndicWav2Vec.

App Screenshot

Quick start

Python Inference

Huggingface Inference

Tutorials

Setting up your environment

Pretraining

Data preparation

Alternatively, users can run the single script process_data.sh to execute the entire pipeline.

Manifest Creation

To create the language-wise pretraining manifests:

python path/to/lang_wise_manifest_creation.py /path/to/wave/files --dest /manifest/path --ext $ext --valid-percent $valid

For /path/to/wav/files/, we expect one folder per language under the parent directory.

In our pretraining, we use a --valid-percent of 0.03.

For creating a combined validation file for all languages, we concatenate all individual *_valid.tsv files to create a valid.tsv file.

import glob

import pandas as pd

# Gather all per-language validation manifests
filenames = glob.glob("*_valid.tsv")

combined = []
for f in filenames:
    # Skip the first line of each manifest (the per-language root directory)
    df = pd.read_csv(f, skiprows=1, names=['f', 'd'], sep='\t')
    combined.append(df)

# Write the combined manifest as plain "file<TAB>frames" rows,
# without an index column or header
df_combined = pd.concat(combined, axis=0, ignore_index=True)
df_combined.to_csv('valid.tsv', index=False, header=False, sep='\t')

We then add /path/to/wav/files/ as the first line of the valid.tsv file.
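That last step can be done with a small helper like the one below; `prepend_root` is illustrative, and the root path is a placeholder for your actual wav directory.

```python
def prepend_root(manifest_path, root):
    """Write the wav root directory as the first line of a fairseq manifest,
    keeping the existing file/frame-count rows below it."""
    with open(manifest_path) as f:
        body = f.read()
    with open(manifest_path, "w") as f:
        f.write(root + "\n" + body)

# Example: prepend_root("valid.tsv", "/path/to/wav/files/")
```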

Training procedure and code

For pretraining, we do multi-node training and schedule the runs with SLURM.

Following is the invocation script for training IndicWav2Vec Base, starting from the Wav2Vec 2.0 English Base checkpoint:

fairseq-hydra-train \
  task.data=/path/to/manifest/directory \
  common.wandb_project=<wandb project name> \
  task._name=temp_sampled_audio_pretraining \
  +task.sampling_alpha=0.7 \
  common.log_interval=200 \
  common.log_format=tqdm \
  dataset.max_tokens=3000000 \
  common.user_dir=/path/to/custom_task/directory \
  checkpoint.save_dir=/path/to/save/model/checkpoints \
  checkpoint.restore_file=/path/to/wav2vec2-english-base/checkpoint.pt \
  +optimization.update_freq='[2]' \
  optimization.clip_norm=0.5 \
  checkpoint.reset_optimizer=true \
  distributed_training.distributed_world_size=<total GPUs> \
  distributed_training.distributed_port=$PORT \
  --config-dir /path/to/configs/directory \
  --config-name wav2vec2_base_librispeech

For the Large model, we override the above configuration with:

  checkpoint.restore_file=/path/to/wav2vec2-english-large/checkpoint.pt \
  +optimization.update_freq='[6]' \
  lr_scheduler.warmup_updates=0 \
  --config-name wav2vec2_large_librivox

Configs for both models are provided in the configs directory.
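The `+task.sampling_alpha=0.7` override controls temperature-based sampling across languages during pretraining. Assuming the standard formulation, where a language holding n_l of the N total hours is sampled with probability proportional to (n_l / N)^alpha, a sketch of the resulting weights (`sampling_probs` is illustrative, not a function from the codebase):

```python
def sampling_probs(sizes, alpha=0.7):
    """Temperature-based sampling weights: p_l proportional to (n_l / N) ** alpha.

    alpha = 1 keeps the natural data distribution; alpha < 1 upsamples
    low-resource languages relative to their share of the audio."""
    total = float(sum(sizes))
    weights = [(n / total) ** alpha for n in sizes]
    z = sum(weights)
    return [w / z for w in weights]
```

For example, with alpha=0.7 a language holding 1% of the audio is sampled noticeably more often than 1% of the time, which helps the low-resource languages in the 40-language pool.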

Finetuning

Data preparation

Finetuning procedure and code

Following is the invocation script for finetuning IndicWav2Vec Large on a particular language:

fairseq-hydra-train \
  task.data=/path/to/finetune/manifest/directory/for/a/particular/language \
  common.wandb_project=<wandb project name> \
  model.w2v_path=/path/to/pretrained/model_large.pt \
  common.log_interval=50 \
  common.log_format=tqdm \
  dataset.max_tokens=1000000 \
  checkpoint.save_dir=/path/to/save/model/fine_tune_checkpoints \
  +optimization.update_freq='[1]' \
  distributed_training.distributed_world_size=<total GPUs> \
  --config-dir /path/to/configs/directory \
  --config-name ai4b_xlsr

For the IndicWav2Vec Base model, we override the above configuration with:

  model.w2v_path=/path/to/pretrained/model_base.pt \
  --config-name ai4b_base

Configs for both models are provided in the [finetune_configs]() directory.


Language Modelling (LM)

We train 6-gram statistical LMs using the KenLM library.

Data preparation

Output will be generated at: "<lm directory path>/<lang>".

Evaluating ASR models

Model exporting

Deployment

Cite

Please cite our work as:

@inproceedings{javed2021building,
    title = {Towards Building ASR Systems for the Next Billion Users},
    author = {Tahir Javed and Sumanth Doddapaneni and Abhigyan Raman and Kaushal Santosh Bhogale and Gowtham Ramesh and Anoop Kunchukuttan and Pratyush Kumar and Mitesh M. Khapra},
    booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
    year = {2022 (to appear)},
}

License

IndicWav2Vec is MIT-licensed. The license applies to all pretrained, fine-tuned, and language models.

Contributors

Contact

Acknowledgements

We would like to thank the EkStep Foundation for their generous grant, which helped set up the Centre for AI4Bharat at IIT Madras to support our students, research staff, data and computational requirements. We would like to thank the Ministry of Electronics and Information Technology (NLTM) for its grant to support the creation of datasets and models for Indian languages under its ambitious Bhashini project. We would also like to thank the Centre for Development of Advanced Computing, India (C-DAC) for providing access to the Param Siddhi supercomputer for training our models. Lastly, we would like to thank Microsoft for its grant to create datasets, tools and resources for Indian languages.