:bookmark: The Indic NLP Catalog
A Collaborative Catalog of Resources for Indic Language NLP
The Indic NLP Catalog repository is an attempt to collaboratively build the most comprehensive catalog of NLP datasets, models and other resources for all languages of the Indian subcontinent.
Please suggest any other resources you may be aware of. Raise a pull request or an issue to add more resources to the catalog. Put the proposed entry in the following format:
[Wikipedia Dumps](https://dumps.wikimedia.org/)
Add a small, informative description of the dataset and provide links to any paper/article/site documenting the resource. Mention your name too. We would like to acknowledge your contribution to building this catalog in the CONTRIBUTORS list.
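As a rough illustration of the entry format described above, the following sketch renders a Markdown link followed by an optional description. The helper name `format_entry` and the second example entry are hypothetical and are not part of the catalog's tooling.

```python
def format_entry(name: str, url: str, description: str = "") -> str:
    """Render one proposed catalog entry as a Markdown link plus description."""
    entry = f"[{name}]({url})"
    if description.strip():
        # The description follows the link after a colon, as in the catalog lists.
        entry += f": {description.strip()}"
    return entry


# The example entry given in this README.
print(format_entry("Wikipedia Dumps", "https://dumps.wikimedia.org/"))

# A made-up entry, only to show where the description goes.
print(format_entry(
    "Example Corpus",
    "https://example.org/corpus",
    "Hypothetical monolingual corpus used here for illustration.",
))
```

Prefix the result with `- ` when adding it to one of the lists in this catalog.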
:+1: Featured Resources
Indian language NLP has come a long way. We feature a few resources that are illustrative of the trends in recent times along various axes and point to a bright future.
- Universal Language Contribution API (ULCA): ULCA is a standard API and open scalable data platform (supporting various types of datasets) for Indian language datasets and models. ULCA is part of the Bhashini mission. You can upload and discover models, datasets and benchmarks here. This is one repository we really need, and we hope to see it evolve into a standard, large-scale platform for resource discovery and dissemination.
- We are seeing the rise of large-scale datasets across many tasks like IndicCorp (text corpus/9 billion tokens), Samanantar (parallel corpus/50 million sentence pairs), Naamapadam (named entity/5.7 million sentences), HiNER (named entity/100k sentences), Aksharantar (transliteration/26 million pairs), etc. These are being built using large-scale mining of web resources, large human annotation efforts, or both.
- As we aim higher, datasets and models are achieving higher language coverage. Earlier datasets were available for only a handful of Indian languages, then for 10-12 languages. We are now reaching the next frontier with resources like Aksharantar (transliteration/21 languages), FLORES-200 (translation/27 languages), IndoWordNet (wordnet/18 languages) spanning almost all languages listed in the Indian constitution and more.
- Particularly, we are seeing datasets getting created for extremely low-resourced languages or languages not yet covered in any dataset like Bodo, Kangri, Khasi, etc.
- From a handful of institutes who pioneered the development of NLP in India, we now have an increasing number of institutes/interest groups and passionate volunteers like AI4Bharat, BUET CSE NLP, KMI, L3Cube, iNLTK, IIT Patna, etc. who are contributing to building resources for Indian languages.
Browse the entire catalog...
:raising_hand: Note: Many known resources have not yet been classified into the catalog. They can be found as open issues in the repo.
Major Indic Language NLP Repositories
Libraries and Tools
- Indic NLP Library: Python Library for various Indian language NLP tasks like tokenization, sentence splitting, normalization, script conversion, transliteration, etc
- pyiwn: Python Interface to IndoWordNet
- Indic-OCR : OCR for Indic Scripts
- CLTK: Toolkit for many of the world's classical languages. Support for Sanskrit. Some parts of the Sanskrit library are forked from the Indic NLP Library.
- iNLTK: iNLTK aims to provide out of the box support for various NLP tasks that an application developer might need for Indic languages.
- Sanskrit Coders Indic Transliteration: Script conversion and romanization for Indian languages.
- Smart Sanskrit Annotator: Annotation tool for Sanskrit (paper)
- BNLP: Bengali language processing toolkit with tokenization, embedding, POS tagging, NER support
- CodeSwitch: Language identification, POS Tagging, NER, sentiment analysis support for code mixed data including Hindi and Nepali language
- IndIE: An Open Information Extraction tool (triple extractor) in Hindi. It is conjectured to work for Tamil, Telugu, and Urdu as well.
- Hindi-BenchIE: A triple evaluation tool for 112 Hindi sentences.
Evaluation Benchmarks
Benchmarks spanning multiple tasks.
- AI4Bharat IndicGLUE: NLU benchmark for 11 languages.
- AI4Bharat IndicNLG Suite: NLG benchmark for 11 languages spanning 5 generation tasks: biography generation, sentence summarization, headline generation, paraphrase generation and question generation.
- GLUECoS: Hindi-English code-mixed benchmark containing the following tasks - Language Identification (LID), POS Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis (SA), Question Answering (QA), Natural Language Inference (NLI).
- AI4Bharat Text Classification: A compilation of classification datasets for 10 languages.
- WAT 2021 Translation Dataset: Standard train and test sets for translation between English and 10 Indian languages.
Standards
- Unicode Standard for Indic Scripts
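The Unicode blocks for the major Indic scripts share a common layout inherited from ISCII, so a character at a given offset in one block usually corresponds to the same character at the same offset in another. The sketch below uses that property for a naive script conversion. It is an illustration only, not how any library listed above is implemented; the function name `convert_script` and the script labels are assumptions. Some offsets are unassigned in some scripts (for example, Tamil has fewer consonants), so real tools add script-specific handling.

```python
# Start code points of some Indic Unicode blocks; each block spans 128 code points.
SCRIPT_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80

# Danda and double danda live in the Devanagari block but are shared across
# scripts, so they are left untouched.
SHARED_PUNCTUATION = {0x0964, 0x0965}


def convert_script(text: str, src: str, tgt: str) -> str:
    """Shift characters from the src script block to the same offset in tgt."""
    src_base, tgt_base = SCRIPT_BASE[src], SCRIPT_BASE[tgt]
    converted = []
    for ch in text:
        code = ord(ch)
        offset = code - src_base
        if code not in SHARED_PUNCTUATION and 0 <= offset < BLOCK_SIZE:
            converted.append(chr(tgt_base + offset))
        else:
            # Characters outside the source block (spaces, Latin, digits) pass through.
            converted.append(ch)
    return "".join(converted)


# Devanagari "कमल" mapped to the Bengali block.
print(convert_script("कमल", "devanagari", "bengali"))
```

The offset trick does not check whether the target code point is assigned, so output for scripts with gaps should be validated before use.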
Text Corpora
Monolingual Corpus
- AI4Bharat IndicCorp: Text corpora for Indian languages
- v1: contains 8.9 billion tokens from 12 Indian languages (including Indian English). [paper]
- v2: contains 20 billion tokens from 22 Indian languages (including Indian English). [paper]
- Wikipedia Dumps
- Common Crawl
- OSCAR Corpus: Released in 2019, large-scale processed CommonCrawl.
- WMT Common Crawl Dumps: Crawls between 2012 and 2016. Noisy text, needs to be filtered.
- CC-100 Corpus: Facebook CommonCrawl extracted data. They provide scripts for processing CommonCrawl. StatMT has built a replica of the CC-100 corpus using these scripts. You can find it HERE. This corpus also has romanized corpora for some Indian languages.
- WMT NEWS Crawl
- LDCIL Monolingual Corpus
- Charles University Hindi Monolingual Corpus
- Charles University Urdu Monolingual Corpus
- IIT Bombay Hindi Monolingual Corpus
- EMILLE Corpus (multiple Indian languages)
- Janmabhumi Malayalam Corpus
- Leipzig Corpus
- Sanskrit Monolingual and Sandhi-split Corpus
- Lot Of Indic Tweets Corpus: Large Twitter datasets for Telugu (7.9 million) and Hindi (17.6 million), with fastText skip-gram and CBOW word vectors for the same.
- CMU Romanized Hinglish Corpus: See THIS PAPER for details.
- JNU-BHLTR Bhojpuri Corpus: Bhojpuri corpus of 45k sentences.
- KMI Magahi Corpus:
- KMI Awadhi Corpus:
- KMI Linguistics Bodo: Contains the Bodo corpus and the frequency-ordered word and punctuation list.
- SMC Malayalam text corpus
- DNLP-Tel Telugu Corpus: Telugu corpus of 280M tokens and 23M sentences along with skip-gram model trained with word2vec.
- Ema-lon Manipuri Corpus: The first comparable corpus built for the Manipuri (mni)-English (eng) language pair, with the monolingual data comprising 1,034,715 Manipuri sentences and 846,796 English sentences in version 1 and 1,880,035 Manipuri sentences and 1,450,053 English sentences in version 2.
- SinMin Corpus: Contains texts of different genres and styles of the modern and old Sinhala language.
- Kangri_corpus: Monolingual corpus of Kangri, a low-resource endangered Himachali language, comprising 1,81,552 sentences. Described in this paper.
- Sanskrit-Hindi-MT: The Sanskrit Monolingual Data is available here.
- FacebookDecadeCorpora: Contains two language corpora of colloquial Sinhala content extracted from Facebook using the Crowdtangle platform. The larger corpus contains 28,825,820 to 29,549,672 words of text, mostly in Sinhala, English and Tamil and the smaller corpus amounts to 5,402,76 words of only Sinhala text extracted from Corpus-Alpha. Described in this paper.
- Nepali National corpus: The Nepali Monolingual written corpus comprises the core corpus containing 802,000 words and the general corpus containing 1,400,000 words. Described here.
Language Identification
Lexical Resources and Semantic Similarity
NER Corpora
- FIRE 2013 AUKBC NER Corpus
- FIRE 2014 AUKBC NER Corpus
- IIT Bombay Marathi NER Corpus
- WikiAnn NER Corpus (Noisy) DOWNLOAD (Old broken LINK)
- IJCNLP 2008 NER Corpus: NER corpora for hi, bn, or, te, ur.
- a-mma NER data
- AI4Bharat Naamapadam: NER dataset for 11 Indic languages.
- AsNER: A named entity annotation dataset for low resource Assamese language containing 99k tokens.
- L3Cube-MahaNER: The first major gold standard named entity recognition dataset in Marathi consisting of 25,000 sentences in Marathi language. Described in this paper.
- CFILT HiNER: A large Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens. Described in this paper.
- MultiCoNER: A multilingual complex Named Entity Recognition dataset composed of 2.3 million instances for 11 languages (including datasets for the Indic languages Hindi and Bangla) representing three domains (wiki sentences, questions, and search queries) plus multilingual and code-mixed subsets. The NER tag-set consists of six classes: PER, LOC, CORP, GRP, PROD and CW. Described in this paper.
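NER corpora such as those listed above are commonly distributed as token sequences with one tag per token. The sketch below shows how entity spans might be recovered from such tags, assuming a BIO scheme (`B-` begins an entity, `I-` continues it, `O` is outside). The tagging scheme, the function name `extract_entities`, and the sample sentence are assumptions for illustration; check each dataset's documentation for its actual format.

```python
def extract_entities(tokens, tags):
    """Collect (entity_type, text) spans from parallel token and BIO tag lists."""
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Store the span collected so far, if any.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" tags, and I- tags that do not continue the open span, end it.
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities


tokens = ["sachin", "tendulkar", "was", "born", "in", "mumbai"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(extract_entities(tokens, tags))
```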
Parallel Translation Corpus
- BPCC Parallel Corpus: Largest parallel corpus for English and 22 Indian languages (as of Jan 2024). It comprises 230 million sentence pairs between English-Indian languages. A subset of this corpus is the BPCC-Human Corpus containing 2.2 million English-Indic pairs for 22 Indic languages.
- Samanantar Parallel Corpus: Largest parallel corpus for English and 11 Indian languages (as of 2021). It comprises 46m sentence pairs between English-Indian languages and 82m sentence pairs between Indian languages.
- FLORES-101: Human translated evaluation sets for 101 languages released by Facebook. It includes 14 Indic languages. The testsets are n-way parallel.
- FLORES-200: Human translated evaluation sets for 200 languages released by Facebook. It includes 24 Indic languages. The testsets are n-way parallel.
- IIT Bombay English-Hindi Parallel Corpus: Largest en-hi parallel corpora in public domain (about 1.5 million segments)
- CVIT-IIITH PIB Multilingual Corpus: Mined from Press Information Bureau for many Indian languages. Contains both English-IL and IL-IL corpora (IL=Indian language).
- CVIT-IIITH Mann ki Baat Corpus: Mined from Indian PM Narendra Modi's Mann ki Baat speeches.
- PMIndia: Parallel corpus for En-Indian languages mined from the website of the Prime Minister of India (paper).
- OPUS corpus
- WAT 2018 Parallel Corpus: There may be significant overlap between WAT and OPUS.
- Charles University Parallel Corpora Collection
- Indian Language Corpora Initiative: Available on TDIL portal on request
- IndoWordnet Parallel Corpus: Parallel corpora mined from IndoWordNet gloss and/or examples for Indian-Indian language corpora (6.3 million segments, 18 languages).
- MTurk Indian Parallel Corpus
- TED Parallel Corpus
- JW300 Corpus: Parallel corpus mined from jw.org. Religious text from Jehovah's Witness.
- ALT Parallel Corpus: 10k sentences for Bengali, Hindi in parallel with English and many East Asian languages.
- FLORES dataset: English-Sinhala and English-Nepali corpora
- Uka Tarsadia University Corpus: 65k English-Gujarati sentence pairs. Corpus is described in this paper
- NLPC-UoM English-Tamil Corpus: 9k sentences, 24k glossary terms
- Wikititles: from statmt
- JNU-BHLTR Bhojpuri Corpus: English-Bhojpuri corpus of 65k sentences
- EILMT Corpus
- QED Corpus: English-Hindi corpus of 43k sentences from the educational domain.
- WikiMatrix Corpus: Mined from Wikipedia, looks noisy.
- CCMatrix: Parallel corpus mined from CommonCrawl, looks noisy (statmt repo).
- CGNetSwara: Hindi-Gondi parallel corpus (19k sentence pairs)
- MTEnglish2Odia: English-Odia (42k pairs)
- SAP Software Documentation: test and evaluation set for English-Hindi in the software documentation domain [paper]
- BUET English-Bangla Corpus, EMNLP-2020: 2.7M sentences (has overlaps with OPUS)
- CLE Parallel Corpus: Parallel corpus for English, Urdu and Nepali.
- Itihasa Parallel Corpus: 93k parallel sentences between English and Sanskrit from the Ramayana and Mahabharata.
- Ema-lon Manipuri Corpus: The first comparable corpus built for the Manipuri (mni)-English (eng) language pair, with parallel data comprising 124,975 Manipuri-English aligned sentences.
- PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus consisting of 13,738 code-mixed English-Hindi sentences and their corresponding translation in English. Described in this paper.
- IIIT-H en-hi-codemixed-corpus: A gold standard parallel corpus consisting of 6096 English-Hindi code-mixed sentences containing a total of 63,913 tokens and monolingual English. Described in this paper.
- CALCS 2021 Eng-Hinglish dataset: Eng-Hinglish parallel corpus containing 10k pairs of sentences. Described in this paper.
- Kangri_corpus: The corpus contains 27,362 Hindi-Kangri parallel corpora. Described in [this paper](https://arxiv.org/abs/2103.11596).
- NLLB-Seed: Small human-translated parallel corpora from Wikipedia articles for very low resource languages. Includes 5 Indian languages: Kashmiri, Manipuri, Maithili, Bhojpuri, Chhattisgarhi.
- NLLB-MD: NLLB Multi Domain is a set of professionally-translated sentences in News, Unscripted informal speech, and Health domains. Covers Bhojpuri amongst Indian languages.
- NLLB-Mined: All the parallel corpora mined by the NLLB project. This repository was reconstructed by AllenAI based on metadata released by the NLLB Project.
- Sanskrit-Hindi-MT: Machine Translation from Sanskrit to Hindi using Unsupervised and Supervised Learning. Contains Sanskrit-English parallel data and Sanskrit-Hindi parallel (test) data.
- Nepali National corpus: The English-Nepali Parallel Corpus consists of a small set of data aligned at the sentence level with 27,060 English words and 21,756 Nepali words and a larger set of texts at the document level with 617,340 English words and 596,571 Nepali words. An additional set of monolingual data is also provided with 386,879 words in Nepali. Described here.
- Kathmandu University-English–Nepali Parallel Corpus: A parallel corpus of size 1.8 million sentence pairs for a low resource language pair Nepali–English. Described in this paper.
- CCAligned: A Massive Collection of more than 100 million cross-lingual web-document pairs in 137 languages aligned with English.
- CoPara: Long-context parallel corpora for 4 Dravidian languages. Contains 2586 passage pairs mined from New India Samachar [paper]
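Several test sets above, such as FLORES-101 and FLORES-200, are described as n-way parallel: the same sentences are translated into every language, so line i in one language aligns with line i in every other. The sketch below shows how a bitext for any language pair might be derived from such data. The toy sentences, language codes, and the function name `bitext` are illustrative assumptions, not the actual file layout of any listed corpus.

```python
from itertools import permutations


def bitext(nway, src, tgt):
    """Pair up aligned sentences for one translation direction."""
    if len(nway[src]) != len(nway[tgt]):
        raise ValueError("n-way parallel data must have equal line counts")
    return list(zip(nway[src], nway[tgt]))


# Toy n-way parallel data: index i holds the same sentence in each language.
nway = {
    "en": ["Hello.", "Thank you."],
    "hi": ["नमस्ते।", "धन्यवाद।"],
    "ta": ["வணக்கம்.", "நன்றி."],
}

# Every ordered pair of languages is a usable evaluation direction.
directions = list(permutations(nway, 2))
print(len(directions))

# A non-English direction comes for free, without pivoting through English.
print(bitext(nway, "hi", "ta"))
```

This is why n-way parallel test sets are convenient for evaluating translation between Indian languages directly.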
MT Evaluation
- WMT23 QE task: QE datasets for 5 Indian languages in En to Indic directions (mr, hi, gu, ta, te) with DA annotations. The references are also available, so these can also be used for reference-based metrics. For Marathi, post-edits and word-level error annotations are also available. 26k training sentences for Marathi, 7k for the others. report
- AI4Bharat IndicMT-Eval: MT evaluation datasets for 5 Indian languages in En to Indic directions (mr, hi, gu, ta, ml) with Multidimensional Quality Metric (MQM) annotations. 1400 sentence annotations per language (200 sentences and outputs from 7 MT systems).
Parallel Transliteration Corpus
Text Classification
Textual Entailment/Natural Language Inference
Paraphrase
Sentiment, Sarcasm, Emotion Analysis
Hate Speech and Offensive Comments
- Hate Speech and Offensive Content Identification in Indo-European Languages: (HASOC FIRE-2020)
- An Indian Language Social Media Collection for Hate and Offensive Speech, 2020: Hinglish Tweets and FB Comments collected during Parliamentary Election 2019 of India (Dataset available on request)
- Aggression-annotated Corpus of Hindi-English Code-mixed Data, 2018: Scraped from Facebook (21k) & Twitter (18k) (Paper)
- Did You Offend Me? Classification of Offensive Tweets in Hinglish Language, 2018: 3k tweets (Paper)
- A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection, 2018: 4.5k Tweets (Paper)
- Roman Urdu Offensive Language Detection, 2020: 10k tweets, can also be used for Hindi (Paper)
- Bengali Hate Speech - Classification Benchmark, 2020: 1.5k sentences
- Offensive Language Identification in Dravidian Languages, EACL 2021: Tamil, Malayalam, Kannada
- Fear Speech in Indian WhatsApp Groups, 2021
- HateCheckHIn: An evaluation dataset for Hindi Hate Speech Detection Models having a total of 34 functionalities out of which 28 functionalities are monolingual and the remaining 6 are multilingual. Hindi is used as the base language. Described in this paper.
Question Answering
- Facebook Multilingual QA datasets: Contains dev and test sets for Hindi.
- TyDi QA datasets: QA dataset for Bengali and Telugu.
- bAbi 1.2 dataset: Has Hindi version of bAbi tasks in romanized Hindi.
- MMQA dataset: Hindi QA dataset described in this paper
- XQuAD: testset for Hindi QA from human translation of subset of SQuAD v1.1. Described in this paper
- XQA: testset for Tamil QA. Described in this paper
- HindiRC: A Dataset for Reading Comprehension in Hindi containing 127 questions and 24 passages. Described in this paper
- IITH HiDG: A Distractor Generation Dataset for Hindi consisting of 1k/1k/5k (train/validation/test) split. Described in this paper
- Chaii: A Kaggle challenge which consists of 1104 questions in Hindi and Tamil. Moreover, here is a good collection of papers on multilingual Question Answering.
- csebuetnlp Bangla QA: A Question Answering (QA) dataset for Bengali. Described in this paper.
- XOR QA: A large-scale cross-lingual open-retrieval QA dataset (includes Bengali and Telugu) with 40k newly annotated open-retrieval questions that cover seven typologically diverse languages. Described in this paper. More information is available here.
- IITB HiQuAD: A question answering dataset in Hindi consisting of 6555 question-answer pairs. Described in this paper.
Dialog
Discourse
Information Extraction
- EventXtract-IL: Event extraction for Tamil and Hindi. Described in this paper.
- [EDNIL-FIRE2020](https://ednilfire.github.io/ednil/2020/index.html): Event extraction for Tamil, Hindi, Bengali, Marathi, English. Described in this paper.
- Amazon MASSIVE: A Multilingual Amazon SLURP (SLU resource package) for Slot Filling, Intent Classification, and Virtual-Assistant Evaluation containing one million realistic, parallel, labeled virtual-assistant text utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. Described in this paper.
- Facebook - MTOP Benchmark: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark with a dataset comprising 100k annotated utterances in 6 languages (including the Indic language Hindi) across 11 domains. Described in this paper.
POS Tagged corpus
- Indian Language Corpora Initiative
- Universal Dependencies
- IIITH Paninian Treebank: POS annotations for hi, bn, kn, ml and mr.
- Code Mixed Dataset for Hindi, Bengali and Telugu, ICON 2016 shared task
- JNU-BHLTR Bhojpuri Corpus: Bhojpuri corpus of 5000 sentences.
- KMI Magahi Corpus:
- KMI Awadhi Corpus:
- Tham Khasi Corpus: An annotated Khasi POS tagged corpus containing 83,312 words, 4,386 sentences, 5,465 word types which amounts to 94,651 tokens (including punctuations).
Chunk Corpus
Dependency Parse Corpus
Coreference Corpus
Summarization
- XL-Sum: A large-scale multilingual abstractive summarization dataset for 44 languages, comprising 1 million professionally annotated article-summary pairs from BBC. Includes 150k examples across 10 Indic languages. Described in this paper.
- TeSum: Telugu Abstractive Summarization dataset containing 20k+ article-summary pairs, with the summaries being manually created. [paper]
- WikiLingua: Cross-lingual summarization dataset created from WikiHow. Contains 9k English-Hindi article-summary pairs. [paper]
- MassiveSum: A large summarization dataset containing 13 Indian languages with ~1.9 million article-summary pairs. The summaries are mined from article metadata. [paper]
Data to Text
- XAlign: Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages, comprising a high quality XF2T dataset in 7 languages (Hindi, Marathi, Gujarati, Telugu, Tamil, Kannada, Bengali) and a monolingual dataset in English. The dataset is available upon request. Described in this paper.
Models
Language Identification
- NLLB-200: LID for 200 languages including 27 Indic languages.
Word Embeddings
Pre-trained Language Models
- AI4Bharat IndicBERT: Multilingual ALBERT based embeddings spanning 12 languages for Natural Language Understanding (including Indian English).
- AI4Bharat IndicBART: A multilingual, sequence-to-sequence pre-trained model based on the mBART architecture focusing on 11 Indic languages and English for Natural Language Generation of Indic Languages. Described in this paper.
- MuRIL: Multilingual mBERT based embeddings spanning 17 languages and their transliterated counterparts for Natural Language Understanding (paper).
- BERT Multilingual: BERT model trained on Wikipedias of many languages (including major Indic languages).
- mBART50: seq2seq pre-trained model trained on CommonCrawl of many languages (including major Indic languages).
- BLOOM: GPT-3-like multilingual transformer-decoder language model (includes major Indic languages).
- iNLTK: ULMFit and TransformerXL pre-trained embeddings for many languages trained on Wikipedia and some News articles.
- albert-base-sanskrit: ALBERT-based model trained on Sanskrit Wikipedia.
- RoBERTa-hindi-guj-san: Multilingual RoBERTa like model trained on Hindi, Sanskrit and Gujarati.
- Bangla-BERT-Base: Bengali BERT model trained on Bengali wikipedia and OSCAR datasets.
- BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla. Described in this paper.
- EM-ALBERT: The first ALBERT model available for Manipuri language which is trained on 1,034,715 Manipuri sentences.
- LaBSE: Encoder models suitable for sentence retrieval tasks supporting 109 languages (including all major Indic languages) [paper].
- LASER3: Encoder models suitable for sentence retrieval tasks supporting 200 languages (including 27 Indic languages).
Multilingual Word Embeddings
Morphanalyzers
Translation Models
- IndicTrans: Multilingual neural translation models for translation between English and 11 Indian languages. Supports translation between Indian languages as well. A total of 110 translation directions are supported.
- Shata-Anuvaadak: SMT for 110 language pairs (all pairs between English and 10 Indian languages).
- LTRC Vanee: Dependency based Statistical MT system from English to Hindi.
- NLLB-200: Models for 200 languages including 27 Indic languages.
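The figure of 110 language pairs quoted for Shata-Anuvaadak is consistent with counting every ordered (source, target) pair among 11 languages, that is English plus 10 Indian languages: 11 × 10 = 110. The sketch below checks that arithmetic; the language codes are placeholders, not the system's actual language list.

```python
from itertools import permutations


def directed_pairs(languages):
    """Return every ordered (source, target) pair of distinct languages."""
    return list(permutations(languages, 2))


# English plus ten placeholder codes standing in for the Indian languages.
languages = ["en"] + [f"in{i:02d}" for i in range(1, 11)]

pairs = directed_pairs(languages)
print(len(pairs))  # 110, i.e. n * (n - 1) for n = 11
```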
Transliteration Models
- AI4Bharat IndicXlit: A transformer-based multilingual transliteration model with 11M parameters for Roman to native script conversion and vice versa that supports 21 Indic languages. Described in this paper.
Speech Models
NER
Speech Corpora
- Microsoft Speech Corpus: Speech corpus for Telugu, Tamil and Gujarati.
- Microsoft-IITB Marathi Speech Corpus: 109 hours of speech data collected via crowdsourcing.
- AccentDB: Database of Indian English accents from native speakers in Bangla, Malayalam, Telugu and Oriya.
- IIT Madras TTS database
- BABEL Speech Corpus: includes some Indian languages
- WikiPron: Words and their pronunciations in IPA mined from Wiktionary. Includes Indian languages. paper
- CVIT IndicSpeech: TTS data for 3 Indian languages: Malayalam, Bengali and Hindi (24 hours each).
- Google Speech Corpus: TTS data for 6 Indian languages: Malayalam, Marathi, Telugu, Kannada, Gujarati, Tamil (up to 9 hours each). Resources SLR#63-#66, #78-#79. (paper)
- CoVoST 2: Tamil 2 hrs data
- SMC Malayalam Speech Corpus - Download link
- Vāksañcayaḥ Sanskrit Speech Corpus: 78 hours of speech corpus in Sanskrit prose, with speaker-disjoint train, dev and test splits. It also contains additional out-of-domain test data with speakers having pronunciation influences from L1 (paper).
- IISc-MILE Kannada ASR Corpus: Transcribed speech corpus containing ~350 hours of read speech data for training ASR systems for Kannada language. Described in this paper.
- IISc-MILE Tamil ASR Corpus: Transcribed speech corpus containing ~150 hours of read speech data for training ASR systems for Tamil language. Described in this paper.
- MUCS 2021 Dataset: (Gujarati, Hindi, Marathi, Odia, Tamil, Telugu) Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages
- Gramvaani: 100 hours of labelled data and 1000 hours of pretraining data for Hindi
- Kashmiri Data Corpus: Collection of transcribed Kashmiri recordings taken from native speakers
- Hindi-Tamil-English ASR Challenge: 490 hours of transcribed speech data in three Indian Languages
- Large Sinhala ASR training data set: Sinhala ASR training data set containing ~185K utterances
- Large Bengali ASR training data set: Bengali ASR training data set containing ~196K utterances
- Large Nepali ASR training data set: Nepali ASR training data set containing ~157K utterances
- Crowdsourced high-quality Gujarati multi-speaker speech data set: Contains recordings of native speakers of Gujarati
- Crowdsourced high-quality Kannada multi-speaker speech data set: Contains recordings of native speakers of Kannada
- Crowdsourced high-quality Malayalam multi-speaker speech data set: Contains recordings of native speakers of Malayalam
- Crowdsourced high-quality Marathi multi-speaker speech data set: Contains recordings of native speakers of Marathi
- Crowdsourced high-quality Tamil multi-speaker speech data set: Contains recordings of native speakers of Tamil
- Crowdsourced high-quality Telugu multi-speaker speech data set: Contains recordings of native speakers of Telugu
- Nepali National corpus: The Nepali Spoken Corpus contains audio recordings from 17 different types of social activities with a total recording duration of 31 hours and 26 minutes. Described here.
- Shrutilipi: Over 6400 hours of transcribed speech corpus across 12 Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Sanskrit, Tamil, Telugu, Urdu
OCR Corpora
Multimodal Corpora
Language Specific Catalogs
Pointers to language-specific NLP resource catalogs