ChakshuGautam / whisper-hinglish

Research on existing datasets and Techniques #2

Open rayaanoidPrime opened 6 months ago

rayaanoidPrime commented 6 months ago

There are some existing datasets that we can leverage directly, such as:

https://www.openslr.org/104/ contains aligned Hindi-English speech data extracted from spoken tutorials on technical topics and lectures. The Hindi-English train and test sets contain 89.86 hours and 5.18 hours, respectively. Hugging Face link: https://huggingface.co/datasets/ujs/hinglish-compressed

Synthetic generation of code-switching datasets from monolingual sources

https://arxiv.org/abs/2306.08753v3 : This paper outlines a methodology for generating a code-switching dataset from monolingual sources.
- Code - https://github.com/NVIDIA/NeMo/tree/main/scripts/speech_recognition/code_switching
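To make the idea concrete, here is a rough sketch of how a synthetic code-switched sample could be assembled from monolingual utterances. This is only an illustration of the general approach, not the NeMo script linked above: the function name `make_code_switched_sample`, the `max_segments` guard, and the manifest-style fields (`audio_filepath`, `duration`, `text`) are assumptions made for the example, and only the metadata is combined here (stitching the actual audio is left out).

```python
import random


def make_code_switched_sample(manifests, max_duration, rng, max_segments=50):
    """Chain monolingual utterances into one synthetic code-switched sample.

    manifests: dict mapping a language code to a list of utterance dicts,
               each with "audio_filepath", "duration" (seconds) and "text".
               (Hypothetical layout, assumed for this sketch.)
    """
    langs = sorted(manifests)
    segments, total, prev_lang = [], 0.0, None
    while len(segments) < max_segments:
        # Avoid repeating the previous language so the sample actually switches.
        candidates = [lang for lang in langs if lang != prev_lang] or langs
        lang = rng.choice(candidates)
        utt = rng.choice(manifests[lang])
        # Stop once the next utterance would exceed the duration budget.
        if total + utt["duration"] > max_duration:
            break
        segments.append({**utt, "lang": lang})
        total += utt["duration"]
        prev_lang = lang
    return {
        "segments": segments,
        "text": " ".join(s["text"] for s in segments),
        "duration": total,
    }


if __name__ == "__main__":
    # Toy monolingual "manifests"; paths and transcripts are made up.
    toy = {
        "hi": [{"audio_filepath": "hi_0.wav", "duration": 2.0, "text": "aaj hum seekhenge"}],
        "en": [{"audio_filepath": "en_0.wav", "duration": 3.0, "text": "how to train a model"}],
    }
    sample = make_code_switched_sample(toy, max_duration=7.0, rng=random.Random(0))
    print(sample["text"], sample["duration"])
```

The resulting `segments` list could then drive a separate step that concatenates the corresponding audio files in the same order.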

Tasks :

- [ ] Research more existing datasets that we can use directly
- [ ] Write scripts to collate all the dataset sources into a single dataset
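
For the collation task, a minimal sketch of what such a script might look like is below. It assumes each source can be read as a list of records with its own field names for the audio path and transcript; the unified schema (`audio`, `text`, `source`), the function names, and the example field names are all hypothetical and would need adjusting to the real datasets.

```python
import json


def normalize(record, source, audio_key, text_key):
    """Map one source-specific record onto a shared (assumed) schema."""
    return {
        "audio": record[audio_key],
        "text": record[text_key].strip(),
        "source": source,
    }


def collate(sources):
    """Merge several sources into one list, dropping exact duplicates.

    sources: iterable of (name, records, audio_key, text_key) tuples.
    """
    seen, merged = set(), []
    for name, records, audio_key, text_key in sources:
        for record in records:
            row = normalize(record, name, audio_key, text_key)
            key = (row["audio"], row["text"])
            if key in seen:
                continue  # same audio path and transcript already collected
            seen.add(key)
            merged.append(row)
    return merged


def to_jsonl(rows):
    """Serialize rows as JSON Lines, keeping non-ASCII text readable."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


if __name__ == "__main__":
    # Made-up records standing in for two differently shaped sources.
    source_a = [{"path": "a/1.wav", "sentence": " main office ja raha hoon "}]
    source_b = [{"file": "b/7.wav", "transcript": "meeting kal hai"}]
    rows = collate([
        ("source_a", source_a, "path", "sentence"),
        ("source_b", source_b, "file", "transcript"),
    ])
    print(to_jsonl(rows))
```

Keeping a `source` field on every row makes it easy to filter or re-weight individual datasets later.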

harshaharod21 commented 6 months ago

Already existing datasets:

1) https://github.com/google-research-datasets/hinglish-top-dataset (sourced from AI4Bharat: https://github.com/AI4Bharat/indicnlp_catalog)
2) https://github.com/goru001/nlp-for-hinglish (dataset link: https://www.dropbox.com/sh/as5fg8jsrljt6k7/AADnSLlSNJPeAndFycJGurOUa?e=1&dl=0)

As this project is for public use cases, could we not request free access to the following dataset from IITG? https://www.iitg.ac.in/eee/emstlab/HingCoS_Database/HingCoS.html

Existing ASR model for Hinglish: https://github.com/Open-Speech-EkStep/vakyansh-models?tab=readme-ov-file#interspeech-2021-asr-models