SOTA ASR for Hinglish

ChakshuGautam commented 6 months ago

Description

Build an ASR model for Hinglish (code-mixed Hindi-English speech), using Whisper/Fairseq, whose transcript is in en for English words and in hi for Hindi words. Also build an alternate model that gives the transcript in 100% hi or 100% en.
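To make the mixed-transcript target concrete, here is a minimal sketch that tags each word of a transcript as `hi` or `en`. It assumes that "hi" means the word is written in Devanagari script and "en" means Latin script; the issue does not state this explicitly. The names `tag_token` and `english_fraction` are hypothetical helpers, not part of Whisper or Fairseq.

```python
import re

# Assumption: "hi" tokens are written in Devanagari (Unicode block U+0900 to U+097F)
# and "en" tokens are written in basic Latin letters.
_DEVANAGARI = re.compile(r"[\u0900-\u097F]")
_LATIN = re.compile(r"[A-Za-z]")


def tag_token(token: str) -> str:
    """Return 'hi', 'en', or 'other' based on the script found in the token."""
    if _DEVANAGARI.search(token):
        return "hi"
    if _LATIN.search(token):
        return "en"
    return "other"  # digits, punctuation, or any other script


def english_fraction(transcript: str) -> float:
    """Fraction of script-tagged tokens (en or hi) that are tagged 'en'."""
    tags = [tag_token(t) for t in transcript.split()]
    en = tags.count("en")
    hi = tags.count("hi")
    return en / (en + hi) if (en + hi) else 0.0
```

A check like this could also be used to verify that a "100% hi" or "100% en" transcript really contains only one script, or to estimate how English-heavy a collected transcript is.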

Why

A custom tokenizer for Whisper/Fairseq, so that the same audio can be transcribed in different ways.
Public datasets for all three variants (mixed, 100% hi, 100% en).

Building datasets

- [ ] Collection (1M words, 100 hours) - 2 weeks
  - [ ] @rayaanoidPrime to find mixed-language podcasts, prioritizing ones that come with transcriptions
    - [ ] English heavy (70%) - Podcast | Conversational
    - [ ] Hindi heavy (30%) - YouTube | Monologue | Product review
- [ ] Alignment and cleaning of datasets
  - [ ] Transcriptions (from existing ASR models for Hinglish) => fixed by GPT through a prompt
  - [ ] Manually fix transcripts
  - [ ] Alignment issues to be fixed here itself - Common Alignment Model @xorsuyash
    - [ ] Fixed-length chunks with en transcript.
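The fixed-length chunking step could be sketched as below. This is only an illustration under assumptions: the audio is already decoded into a flat sequence of samples, and `fixed_length_chunks` is a hypothetical helper name. It cuts at fixed sample offsets and does not look at word boundaries, which is why the alignment step is needed to split the transcript at matching points.

```python
from typing import List, Sequence


def fixed_length_chunks(
    samples: Sequence[float], sample_rate: int, chunk_seconds: float
) -> List[Sequence[float]]:
    """Split decoded audio samples into consecutive chunks of chunk_seconds each.

    The final chunk is shorter when the audio length is not an exact multiple
    of the chunk size.
    """
    chunk_size = int(sample_rate * chunk_seconds)
    if chunk_size <= 0:
        raise ValueError("sample_rate * chunk_seconds must be at least 1 sample")
    return [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]
```

For example, with `sample_rate=16000` and `chunk_seconds=30`, each full chunk holds 480,000 samples.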

Training

- [ ] Whisper - 2 weeks * 3 iterations
  - [ ] Tokenisation - 1 week (figuring out)
  - [ ] Training (smaller audios) - 1 week | How - 2 days and Training - 3 days
  - [ ] Training (30s audios) - 1 week | How - 2 days and Training - 3 days
  - [ ] Training (numbers) - 1 week | How - 2 days and Training - 3 days
  - [ ] Evaluation and Publishing
- [ ] Fairseq - 3 weeks * 3 iterations
  - [ ] Tokenisation - 2 weeks (figuring out)
  - [ ] Training (smaller audios) - 1 week | How - 2 days and Training - 3 days
  - [ ] Training (30s audios) - 1 week | How - 2 days and Training - 3 days
  - [ ] Training (numbers) - 1 week | How - 2 days and Training - 3 days
  - [ ] Evaluation and Publishing
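The issue does not name an evaluation metric. Assuming word error rate (WER), a common choice for ASR, a minimal sketch of the evaluation step could look like the following. `word_error_rate` is a hypothetical helper that computes word-level edit distance divided by the number of reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Note that this compares words as exact strings, so the same word written in Devanagari in the reference and in Latin script in the hypothesis counts as an error. Each of the three transcript variants would therefore need references written in the matching script.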

xorsuyash commented 6 months ago

@ChakshuGautam will we have English audio-transcript pairs as well as Hindi audio-transcript pairs for forced alignment?

rayaanoidPrime commented 6 months ago

> @ChakshuGautam will we have English audio-transcript pairs as well as Hindi audio-transcript pairs for forced alignment?

Hey man @xorsuyash, can you link your Discord? I wanted to collaborate with you on some topics :)

harshaharod21 commented 6 months ago

Hi, I'm Harsha. I'd be happy to contribute to this project. I went through the complete description above and have understood the tasks and the task flow. I can see that a few of the tasks here are assigned to other contributors, so which task can I start working on? Should I start with the tasks in issue 2? Also, we should be working to find datasets of both types, right, that is, mixed-language transcription and monolingual transcription?

ChakshuGautam / whisper-hinglish

SOTA ASR for Hinglish #1
