Gautam-Rajeev opened 10 months ago
Hey @GautamR-Samagra, looking forward to collaborating on and contributing to this project. Can you please assign it to me?
@GautamR-Samagra please grant me access to the above-linked sheets.
Thanks for trying to contribute :) I don't want to assign it yet; do raise a PR once you are able to contribute and I'll assign it to you.
@GautamR-Samagra can I get access to the audio samples? Having them would be very helpful for trying out Vosk fine-tuning and also a vector-embedding approach using Speech2Vec models.
I have given you access to the sheet. Have also collated the audios separately in a folder here
@xorsuyash thanks for pointing out that Speech2Vec as an embedding approach doesn't make sense here, since it is ultimately trained on semantics. Would an 'acoustic word embedding' model make more sense (like this)?
@GautamR-Samagra Acoustic word embeddings will help us cluster the same words spoken by different speakers. I'm still trying to figure out a way to fine-tune Vosk. One way acoustic word embeddings can help is by clustering the words spoken by different speakers: we estimate the spoken word by a similarity measure and then predict that word. https://colab.research.google.com/drive/1sWgS9JBsaqf7q_936PkTKSrnHLZfWNiS?usp=sharing
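A minimal sketch of the clustering-by-similarity idea above, assuming we already have fixed-size acoustic word embeddings for several speakers (all vectors below are toy values, not real embeddings): average per-word embeddings into reference centroids, then label a new recording by its most similar centroid.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_word(embedding, centroids):
    """Predict the word whose reference centroid is most similar."""
    return max(centroids, key=lambda w: cosine(embedding, centroids[w]))

# Toy centroids, e.g. averaged from embeddings of several speakers.
centroids = {
    "पानी": np.array([1.0, 0.1, 0.0]),
    "घर": np.array([0.0, 1.0, 0.2]),
}
new_recording = np.array([0.9, 0.2, 0.1])
print(nearest_word(new_recording, centroids))  # -> पानी
```

The actual embedding model (e.g. an acoustic word embedding network) would replace the toy vectors; the matching step stays the same.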
Task:
Create an offline alternative to Google's Read Along app in Hindi. It should be able to show a set of words and determine whether the user has spoken each word correctly.
We have taken two approaches to this:
Checking offline transcription models - Vosk:
Data: We went on the ground and collected a small dataset (20 minutes of children speaking the required words), collated here - S2T PoC Content.xlsx
Actual transcription of the audio data: The transcripts of the audio will not match the 'paragraph' being read, as students often repeat words multiple times to get them right. Hence, I transcribed all the audios using Conformer (Bhashini) to get better-quality transcribed output. The transcripts of the audios are collated here - base64_and_transcripts.xlsx
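Since students repeat words until they get them right, one simple normalization (an assumption on my part, not something described in the thread) is to collapse consecutive duplicate words in the transcript before comparing it to the target paragraph:

```python
def collapse_repeats(words):
    """Drop immediate repetitions so a re-attempted word counts once,
    e.g. ['जल', 'जल', 'है'] -> ['जल', 'है']."""
    out = []
    for w in words:
        if not out or out[-1] != w:
            out.append(w)
    return out

print(collapse_repeats("जल जल थल है है".split()))  # ['जल', 'थल', 'है']
```

This deliberately keeps non-adjacent repeats, since the same word can legitimately occur twice in a paragraph.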
Looking at small transcription models: We tested Vosk - the smallest accurate transcription model we could find - and collated its accuracy here. A Colab to run Vosk on wav files and compute accuracy is here.
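The accuracy computation used in the Colab isn't reproduced here; a minimal word-level accuracy sketch (my own hypothetical scoring, not necessarily what the Colab does) that compares a transcript against the expected word list might look like:

```python
def word_accuracy(expected, transcript):
    """Fraction of expected words that appear anywhere in the transcript.
    Order-insensitive, which suits read-aloud audio with repeats."""
    if not expected:
        return 0.0
    transcript_words = set(transcript.split())
    hits = sum(1 for w in expected if w in transcript_words)
    return hits / len(expected)

expected = ["जल", "थल", "नभ"]
print(word_accuracy(expected, "जल जल नभ"))  # ~0.667 (2 of 3 words found)
```

A stricter metric such as word error rate would also penalize insertions; for this use case (did the child say the target words?) a recall-style score may be enough.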
Vosk accuracy and next steps: We found that Vosk struggles to recognize words correctly when the recorded audio is too noisy (which is often the case for us). Hence there is a track on fine-tuning Vosk with our data to see if that improves transcription. Fine-tuning of Vosk models is covered by them in this ticket.
Other approaches: We also tried Whisper tiny for the same task, with the goal of quantizing it later for mobile use. Whisper tiny is 150 MB, and we would ideally like our model to be around ~50 MB. However, Whisper tiny didn't recognize Hindi, and other Hindi Whisper-tiny models gave very poor transcriptions. This is done here.
Figuring out a vector representation of the required words:
This is something we haven't tried yet. The idea: the sheet shared earlier already gives us the list of Hindi words we need to match the recordings against. So if we use acoustic word embeddings to encode those words as vectors and then match them directly against encoded recordings, that should be good enough for our use case. You can contribute details on the next steps to follow here.