Samagra-Development / ai-tools

AI Tooling to bootstrap applications fast

Being able to recognize a subset of spoken Hindi words through offline models #285

Open GautamR-Samagra opened 8 months ago

GautamR-Samagra commented 8 months ago

Task:
Create an offline alternative to Google's Read Along app for Hindi. It should show a set of words and determine whether the user has spoken each word correctly.

There are two approaches we are considering for this:

Checking offline transcription models (Vosk):

Figuring out a vector representation of the required words:

This is something we haven't tried yet. The sheet shared earlier already gives us the list of Hindi words that the recordings need to be matched against. If we use acoustic word embeddings to encode those words as vectors, we can match them directly against the encoded recordings, which should be good enough for our use case. Contributions detailing the next steps to follow here are welcome.
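A minimal sketch of the matching step described above, assuming some acoustic word embedding model has already turned each reference word and each recording into a fixed-length vector. The toy vectors, the `is_correct` helper, and the 0.8 threshold are illustrative assumptions; none of them come from the sheet or from an existing model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def is_correct(expected_word, recording_vec, reference_vecs, threshold=0.8):
    """Return True if the recording is close enough to the word being shown.

    reference_vecs maps each word to its (hypothetical) acoustic embedding.
    threshold is an assumed value that would need tuning on real recordings.
    """
    score = cosine_similarity(recording_vec, reference_vecs[expected_word])
    return score >= threshold


# Toy 3-dimensional vectors standing in for real embeddings.
reference_vecs = {
    "पानी": [0.9, 0.1, 0.0],
    "किताब": [0.0, 0.2, 0.95],
}
recording_vec = [0.85, 0.15, 0.05]  # pretend this encodes someone saying "पानी"

print(is_correct("पानी", recording_vec, reference_vecs))
print(is_correct("किताब", recording_vec, reference_vecs))
```

Because the app shows a known word and only needs a yes/no answer, a single comparison against that word's vector is enough here. No search over the whole vocabulary is required.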

xorsuyash commented 8 months ago

Hey @GautamR-Samagra, I'm looking forward to collaborating and contributing to this project. Could you please assign it to me?

xorsuyash commented 8 months ago

@GautamR-Samagra please give me access to the sheets linked above.

GautamR-Samagra commented 8 months ago

> Hey @GautamR-Samagra, I'm looking forward to collaborating and contributing to this project. Could you please assign it to me?

Thanks for trying to contribute :) I don't want to assign it yet. Do raise a PR once you are able to contribute, and I'll assign it to you.

xorsuyash commented 8 months ago

@GautamR-Samagra can I get access to the audio samples? They would be very helpful for trying out Vosk fine-tuning and also for trying the vector embedding approach using Speech2Vec models.

GautamR-Samagra commented 8 months ago

> @GautamR-Samagra can I get access to the audio samples? They would be very helpful for trying out Vosk fine-tuning and also for trying the vector embedding approach using Speech2Vec models.

I have given you access to the sheet. I have also collected the audio files separately in a folder here.

GautamR-Samagra commented 8 months ago

@xorsuyash thanks for pointing out that Speech2Vec doesn't make sense as an embedding approach here, since it is ultimately trained to capture semantics. Would an acoustic word embedding model (like this) make more sense?

xorsuyash commented 8 months ago

@GautamR-Samagra Acoustic word embeddings will help us cluster the same word spoken by different speakers. I'm still trying to figure out a way to fine-tune Vosk. One way acoustic word embeddings can help is by clustering the words spoken by different speakers, so that a new recording can be compared against those clusters with a similarity measure to predict which word was spoken. https://colab.research.google.com/drive/1sWgS9JBsaqf7q_936PkTKSrnHLZfWNiS?usp=sharing
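A rough sketch of the clustering idea, assuming each word already has embeddings from several speakers. Each word's cluster is summarised by its centroid, and a new recording is assigned to the word whose centroid is most similar. The helper names and the two-dimensional toy vectors are hypothetical and are not taken from the linked notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def predict_word(recording_vec, samples_by_word):
    """Return (word, similarity) for the centroid closest to the recording.

    samples_by_word maps each word to embeddings from different speakers.
    """
    centroids = {word: centroid(vecs) for word, vecs in samples_by_word.items()}
    best = max(centroids, key=lambda w: cosine_similarity(recording_vec, centroids[w]))
    return best, cosine_similarity(recording_vec, centroids[best])


# Toy embeddings: two speakers per word (assumed data, for illustration only).
samples_by_word = {
    "घर": [[1.0, 0.0], [0.8, 0.2]],
    "नल": [[0.0, 1.0], [0.2, 0.8]],
}

word, score = predict_word([0.9, 0.1], samples_by_word)
print(word, round(score, 3))
```

Averaging over speakers is the simplest way to represent a cluster. Keeping every sample and taking the nearest neighbour is an alternative if the speakers vary a lot.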