Gautam-Rajeev opened 10 months ago
Hey @GautamR-Samagra, looking forward to collaborating on and contributing to this project. Can you please assign it to me?
@GautamR-Samagra please grant me access to the above-linked sheets.
Thanks for trying to contribute :) I don't want to assign it yet; do raise a PR once you are able to contribute and I'll assign it to you.
@GautamR-Samagra can I get access to the audio samples? Having them would be very helpful for trying out Vosk fine-tuning and also a vector-embedding approach using Speech2Vec models.
I have given you access to the sheet. Have also collated the audios separately in a folder here
@xorsuyash thanks for pointing out that Speech2Vec as an embedding approach doesn't make sense here, since it is ultimately trained on semantics. Would an 'acoustic word embedding' model make more sense (like this)?
@GautamR-Samagra Acoustic word embeddings will help us cluster the same words spoken by different speakers. I'm still trying to figure out a way to fine-tune Vosk. One way acoustic word embeddings can help is by clustering the words spoken by different speakers: we estimate the spoken word by a similarity measure and then predict that word. https://colab.research.google.com/drive/1sWgS9JBsaqf7q_936PkTKSrnHLZfWNiS?usp=sharing
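A minimal sketch of the clustering-by-similarity idea above, assuming we already have fixed-size acoustic word embeddings for several speakers (all vectors below are toy values, not real embeddings): average per-word embeddings into reference centroids, then label a new recording by its most similar centroid.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_word(embedding, centroids):
    """Predict the word whose reference centroid is most similar."""
    return max(centroids, key=lambda w: cosine(embedding, centroids[w]))

# Toy centroids, e.g. averaged from embeddings of several speakers.
centroids = {
    "पानी": np.array([1.0, 0.1, 0.0]),
    "घर": np.array([0.0, 1.0, 0.2]),
}
new_recording = np.array([0.9, 0.2, 0.1])
print(nearest_word(new_recording, centroids))  # -> पानी
```

The actual embedding model (e.g. an acoustic word embedding network) would replace the toy vectors; the matching step stays the same.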
Task:
Create an offline alternative to Google's Read Along app in Hindi. It should be able to show a set of words and determine whether the user has spoken each word correctly.
We have taken two approaches to this:
Checking offline transcription models - Vosk:
Data: We went on the ground and collected a small dataset (20 minutes of children speaking the required words), collated here - S2T PoC Content.xlsx
Actual transcription of the audio data: The transcripts of the audio will not match the 'paragraph' being read, as students often repeat words multiple times to get them right. Hence, I transcribed all the audios using Conformer (Bhashini) to get better-quality transcribed output. The transcripts of the audios are collated here - base64_and_transcripts.xlsx
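Since students repeat words until they get them right, one simple normalization (an assumption on my part, not something described in the thread) is to collapse consecutive duplicate words in the transcript before comparing it to the target paragraph:

```python
def collapse_repeats(words):
    """Drop immediate repetitions so a re-attempted word counts once,
    e.g. ['जल', 'जल', 'है'] -> ['जल', 'है']."""
    out = []
    for w in words:
        if not out or out[-1] != w:
            out.append(w)
    return out

print(collapse_repeats("जल जल थल है है".split()))  # ['जल', 'थल', 'है']
```

This deliberately keeps non-adjacent repeats, since the same word can legitimately occur twice in a paragraph.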
Looking at small transcription models: We tested Vosk - the smallest accurate transcription model we could find - and collated its accuracy here. A Colab to run Vosk on wav files and compute accuracy is here.
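The accuracy computation used in the Colab isn't reproduced here; a minimal word-level accuracy sketch (my own hypothetical scoring, not necessarily what the Colab does) that compares a transcript against the expected word list might look like:

```python
def word_accuracy(expected, transcript):
    """Fraction of expected words that appear anywhere in the transcript.
    Order-insensitive, which suits read-aloud audio with repeats."""
    if not expected:
        return 0.0
    transcript_words = set(transcript.split())
    hits = sum(1 for w in expected if w in transcript_words)
    return hits / len(expected)

expected = ["जल", "थल", "नभ"]
print(word_accuracy(expected, "जल जल नभ"))  # ~0.667 (2 of 3 words found)
```

A stricter metric such as word error rate would also penalize insertions; for this use case (did the child say the target words?) a recall-style score may be enough.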
Vosk accuracy and next steps: We found that Vosk struggles to recognize words correctly when the recorded audio is too noisy (which is often the case for us). Hence there is a track on fine-tuning Vosk with our data to see if that improves transcription. Fine-tuning of Vosk models is covered by them in this ticket.
Other approaches: We also tried Whisper tiny for the same task, with the goal of quantizing it later for mobile use. Whisper tiny is 150 MB, and we would ideally like our model to be around ~50 MB. However, Whisper tiny didn't recognize Hindi, and other Hindi Whisper-tiny models gave very poor transcriptions. This is done here.
Figuring out a vector representation of the required words:
This is something we haven't tried yet. The idea: the sheet shared earlier already gives us the list of Hindi words we need to match the recordings against. So if we use acoustic word embeddings to encode those words as vectors and then match them directly against encoded recordings, that should be good enough for our use case. You can contribute details on the next steps to follow here.