Open Gautam-Rajeev opened 7 months ago
Hello @ChakshuGautam,
In what format will the Hindi words be displayed? For example:
- random words (word by word, making the user pronounce each one before moving on to the next)
- paragraphs, as in a story, where the user reads the paragraph and the model scores it
Also, is there a specific corpus of Hindi text to be used?
The 2nd one: a paragraph that a child can read. Ideally, the UI would show around 2 sentences at a time; the child keeps reading and the paragraph keeps scrolling down until it is fully read.
Have added a sample dataset.
Because this is to check whether a person has read correctly, the model needs to be based more on the phonetics of the audio than on auto-regressively decoding the next word.
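As an illustration of that phonetics-first idea, a minimal sketch (assuming per-frame logits from some CTC phoneme model are already available; all names here are illustrative, not the project's actual code):

```python
import numpy as np

def ctc_greedy_decode(logits: np.ndarray, id2phone: dict, blank_id: int = 0) -> list:
    """Collapse per-frame argmaxes into a phoneme sequence (CTC greedy decoding)."""
    ids = logits.argmax(axis=-1)
    phones, prev = [], blank_id
    for i in ids:
        if i != blank_id and i != prev:
            phones.append(id2phone[int(i)])
        prev = i
    return phones

def phoneme_recall(hyp: list, ref: list) -> float:
    """Fraction of reference phonemes recovered in order: a crude reading score."""
    j = 0
    for p in hyp:
        if j < len(ref) and p == ref[j]:
            j += 1
    return j / max(len(ref), 1)
```

Scoring on phoneme sequences like this avoids a language model "correcting" misread words the way an autoregressive decoder would.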
I would like to work on this project.
Do not ask process-related questions about how to apply or whom to contact in this ticket. The only questions allowed here are about the technical aspects of the project itself. For help with the process, refer to the instructions listed on Unstop; any further queries can be taken up on our Discord channel titled DMP queries. Here's a Video Tutorial on how to submit a proposal for a project.
Hello @GautamR-Samagra,
I've delved into this use case and found some pre-trained models that yield promising results with only a small amount of fine-tuning. Although I had limited resources on the free version of Colab, I managed to achieve notable improvements.
As depicted in the image, there is still some difference between the expected text and the output generated by the model.
My approach involves recording a .wav file and converting it into words, then comparing them against a repository of pre-stored correct words and sentences to derive a score. This initial evaluation phase sets the stage for fine-tuning the model to suit our specific needs.
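A minimal sketch of that comparison step (assuming the ASR output is already a string; the alignment-based score is illustrative, not necessarily the exact method used):

```python
from difflib import SequenceMatcher

def reading_score(reference: str, hypothesis: str) -> float:
    """Fraction of reference words the reader got right, using a
    word-level alignment between the reference text and the ASR output."""
    ref, hyp = reference.split(), hypothesis.split()
    matcher = SequenceMatcher(None, ref, hyp)
    correct = sum(block.size for block in matcher.get_matching_blocks())
    return correct / max(len(ref), 1)

# example: the ASR output is missing one word
print(reading_score("राम स्कूल जाता है", "राम जाता है"))  # 0.75
```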
I would greatly appreciate any feedback or suggestions you may have on refining this approach.
Thank you.
Hello @GautamR-Samagra @ChakshuGautam,
I have worked on the project, where we are supposed to implement a read-along app in offline mode. I started by using Mel-frequency cepstral coefficients (MFCCs) to find similarity between speech samples and score them, as MFCCs are computationally efficient.
However, MFCCs may not capture all aspects of pronunciation, and they also give a good similarity score even with incomplete speech, so I am currently tinkering with x-vectors.
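For reference, a minimal sketch of the MFCC-similarity idea using librosa and DTW (the exact features and distance used above are not stated, so treat this as an assumption):

```python
import librosa

def mfcc_similarity(path_a: str, path_b: str, n_mfcc: int = 13) -> float:
    """Align two recordings' MFCC sequences with DTW and return a
    similarity score in (0, 1]; higher means more similar."""
    y_a, sr_a = librosa.load(path_a, sr=16000)
    y_b, sr_b = librosa.load(path_b, sr=16000)
    mfcc_a = librosa.feature.mfcc(y=y_a, sr=sr_a, n_mfcc=n_mfcc)
    mfcc_b = librosa.feature.mfcc(y=y_b, sr=sr_b, n_mfcc=n_mfcc)
    cost, _ = librosa.sequence.dtw(X=mfcc_a, Y=mfcc_b, metric="cosine")
    # normalise the accumulated DTW cost by the combined sequence length
    dist = cost[-1, -1] / (mfcc_a.shape[1] + mfcc_b.shape[1])
    return 1.0 / (1.0 + dist)
```

A truncated reading still aligns well to a prefix of the reference, which echoes the incomplete-speech weakness noted above.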
Please reply with your valuable feedback. Thank you for your time and consideration.
Hi @GautamR-Samagra @ChakshuGautam,
After tinkering with x-vectors, I got the following results.
It was time-consuming and demanded high computation (not suitable for edge devices). In addition, it wasn't able to solve the existing problem with MFCCs. So, I will try to set up a workaround to tackle the above issue and keep you posted.
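For context, x-vectors are fixed-size utterance embeddings from a TDNN speaker model; one common way to extract them (an assumption, since the toolkit used above is not named) is SpeechBrain's pretrained VoxCeleb model:

```python
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# pretrained x-vector extractor trained on VoxCeleb (illustrative choice)
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb", savedir="tmp_xvect"
)

def xvector_similarity(path_a: str, path_b: str) -> float:
    """Cosine similarity between the x-vectors of two recordings."""
    embeddings = []
    for path in (path_a, path_b):
        signal, _ = torchaudio.load(path)
        embeddings.append(encoder.encode_batch(signal).squeeze())
    return torch.nn.functional.cosine_similarity(
        embeddings[0], embeddings[1], dim=0
    ).item()
```

Since x-vectors were designed for speaker verification, they mostly capture who is speaking rather than what is said, which may be part of why they did not fix the MFCC problem here.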
Hi @GautamR-Samagra @ChakshuGautam,
I tried setting up the prototype the other way around, and here are the results.
https://github.com/Samagra-Development/ai-tools/assets/97730338/6e7df262-f2d9-44d3-bc4a-105cbb6f8e93
This setup works in an offline environment. The score is not perfect because the selected sentence contains inconsistent spacing. This model is based on Whisper (OpenAI) and is quite large for an edge device, so I will try to reduce the size of the model (*the time taken to predict the score is due to Gradio's framework, not the model). Any feedback would help the development of the project; it would mean a lot. Thank you for your time.
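For reference, the transcription step with Whisper looks roughly like this (the model size below is a placeholder; the comment does not name it, and it is asked about just below):

```python
import whisper

# "small" is an assumption; the actual size used in the demo is not stated
model = whisper.load_model("small")

def transcribe_hindi(path: str) -> str:
    result = model.transcribe(path, language="hi")
    return result["text"]

# the transcript can then be scored against the displayed sentence,
# e.g. with the word-alignment score sketched earlier in the thread
```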
@RohanHBTU what did you use to create the x-vectors? Can you mention which Whisper model you used for the last comment?
Hi @GautamR-Samagra @ChakshuGautam,
The Whisper model was too big for an edge device in an offline environment, even after quantization. So, I tried another model which is lightweight and has low latency.
https://github.com/Samagra-Development/ai-tools/assets/97730338/5b2004e9-d834-477a-b7f1-c72c517a26c4
The model is only 42 MB (zipped) and 78 MB after extraction.
Hi @GautamR-Samagra, I wish to work on this project as part of the C4GT program. I am a pre-final-year student at IIIT Delhi, India, and I believe I will be able to contribute positively to the project. Since I only recently got to know about this program and the deadline is approaching, could you please give me clarity on what steps I should take to showcase my dedication and make my proposal strong?
Furthermore, it'd be great if I could get your Discord handle so that I can work directly under your supervision.
Awaiting your reply.
Thank you.
- Looking at other force-alignment tools here -- (1)
- Collate IndicSUPERB and NLapp datasets (create transcripts using wav2vec + any other tool) -- (2)
- Audio-acoustic model dataset requirements finalised: word audio + word pairs -- (3)
- Convert (2) into word pairs by using (1) (or any format required by the acoustic embedding model) -- (4)
- Check out tiny denoisers (ideally less than 20 MB) like https://huggingface.co/qualcomm/Facebook-Denoiser/tree/main
- Training on a mixture of IndicSUPERB and the NL dataset with acoustic embeddings based on this -- (5)
- Create a test split from both the steno results (measuring ORF) and the dataset created above -- (6)
- Iterate on improving accuracy

Acoustic model implementations:
- Word detection for audios: solving for the student pausing while speaking a word
- Model experiments:
@xorsuyash can you comment here so that I can assign this to you?
@GautamR-Samagra
cc @GautamR-Samagra
Dataset Preparation
For training the acoustic word embedding model, we need each word and its corresponding audio pronunciation. For this, we can leverage word-by-word forced alignment over the large amount of publicly available ASR data containing speech and its transcription. We utilized the Viterbi algorithm with backtracking, which finds the most probable path of the transcript's characters through the audio frames.
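A minimal sketch of that trellis-plus-backtracking step over CTC emissions (following the standard CTC forced-alignment recipe; function names and the simplified transition set are illustrative):

```python
import numpy as np

def get_trellis(emission: np.ndarray, tokens: list, blank_id: int = 0) -> np.ndarray:
    """emission: (frames, labels) log-probs from a CTC model (e.g. wav2vec2);
    tokens: label ids of the transcript characters, in order."""
    num_frames, num_tokens = emission.shape[0], len(tokens)
    trellis = np.full((num_frames + 1, num_tokens + 1), -np.inf)
    trellis[0, 0] = 0.0
    trellis[1:, 0] = np.cumsum(emission[:, blank_id])  # all-blank prefix
    for t in range(num_frames):
        for j in range(1, num_tokens + 1):
            stay = trellis[t, j] + emission[t, blank_id]           # emit blank
            move = trellis[t, j - 1] + emission[t, tokens[j - 1]]  # emit next char
            trellis[t + 1, j] = max(stay, move)
    return trellis

def backtrack(trellis: np.ndarray, emission: np.ndarray, tokens: list,
              blank_id: int = 0) -> list:
    """Recover (character index, frame) pairs for the most probable path."""
    t, j = trellis.shape[0] - 1, trellis.shape[1] - 1
    path = []
    while t > 0 and j > 0:
        stay = trellis[t - 1, j] + emission[t - 1, blank_id]
        move = trellis[t - 1, j - 1] + emission[t - 1, tokens[j - 1]]
        if move > stay:
            path.append((j - 1, t - 1))
            j -= 1
        t -= 1
    return path[::-1]
```

Grouping the per-character frames by the word boundaries in the transcript then yields the word-level audio segments used below.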
We are able to segment audio word by word here, integrated as a service in autotune here, using IndicWav2Vec2 and open-mms as the phoneme models for generating logits from the audio.
For model training we need audio-transcript pairs, which will be the output of our forced-alignment pipeline. The format of the audio-transcript data we are going to use for model training is here.
The initial approach is to use a Bi-LSTM layer to map the audio and transcript into a latent vector space, and then use objective losses that train the model to map acoustically similar words close together in that space, and to map each audio clip close to its own transcript, which can then be used to match audio to its correct orthographic segment.
Model architecture is here
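A minimal PyTorch sketch of that two-view setup (dimensions, pooling, and the loss are illustrative assumptions, not the linked architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticEncoder(nn.Module):
    """Bi-LSTM over acoustic frames (e.g. MFCCs) -> fixed-size word embedding."""
    def __init__(self, feat_dim: int = 39, hidden: int = 256, emb_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, emb_dim)

    def forward(self, x):                      # x: (batch, frames, feat_dim)
        out, _ = self.lstm(x)
        return F.normalize(self.proj(out[:, -1]), dim=-1)

class TextEncoder(nn.Module):
    """Bi-LSTM over character ids -> embedding in the same latent space."""
    def __init__(self, vocab_size: int, hidden: int = 256, emb_dim: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, 64)
        self.lstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, emb_dim)

    def forward(self, c):                      # c: (batch, chars)
        out, _ = self.lstm(self.emb(c))
        return F.normalize(self.proj(out[:, -1]), dim=-1)

def multiview_triplet_loss(audio_e, text_e, margin: float = 0.4):
    """Pull each audio embedding toward its own transcript embedding and
    push it away from another word's transcript in the batch."""
    pos = (audio_e * text_e).sum(-1)                   # matched pairs
    neg = (audio_e * text_e.roll(1, dims=0)).sum(-1)   # mismatched pairs
    return F.relu(margin - pos + neg).mean()
```

At inference, a word's audio segment is embedded and matched against the text embeddings of the expected words by cosine similarity.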
@prabakaranc98 here
- [x] Creating dataset for creating audio phonetic model
- [x] Implementation of training loop based on the paper: full training pipeline
Offline Alternative to Google's Read Along App in Hindi
Description
Develop an offline application (POC: web) that can display a set of Hindi words and accurately determine whether the user has pronounced each word correctly. The app aims to be an educational tool for Hindi language learners, providing instant feedback on their pronunciation.
The application is envisioned as an offline tool similar to Google's Read Along app but specifically for the Hindi language. It should present users with Hindi words and listen to the user's attempt to pronounce these words, providing feedback on the accuracy of their pronunciation.
Approaches for Consideration:
Implementation Details:
This is an open invitation for contributors to suggest ideas, approaches, and potential technologies that could be utilized to achieve the project goals. Contributions at all stages of development are welcome, from conceptualization to implementation.
Goals & Mid-Point Milestone
Sample audio files:
Acceptance Criteria
Being able to create a lite model that can detect the subset of words that a child has pronounced correctly.
Mockups/Wireframes
Product Name
Nipun Lakshya App
Organisation Name
SamagraX
Domain
Education
Tech Skills Needed
Machine Learning, Natural Language Processing, Python
Mentor(s)
@GautamR-Samagra
Category
Machine Learning