Samagra-Development / ai-tools

AI Tooling to bootstrap applications fast

[DMP 2024]: Create offline audio-phonetic matching model #313

Open · Gautam-Rajeev opened 7 months ago

Gautam-Rajeev commented 7 months ago

Offline Alternative to Google's Read Along App in Hindi

Description

Develop an offline application (POC - web) that can display a set of Hindi words and accurately determine if the user has pronounced each word correctly. The app aims to be an educational tool for Hindi language learners, providing instant feedback on their pronunciation.

The application is envisioned as an offline tool similar to Google's Read Along app but specifically for the Hindi language. It should present users with Hindi words and listen to the user's attempt to pronounce these words, providing feedback on the accuracy of their pronunciation.

Approaches for Consideration:

Implementation Details:

This is an open invitation for contributors to suggest ideas, approaches, and potential technologies that could be utilized to achieve the project goals. Contributions at all stages of development are welcome, from conceptualization to implementation.

Goals & Mid-Point Milestone

Sample audio files:

Acceptance Criteria

Being able to create a lite model that can detect the subset of words that a child has pronounced correctly.

Mockups/Wireframes

Product Name

Nipun Lakshya App

Organisation Name

SamagraX

Domain

Education

Tech Skills Needed

Machine Learning, Natural Language Processing, Python

Mentor(s)

@GautamR-Samagra

Category

Machine Learning

Azazel0203 commented 7 months ago

Hello @ChakshuGautam,

The Hindi words displayed: what format will they take? For example:

  1. Random words (word by word: the user pronounces one, then moves on to the next)?
  2. Paragraphs, as in a story, that the user reads while the model scores them?

Also, is there a specific corpus of Hindi text to be used?

Gautam-Rajeev commented 7 months ago

> The Hindi words displayed: what format will they take? For example:
>
> 1. Random words (word by word: the user pronounces one, then moves on to the next)?
> 2. Paragraphs, as in a story, that the user reads while the model scores them?
>
> Also, is there a specific corpus of Hindi text to be used?

The 2nd one: a paragraph that a child can read. Ideally the UI would show around two sentences that the child keeps reading, with the paragraph scrolling down until fully read.

Have added a sample dataset.

Gautam-Rajeev commented 7 months ago

> The 2nd one: a paragraph that a child can read. Ideally the UI would show around two sentences that the child keeps reading, with the paragraph scrolling down until fully read.
>
> Have added a sample dataset.

Because this is to check whether a person has read correctly or not, the model needs to be based more on the phonetics of the audio than on auto-regressively decoding the next word.
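As a minimal sketch of what phonetics-first matching could look like (assuming the `epitran` package, which has a Hindi/Devanagari mode; this is an illustration, not an approach settled in this thread):

```python
# Illustrative only: compare reference and attempt at the phoneme level
# rather than as raw text, so spelling-level differences don't dominate.
import epitran
from difflib import SequenceMatcher

epi = epitran.Epitran("hin-Deva")  # Hindi written in Devanagari

def phonetic_score(reference: str, hypothesis: str) -> float:
    """Similarity of two strings after grapheme-to-phoneme conversion."""
    ref_ipa = epi.transliterate(reference)
    hyp_ipa = epi.transliterate(hypothesis)
    return SequenceMatcher(None, ref_ipa, hyp_ipa).ratio()

print(phonetic_score("नमस्ते", "नमसते"))  # a near-miss scores close to 1.0
```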

horoshiny commented 7 months ago

I would like to work on this project

AbhimanyuSamagra commented 7 months ago

Do not ask process-related questions about how to apply or who to contact in this ticket. The only questions allowed are about technical aspects of the project itself. If you want help with the process, refer to the instructions listed on Unstop; any further queries can be taken up on our Discord channel titled DMP queries. Here's a Video Tutorial on how to submit a proposal for a project.

Azazel0203 commented 7 months ago

Hello @GautamR-Samagra,

I've delved into this use case and found some pre-trained models that yield promising results with only a small amount of fine-tuning. Although I had limited resources on the free tier of Colab, I managed to achieve notable improvements.

Attached Image: actual_output vs output_generated

As depicted in the image, there's still some difference between the actual output and the output generated by the model.

My approach involves recording a .wav file, converting it into words, and then comparing those words against a repository of pre-stored correct words and sentences to derive a score. This initial evaluation phase sets the stage for fine-tuning the model to suit our specific needs.
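A minimal sketch of that comparison step (the transcription itself is assumed to come from the ASR model; the alignment-based scoring here is an illustration, not the exact code used):

```python
# Illustrative scoring step: align recognised words against the stored
# reference sentence and mark which reference words were matched. The ASR
# output (`hypothesis`) is assumed to come from the fine-tuned model.
from difflib import SequenceMatcher

def score_reading(reference: str, hypothesis: str):
    ref_words, hyp_words = reference.split(), hypothesis.split()
    matcher = SequenceMatcher(None, ref_words, hyp_words)
    matched = set()
    for block in matcher.get_matching_blocks():
        matched.update(range(block.a, block.a + block.size))
    per_word = [(word, i in matched) for i, word in enumerate(ref_words)]
    score = len(matched) / max(len(ref_words), 1)
    return score, per_word

score, detail = score_reading("मेरा नाम राम है", "मेरा नाम है")
print(score)   # 0.75: three of the four reference words were read
print(detail)  # [('मेरा', True), ('नाम', True), ('राम', False), ('है', True)]
```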

I would greatly appreciate any feedback or suggestions you may have on refining this approach.

Thank you.

RohanHBTU commented 7 months ago

hello @GautamR-Samagra @ChakshuGautam ,

I have worked on the project, where we are supposed to implement a read-along app in offline mode. I started by using Mel-frequency cepstral coefficients (MFCCs) to measure similarity between speech samples and score them, since MFCCs are computationally efficient.

[image]

Since MFCCs may not capture all aspects of pronunciation and also give high similarity scores for incomplete speech, I am currently tinkering with x-vectors.
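For reference, a minimal sketch of how such an MFCC similarity might be computed with librosa and dynamic time warping (file names and the distance-to-score mapping are placeholders, not the exact code used):

```python
# Illustrative MFCC comparison: dynamic time warping over MFCC frames,
# with the accumulated cost normalised by the warping-path length.
import librosa

def mfcc_similarity(path_a: str, path_b: str, n_mfcc: int = 13) -> float:
    y_a, sr_a = librosa.load(path_a, sr=16000)
    y_b, sr_b = librosa.load(path_b, sr=16000)
    mfcc_a = librosa.feature.mfcc(y=y_a, sr=sr_a, n_mfcc=n_mfcc)
    mfcc_b = librosa.feature.mfcc(y=y_b, sr=sr_b, n_mfcc=n_mfcc)
    D, wp = librosa.sequence.dtw(X=mfcc_a, Y=mfcc_b, metric="euclidean")
    cost = D[-1, -1] / len(wp)   # average cost along the optimal path
    return 1.0 / (1.0 + cost)    # map distance to a (0, 1] score

print(mfcc_similarity("reference.wav", "attempt.wav"))
```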

Please reply with your valuable feedback. Thank you for your time and consideration.

RohanHBTU commented 7 months ago

hi @GautamR-Samagra @ChakshuGautam,

after tinkering with X-vectors, I got the following results.

[image]

It was time-consuming and computationally demanding (not suitable for edge devices). In addition, it wasn't able to solve the existing problem with MFCCs. So I will try to set up a workaround to tackle the above issue and will keep you posted.
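The comment does not say which x-vector extractor was used; as one hedged example, SpeechBrain's pretrained VoxCeleb x-vector model can embed two utterances for this kind of cosine comparison:

```python
# Hypothetical example: the thread doesn't name the extractor used.
# SpeechBrain's pretrained VoxCeleb x-vector model is shown purely as an
# illustration (this import path is for SpeechBrain < 1.0; newer releases
# moved it to speechbrain.inference).
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb"
)

def xvector(path: str) -> torch.Tensor:
    """Embed a mono utterance into a single x-vector."""
    signal, _sr = torchaudio.load(path)  # [1, time] for mono audio
    return classifier.encode_batch(signal).squeeze()

# Cosine similarity between a reference reading and a learner's attempt.
a, b = xvector("reference.wav"), xvector("attempt.wav")
print(float(torch.nn.functional.cosine_similarity(a, b, dim=0)))
```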

RohanHBTU commented 7 months ago

hi @GautamR-Samagra @ChakshuGautam,

I tried setting up the prototype the other way around and here are the results.

https://github.com/Samagra-Development/ai-tools/assets/97730338/6e7df262-f2d9-44d3-bc4a-105cbb6f8e93

This setup works in an offline environment. The score is not perfect because the selected sentence contains inconsistent spacing. The model is based on Whisper (OpenAI) and is quite large for an edge device, so I will try to reduce the size of the model (*the time taken to predict the score is due to Gradio's framework, not the model). Any feedback would be good for the development of the project; it would mean a lot. Thank you for your time.
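One common route for shrinking Whisper for offline use (not necessarily what was done here) is int8 quantisation through faster-whisper / CTranslate2; a minimal sketch:

```python
# Illustrative only: int8-quantised Whisper "tiny" via faster-whisper,
# which runs fully offline once the model files are cached locally.
from faster_whisper import WhisperModel

model = WhisperModel("tiny", device="cpu", compute_type="int8")

# language="hi" skips language detection; the file name is a placeholder.
segments, info = model.transcribe("attempt.wav", language="hi")
transcript = " ".join(segment.text.strip() for segment in segments)
print(transcript)
```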

Gautam-Rajeev commented 7 months ago

@RohanHBTU what did you use to create the x-vectors? Can you mention which Whisper model you used for the last comment?

RohanHBTU commented 7 months ago

hi @GautamR-Samagra @ChakshuGautam ,

The Whisper model was too big for an edge device in an offline environment, even after quantization. So I tried another model that is lightweight and low-latency.

https://github.com/Samagra-Development/ai-tools/assets/97730338/5b2004e9-d834-477a-b7f1-c72c517a26c4

The model is only 42 MB zipped and 78 MB after extraction.

Ashutosh-Gera commented 6 months ago

Hi @GautamR-Samagra, I wish to work on this project as part of the C4GT program. I am a pre-final-year student at IIIT Delhi, India, and I believe I will be able to contribute positively to the project. Since I only recently got to know about this program and the deadline is approaching, could you please clarify what steps I should take to showcase my dedication and make my proposal strong?

Furthermore, it'd be great if I could get your Discord so that I can work directly under your supervision.

Awaiting your reply.

Thank you.

Gautam-Rajeev commented 5 months ago

  1. Create an alignment model that can take as input audio-transcript combinations of any length and output a list of audio-transcript combinations of any set audio length / word length -- (1)

     Looking at other forced-alignment tools here.

  2. Collate the IndicSUPERB and NL-app datasets (create transcripts using wav2vec + any other tool) -- (2)

  3. Finalise the audio-acoustic model dataset requirements: word audio + word pairs -- (3)

  4. Convert (2) into word pairs by using (1) (or any format required by the acoustic embedding model) -- (4)

  5. Check out tiny denoisers (ideally less than 20 MB), like https://huggingface.co/qualcomm/Facebook-Denoiser/tree/main

  6. Train on a mixture of the IndicSUPERB and NL datasets with acoustic embeddings based on this -- (5)

  7. Create a test split from both the steno results (measuring ORF) and the dataset created above -- (6)

  8. Iterate on improving accuracy

  9. Acoustic model implementations:

    • Train on Metaphone phonetic conversions instead of the transcripts directly, as shown here

  10. Word detection for audios: solving for the student pausing while speaking a word

  11. Model experiments:

    • Fine-tune and quantize Whisper and measure ORF (oral reading fluency) by setting a cutoff on token probabilities; be able to get token probabilities for stream-like Whisper and carry out the above (see the sketch after this list)

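As a rough illustration of the token-probability idea in the last item (the model name, greedy decoding, and the cutoff policy are assumptions, not decisions from this thread), per-token probabilities can be surfaced from a transformers Whisper run like this:

```python
# Hypothetical sketch: expose per-token probabilities from Whisper via
# transformers, so a confidence cutoff can be applied when estimating ORF.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

def tokens_with_probs(input_features):
    """Greedy decode, returning (token_text, probability) pairs."""
    out = model.generate(
        input_features, output_scores=True, return_dict_in_generate=True
    )
    pairs = []
    for step, scores in enumerate(out.scores):
        # sequences[0, 0] is the decoder start token; scores align from there.
        token_id = int(out.sequences[0, step + 1])
        prob = torch.softmax(scores[0], dim=-1)[token_id]
        pairs.append((processor.decode([token_id]), float(prob)))
    return pairs

# input_features would come from:
# processor(audio_array, sampling_rate=16000, return_tensors="pt").input_features
# Tokens whose probability falls below a chosen cutoff would be treated as
# not confidently read.
```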
Gautam-Rajeev commented 5 months ago

@xorsuyash can you comment here so that I can assign this to you?

xorsuyash commented 5 months ago

@GautamR-Samagra

xorsuyash commented 5 months ago

cc @GautamR-Samagra

Training an acoustic word embedding model to optimize audio-transcript matching

[image] The model architecture is here.

Multilingual jointly trained acoustic word embedding model

DMP proposal
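As a compressed, hypothetical sketch of the kind of acoustic word embedding encoder referenced above (layer sizes, mean-pooling, and the triplet margin are illustrative assumptions, not the proposal's actual architecture):

```python
# Sketch: a recurrent encoder pooled to a fixed vector, trained with a
# triplet loss so that audio segments of the same word embed close together
# while different words are pushed apart.
import torch
import torch.nn as nn

class AWEEncoder(nn.Module):
    def __init__(self, n_mels: int = 40, hidden: int = 256, dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, dim)

    def forward(self, x):                 # x: [batch, frames, n_mels]
        out, _ = self.rnn(x)
        pooled = out.mean(dim=1)          # mean-pool over time
        return nn.functional.normalize(self.proj(pooled), dim=-1)

encoder = AWEEncoder()
loss_fn = nn.TripletMarginLoss(margin=0.4)
anchor = encoder(torch.randn(8, 60, 40))    # a word, speaker A
positive = encoder(torch.randn(8, 60, 40))  # same word, speaker B
negative = encoder(torch.randn(8, 60, 40))  # a different word
loss = loss_fn(anchor, positive, negative)
loss.backward()
```

The point of such an embedding is that a fixed-size vector comparison can replace per-frame alignment at inference time, which is what makes audio-transcript matching cheap enough for an offline, on-device setting.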

Gautam-Rajeev commented 5 months ago

@prabakaranc98 here

Gautam-Rajeev commented 3 months ago

Weekly Goals

Week 1

Week 2

Week 3

Week 4

Week 6

Week 7

Week 8

Week 9

Week 10

Week 11

Week 12

xorsuyash commented 3 months ago

Weekly Goals

Week 1

Week 2

Week 3

Week 4

Week 6

Week 7

Week 8

Week 9

Week 10

Week 11

Week 12