Improve speech recognition accuracy!

dylman123 commented 4 years ago

Try to improve:

Word recognition accuracy
Speaker diarisation/identification accuracy

Things to try:

Remove noise from audio clip
Research the Google Speech to Text API and how to improve its performance
Potentially train the model using other video data from Darren?
Are there any other ML models to try?

dylman123 commented 4 years ago

https://cloud.google.com/speech-to-text/docs/best-practices?hl=en_US

According to the Google Cloud Speech to Text docs, it is best practice to avoid preprocessing the audio. So don't use a noise reduction filter!
.wav is currently being used, which is good since it is not lossy.
Could potentially use speech adaptation for better performance? But this works best if there are specific frequent phrases/words: https://cloud.google.com/speech-to-text/docs/speech-adaptation
Or class tokens: https://cloud.google.com/speech-to-text/docs/class-tokens
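
Rough sketch of what a speech adaptation config might look like, written as a plain Python dict shaped like the REST RecognitionConfig (no client library needed to build it). The phrases, the language code and the boost value are made-up placeholders, and LINEAR16 assumes the .wav is 16-bit PCM. `boost` is documented on the v1p1beta1 SpeechContext, so treat it as optional.

```python
import json


def build_recognition_config(phrases, language_code="en-US", boost=None):
    """Return a RecognitionConfig-shaped dict with a single speech context.

    language_code and boost are assumptions for illustration only.
    """
    context = {"phrases": list(phrases)}
    if boost is not None:
        # Only attach boost when the caller asks for it.
        context["boost"] = boost
    return {
        "encoding": "LINEAR16",  # assumes 16-bit PCM .wav
        "languageCode": language_code,
        "speechContexts": [context],
    }


# Hypothetical phrases: one literal phrase plus the $MONTH class token.
config = build_recognition_config(["final captions", "$MONTH"], boost=10.0)
print(json.dumps(config, indent=2))
```

The dict would go in the `config` field of the recognize request body, next to the `audio` field.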

dylman123 commented 4 years ago

Google Speech to Text API seems very capable: https://www.youtube.com/watch?v=jOYzvq5dBrQ

dylman123 commented 4 years ago

May be possible to train your own STT model with IBM Watson... interesting: https://medium.com/ibm-watson/watson-speech-to-text-how-to-train-your-own-speech-dragon-part-1-data-collection-and-fdd8cea4f4b8

dylman123 commented 4 years ago

Maybe longer clips give more accurate results?

dylman123 commented 4 years ago

Rev API is apparently better than Google at Speech to Text?

https://www.rev.ai

Rev API also returns timestamp data :)

It also does diarization, but I haven't been able to get this working so far. The API has been returning only 1 speaker when there were clearly 2 speakers (1 male, 1 female).

dylman123 commented 4 years ago

Evaluating Rev API speaker diarization performance

I contacted Rev API support (via email):

13 April 2020

My use case requires reasonable diarization accuracy. I am currently comparing between Google Cloud Speech API and Rev AI API.

In my sample audio clip, there are 2 speakers (1 male and 1 female). However the output from Rev AI only detected 1 speaker.

My audio file only has a single channel.

I have made sure to pass in the option: skip_diarization = false, which is the default value anyway. Referring to schema: https://www.rev.ai/docs#operation/SubmitTranscriptionJob

Is this expected performance? Or am I doing something wrong?

Reply from Rev API support:

17 April 2020

After some investigation by our engineering team we found that diarization failed on this file because of the short length and fast speaker switches. This is something we are actively trying to improve. Do you have more files like this? If so, would you be able to share them with us?

Note: the file in question has a duration of 43 seconds.
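
To check diarization output quickly across more clips, something like the sketch below could count distinct speakers and speaker switches in a Rev AI JSON transcript. It assumes the transcript has a top-level `monologues` list where each monologue carries a `speaker` number; the sample transcript is made up for illustration and is not real API output.

```python
def count_speakers(transcript):
    """Number of distinct speaker labels in the transcript."""
    return len({m["speaker"] for m in transcript.get("monologues", [])})


def count_speaker_switches(transcript):
    """Number of times the speaker changes between consecutive monologues."""
    speakers = [m["speaker"] for m in transcript.get("monologues", [])]
    return sum(1 for prev, cur in zip(speakers, speakers[1:]) if prev != cur)


# Hypothetical transcript, shaped like the assumed Rev AI JSON format.
sample = {
    "monologues": [
        {"speaker": 0, "elements": [{"type": "text", "value": "Hello"}]},
        {"speaker": 1, "elements": [{"type": "text", "value": "Hi"}]},
        {"speaker": 0, "elements": [{"type": "text", "value": "Ready"}]},
    ]
}

print(count_speakers(sample), count_speaker_switches(sample))
```

For the 43 second clip above, `count_speakers` would have returned 1 where 2 were expected.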

dylman123 commented 4 years ago

New Google Cloud Speech-to-Text parameters available

Link: https://cloud.google.com/speech-to-text/docs/reference/rest/v1p1beta1/RecognitionConfig#SpeechContext

The import wizard should display optional user settings which then get sent to the transcription service for processing.

We could also add a 'Start Over' button, followed by an 'Are you sure?' prompt, so the user can re-transcribe with different params.
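
A possible shape for turning the wizard's optional settings into request params, as a sketch only. The setting names (`language`, `speaker_count`, `phrases`) are hypothetical; the output keys follow the v1p1beta1 RecognitionConfig field names (`diarizationConfig`, `speechContexts`). Settings the user leaves blank are simply left out of the config.

```python
def settings_to_config(settings):
    """Map optional wizard settings (hypothetical names) to a config dict."""
    config = {"languageCode": settings.get("language", "en-US")}

    speakers = settings.get("speaker_count")
    if speakers:
        # Pin min and max to the same value when the user knows the count.
        config["diarizationConfig"] = {
            "enableSpeakerDiarization": True,
            "minSpeakerCount": speakers,
            "maxSpeakerCount": speakers,
        }

    phrases = settings.get("phrases")
    if phrases:
        config["speechContexts"] = [{"phrases": list(phrases)}]

    return config


print(settings_to_config({"speaker_count": 2, "phrases": ["captions"]}))
```

On 'Start Over', the same function could be called again with the new settings and the job resubmitted.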
