dylman123 / final-captions-pro

A macOS app that automatically generates customisable open captions in Final Cut Pro X

Improve speech recognition accuracy! #14

Open dylman123 opened 4 years ago

dylman123 commented 4 years ago

Things to try:

  1. Remove noise from the audio clip
  2. Research the Google Speech-to-Text API and how to improve its performance
  3. Potentially train the model using other video data from Darren?
  4. Are there any other ML models to try?

dylman123 commented 4 years ago

https://cloud.google.com/speech-to-text/docs/best-practices?hl=en_US

dylman123 commented 4 years ago

Google Speech to Text API seems very capable: https://www.youtube.com/watch?v=jOYzvq5dBrQ

dylman123 commented 4 years ago

May be possible to train your own STT model with IBM Watson... interesting: https://medium.com/ibm-watson/watson-speech-to-text-how-to-train-your-own-speech-dragon-part-1-data-collection-and-fdd8cea4f4b8

dylman123 commented 4 years ago

Maybe longer clips give more accurate results?

dylman123 commented 4 years ago

Rev API is apparently better than Google at Speech to Text?

https://www.rev.ai

Rev API also returns timestamp data :)
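
Since the timestamps are what the captions need, here is a rough sketch of pulling per-word timings out of a Rev transcript. It assumes the transcript JSON has a `monologues` list whose `elements` carry `value`, `ts` and `end_ts` on text entries; that shape should be checked against the current Rev docs. `word_timings` and the sample data are made up for illustration.

```python
from __future__ import annotations

from typing import Any


def word_timings(transcript: dict[str, Any]) -> list[tuple[str, float, float]]:
    """Collect (word, start_seconds, end_seconds) from a Rev-style transcript dict."""
    timings = []
    for monologue in transcript.get("monologues", []):
        for element in monologue.get("elements", []):
            # Assumption: punctuation elements have no timestamps, so skip them.
            if element.get("type") != "text":
                continue
            timings.append((element["value"], element["ts"], element["end_ts"]))
    return timings


# Hypothetical sample in the assumed transcript shape.
sample_transcript = {
    "monologues": [
        {
            "speaker": 0,
            "elements": [
                {"type": "text", "value": "Hello", "ts": 0.5, "end_ts": 0.9},
                {"type": "punct", "value": " "},
                {"type": "text", "value": "world", "ts": 1.0, "end_ts": 1.4},
            ],
        }
    ]
}

for word, start, end in word_timings(sample_transcript):
    print(f"{start:.2f}-{end:.2f}  {word}")
```

Each tuple could then be grouped into caption-sized chunks before being placed on the timeline.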

dylman123 commented 4 years ago

Evaluating Rev API speaker diarization performance

I contacted Rev API support (via email):

13 April 2020

My use case requires reasonable diarization accuracy. I am currently comparing the Google Cloud Speech API and the Rev AI API.

In my sample audio clip, there are 2 speakers (1 male and 1 female). However, the output from Rev AI only detected 1 speaker.

My audio file only has a single channel.

I have made sure to pass in the option: skip_diarization = false, which is the default value anyway. Referring to schema: https://www.rev.ai/docs#operation/SubmitTranscriptionJob

Is this expected performance? Or am I doing something wrong?

Reply from Rev API support:

17 April 2020

After some investigation by our engineering team we found that diarization failed on this file because of the short length and fast speaker switches. This is something we are actively trying to improve. Do you have more files like this? If so, would you be able to share them with us?

Note: the file in question has a duration of 43 seconds.
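
A check like the one below could flag this failure automatically when comparing services. It is only a sketch: it assumes each monologue in the transcript JSON carries an integer `speaker` label, and `distinct_speakers`, `diarization_collapsed` and the sample data are hypothetical names, not part of any Rev SDK.

```python
from __future__ import annotations

from typing import Any


def distinct_speakers(transcript: dict[str, Any]) -> set[int]:
    """Return the set of speaker labels found in a Rev-style transcript dict."""
    return {
        monologue["speaker"]
        for monologue in transcript.get("monologues", [])
        if "speaker" in monologue
    }


def diarization_collapsed(transcript: dict[str, Any], expected_speakers: int) -> bool:
    """True when fewer speakers were detected than the clip is known to contain."""
    return len(distinct_speakers(transcript)) < expected_speakers


# Hypothetical output resembling the failure above: two people talk,
# but every monologue is attributed to speaker 0.
sample_transcript = {
    "monologues": [
        {"speaker": 0, "elements": [{"type": "text", "value": "Hi", "ts": 0.2, "end_ts": 0.4}]},
        {"speaker": 0, "elements": [{"type": "text", "value": "Hey", "ts": 0.6, "end_ts": 0.9}]},
    ]
}

if diarization_collapsed(sample_transcript, expected_speakers=2):
    print("Diarization found fewer speakers than expected")
```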

dylman123 commented 4 years ago

New Google Cloud Speech-to-Text parameters available

Link: https://cloud.google.com/speech-to-text/docs/reference/rest/v1p1beta1/RecognitionConfig#SpeechContext

The import wizard should display these parameters as optional user settings (for example, SpeechContext phrases) and send the chosen values to the transcription service with the request.
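
One way the wizard settings could map onto the request is sketched below. `CaptionSettings` and `build_recognition_config` are hypothetical names; the JSON keys (`languageCode`, `enableWordTimeOffsets`, `speechContexts`, `diarizationConfig`) are taken from the v1p1beta1 RecognitionConfig reference linked above and should be re-checked there, since beta fields can change.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class CaptionSettings:
    """Hypothetical container for the optional settings shown in the import wizard."""
    language_code: str = "en-US"
    phrase_hints: list = field(default_factory=list)  # words the user expects to hear
    speaker_count: Optional[int] = None               # None means "do not diarize"


def build_recognition_config(settings: CaptionSettings) -> dict[str, Any]:
    """Build a RecognitionConfig-style dict, adding only the options the user set."""
    config: dict[str, Any] = {
        "languageCode": settings.language_code,
        "enableWordTimeOffsets": True,  # word timings are needed to place captions
    }
    if settings.phrase_hints:
        config["speechContexts"] = [{"phrases": list(settings.phrase_hints)}]
    if settings.speaker_count:
        config["diarizationConfig"] = {
            "enableSpeakerDiarization": True,
            "minSpeakerCount": settings.speaker_count,
            "maxSpeakerCount": settings.speaker_count,
        }
    return config


user_settings = CaptionSettings(phrase_hints=["Final Cut Pro"], speaker_count=2)
print(build_recognition_config(user_settings))
```

Leaving unset options out of the dict means the service defaults apply whenever the user skips that part of the wizard.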