abhirooptalasila / AutoSub

A CLI script to generate subtitle files (SRT/VTT/TXT) for any video using either DeepSpeech or Coqui
MIT License
581 stars 102 forks

Some words are missing #48

Closed. ibndias closed this issue 2 years ago.

ibndias commented 2 years ago

Hi, thanks for the great project!

I have a problem with some words missing from the transcript. But if I transcribe the same audio using the DeepSpeech project directly (not AutoSub with the DS engine), there are no missing words.

Are there any tweaks that can be done via parameters? Or is it because of the silent-segment removal process?

Here is the TXT output from AutoSub with the DS engine:

biggest . 

people make when larry english and probably one of the most common miss. 

people think that they. 

don't study. 

live in. 

an out let me explain what i . 

one does studying men and how do people usually approach this pro. 

and how do people. 

And here is the DeepSpeech output:

the biggest mistake people make when morning english and probably one of the most common misconceptions is that people think that they need to study english and usedn't study english live english an outlet explain what i mean one does studying men and how do people 

As you can see, some words are missing from the AutoSub output.

I am using the same DeepSpeech 0.9.3 version and model for both AutoSub and DeepSpeech.

abhirooptalasila commented 2 years ago

Hi. Are you using the latest version of AutoSub? If yes: I switched the default inference engine to Coqui STT, as it has better support for different languages. You can change this by setting --engine to "ds" while running main.py and checking again. If you are sure that you're using DeepSpeech, you can play around with the default parameter values here.

ibndias commented 2 years ago

Are you using the latest version of AutoSub?

Yes, I am using the latest master branch.

If yes, I switched the default inference to Coqui STT as it has better support for different languages. You can change this by setting --engine to "ds" while running main.py and checking again.

Yes, I also changed the engine to DeepSpeech:

(sub) derry@10700k:~/ws/AutoSub$ python3 autosub/main.py --engine ds --file ./qjbBeORPUA4-oo9mOmdonl.mp4 
[INFO] ARGS: Namespace(dry_run=False, engine='ds', file='./qjbBeORPUA4-oo9mOmdonl.mp4', format=['srt', 'vtt', 'txt'], model=None, scorer=None, split_duration=5)
[INFO] Model: /home/derry/ws/AutoSub/deepspeech-0.9.3-models.pbmm
[INFO] Scorer: /home/derry/ws/AutoSub/deepspeech-0.9.3-models.scorer
[INFO] Input file: ./qjbBeORPUA4-oo9mOmdonl.mp4
[INFO] Extracted audio to audio/qjbBeORPUA4-oo9mOmdonl.wav
[INFO] Splitting on silent parts in audio file
[INFO] Running inference...
TensorFlow: v2.3.0-6-g23ad988
DeepSpeech: v0.9.3-0-gf2e9c85
...

play around with the default parameter values here.

Thanks for the hints! I got 'better' results using smoothing_window=0.5 and weight=0.01. However, I don't really understand how these parameters work, or where the magic numbers for st_win and st_step come from. Can you explain a little?
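
For reference, here is roughly how I understand the call under the hood. This is only a sketch assuming pyAudioAnalysis's silence_removal API; the file path and parameter values are my examples, not AutoSub's exact code:

# Sketch of silence-based segmentation as I understand it (pyAudioAnalysis).
# File path and parameter values are examples, not AutoSub's actual defaults.
from pyAudioAnalysis import audioBasicIO
from pyAudioAnalysis import audioSegmentation as aS

sampling_rate, signal = audioBasicIO.read_audio_file("audio/example.wav")

# st_win / st_step: short-term feature window and step sizes, in seconds.
# smooth_window: smoothing applied to the speech-probability curve (seconds).
# weight: threshold factor on that curve; lower values keep more audio.
segments = aS.silence_removal(
    signal,
    sampling_rate,
    st_win=0.020,
    st_step=0.020,
    smooth_window=0.5,
    weight=0.01,
    plot=False,
)

# segments is a list of [start_sec, end_sec] speech intervals; everything
# outside them is treated as silence and never reaches inference, which
# is presumably where my missing words go.
print(segments)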

I think adding a switch to disable silence removal is needed for non-movie videos (continuous conversation). :)

abhirooptalasila commented 2 years ago

This is a better explanation. Thanks for the suggestion about silence removal. I'll think about how to decouple it from splitting the file.
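
As a rough idea, a fixed-duration fallback could look something like the sketch below. This is hypothetical and pydub-based, not committed code and not necessarily how I'd wire it into main.py. Every chunk would go to inference, so nothing gets dropped as mis-detected silence, at the cost of less natural subtitle breaks.

# Hypothetical fixed-duration splitter as an alternative to silence removal.
# The function name and the pydub approach are illustrative only.
from pydub import AudioSegment

def split_fixed(input_wav, chunk_seconds=5):
    """Yield (start_sec, end_sec, chunk) tuples of fixed duration."""
    audio = AudioSegment.from_wav(input_wav)
    step_ms = chunk_seconds * 1000
    for start_ms in range(0, len(audio), step_ms):
        chunk = audio[start_ms:start_ms + step_ms]
        yield start_ms / 1000, (start_ms + len(chunk)) / 1000, chunk

# Export each chunk for the STT engine to transcribe.
for start, end, chunk in split_fixed("audio/example.wav"):
    chunk.export(f"audio/chunk_{start:08.2f}.wav", format="wav")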