abhirooptalasila / AutoSub

A CLI script to generate subtitle files (SRT/VTT/TXT) for any video using either DeepSpeech or Coqui
MIT License

is it possible to retrain on mistakes? #44

Open · Kreijstal opened this issue 2 years ago

Kreijstal commented 2 years ago

Given an incorrect subtitle, would it be possible to provide a corrected one and retrain on it?

abhirooptalasila commented 2 years ago

That sounds like a great idea, although it'll be really difficult: we can't ensure that the audio is split correctly, the time offsets would have to be perfect, and it's not practical to fine-tune on a single sample. Do you have any approaches in mind?

Kreijstal commented 2 years ago

Hmm, maybe make it more transparent how the audio is split? Also, how does the audio splitting work? Is it also an AI?

abhirooptalasila commented 2 years ago

I segment the audio on its silent parts using code adapted from this project. It's not an AI. We can tune the splitting parameters, but it's not a one-size-fits-all solution.
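
To make that step easier to picture, here is a minimal sketch of silence-based segmentation using pydub. This is not AutoSub's actual code (AutoSub adapts the linked project's segmentation), and the threshold values are assumptions that typically need per-recording tuning, which is exactly the "not one-size-fits-all" problem mentioned above.

```python
# Minimal illustration of silence-based splitting (not AutoSub's actual code).
# The threshold and minimum-silence values below are assumptions and usually
# need tuning per recording.
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("audio/example.wav")

chunks = split_on_silence(
    audio,
    min_silence_len=500,             # silence must last >= 500 ms to count as a split point
    silence_thresh=audio.dBFS - 16,  # anything 16 dB below average loudness counts as silence
    keep_silence=250,                # keep 250 ms of padding so words aren't clipped
)

# Write each segment out; these chunks would then be transcribed one by one.
for i, chunk in enumerate(chunks):
    chunk.export(f"audio/chunk_{i:04d}.wav", format="wav")
```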

xiaomao2013 commented 2 years ago

I am very happy to see your work; it clearly took a lot of effort. Do you have any experience with NVIDIA NeMo? I found that NeMo's recognition efficiency is very high. I look forward to a version that uses NeMo as the recognition core when you have time. ^_^

abhirooptalasila commented 2 years ago

Hi, I will check it out. Do you know if the model outputs timing information for the detected speech segments? Because that's how I build the subtitle files. Do you know which performs better: HuggingFace Wav2Vec or NeMo?

xiaomao2013 commented 2 years ago

You can test in Google Colab whether the model outputs timing information for the detected speech segments; please use BRANCH = 'v1.0.2' for the test. Since I really don't know HuggingFace Wav2Vec, I can't say which is better, but in the NeMo examples I saw that the individual spoken words are easily separated, and the code is there too. The specific file locations are:
NeMo/examples/asr/
NeMo/tutorials/asr/01_ASR_with_NeMo.ipynb
NeMo/tutorials/asr/Offline_ASR.ipynb
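
As a rough illustration of how word timing can be pulled out of a NeMo CTC model, here is a sketch following the approach in the Offline_ASR.ipynb tutorial mentioned above. The model name, the logprobs flag, and the 20 ms-per-step value are assumptions that should be checked against your NeMo version and model config.

```python
# Illustrative sketch only (NeMo ~1.0.x assumed): greedy CTC decoding that keeps
# the timestep at which each character is emitted, then groups characters into
# words with start times. Not AutoSub code.
import numpy as np
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

# Per-timestep logits over the character vocabulary (logprobs=True skips decoding).
logits = model.transcribe(paths2audio_files=["sample.wav"], logprobs=True)[0]

labels = list(model.decoder.vocabulary)  # characters; the CTC blank is the last index
blank_id = len(labels)
time_stride = 0.02  # assumed: 10 ms window stride * 2x downsampling for QuartzNet

# Greedy CTC collapse, remembering when each new character appears.
chars = []
prev = blank_id
for t, idx in enumerate(np.argmax(logits, axis=1)):
    if idx != blank_id and idx != prev:
        chars.append((labels[idx], t * time_stride))
    prev = idx

# Group characters into words with their start times.
words, current, start = [], "", None
for ch, ts in chars:
    if ch == " ":
        if current:
            words.append((current, start))
        current, start = "", None
    else:
        if not current:
            start = ts
        current += ch
if current:
    words.append((current, start))

print(words)  # hypothetical output: [("hello", 0.42), ("world", 0.88), ...]
```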

xiaomao2013 commented 2 years ago

I want to process video files with a long duration, but it seems the program can only process WAV files up to 15 seconds long. I also want to translate into several other languages, but that is limited in the same way. I look forward to you releasing open-source programs for these functions.

Online voice translation services often translate some content incorrectly, or deliberately mistranslate it, which distorts many people's understanding. This is a dark moment. I hope more people can get at the truth.

We have little choice but to use offline speech recognition and translation programs. In the face of deliberate misleading and harm, we can only use software and platforms that are out of their control. Right now I use Gettr.

xiaomao2013 commented 2 years ago

Thank you very much for providing open-source software that goes all the way from video and audio files to subtitle files.

I installed and used it, but I don't know which language it is supposed to recognize. I use English video files, but the results seem poor. Can you tell me? Thank you so much.

xiaomao2013 commented 2 years ago

Or can you tell me how to change the translation module to support other languages? Thanks a lot.

abhirooptalasila commented 2 years ago

I plan on implementing either Wav2Vec or NeMo, but I will need some time. DeepSpeech has official models for American English, and there are some community-made models here. If you find a model you want to use, download the .pbmm and .scorer files and give them as input to AutoSub.

AutoSub can process large video files too; it automatically segments the audio into smaller chunks.
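
For reference, a run with downloaded model files would look something like the command shown later in this thread; the file names here are placeholders:

```sh
python3 autosub/main.py \
    --model deepspeech-0.9.3-models.pbmm \
    --scorer deepspeech-0.9.3-models.scorer \
    --file video.mp4
```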

xiaomao2013 commented 2 years ago

Thank you very much for your guidance, and I hope to see the new program you write soon.

xiaomao2013 commented 2 years ago

Following your suggestion, I found and downloaded the corresponding model files: deepspeech-0.9.3-models-zh-CN.pbmm and deepspeech-0.9.3-models-zh-CN.scorer.

I downloaded them and ran a test, but the following error occurs.

If it's convenient, please test it to see how to get the model working.

Thank you so much.


    (sub) (base) gettr@gettr:~/AutoSub$ python3 autosub/main.py --model deepspeech-0.9.3-models-zh-CN.pbmm --scorer deepspeech-0.9.3-models-zh-CN.scorer --file ~/3-720.mp4
    ARGS: Namespace(dry_run=False, file='/home/gettr/3-720.mp4', format=['srt', 'vtt', 'txt'], model='deepspeech-0.9.3-models-zh-CN.pbmm', scorer='deepspeech-0.9.3-models-zh-CN.scorer', split_duration=5)
    Model: /home/gettr/AutoSub/deepspeech-0.9.3-models-zh-CN.pbmm
    Scorer: /home/gettr/AutoSub/deepspeech-0.9.3-models-zh-CN.scorer
    Input file: /home/gettr/3-720.mp4
    Extracted audio to audio/3-720.wav
    Splitting on silent parts in audio file

    Running inference:
    TensorFlow: v2.3.0-6-g23ad988
    DeepSpeech: v0.9.3-0-gf2e9c85
      0%|          | 0/17 [00:07<?, ?it/s]
    Traceback (most recent call last):
      File "autosub/main.py", line 165, in <module>
        main()
      File "autosub/main.py", line 156, in main
        ds_process_audio(ds, audio_segment_path, output_file_handle_dict, split_duration=args.split_duration)
      File "autosub/main.py", line 66, in ds_process_audio
        write_to_file(output_file_handle_dict, split_inferred_text, line_count, split_limits, cues)
      File "/home/gettr/AutoSub/autosub/writeToFile.py", line 43, in write_to_file
        file_handle.write(inferred_text + "\n\n")
    UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-88: surrogates not allowed

xiaomao2013 commented 2 years ago

I found some instructions, but I don't know how to follow them: LINK
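
For anyone hitting the same UnicodeEncodeError, a possible workaround (not an official fix, and based on the assumption that the Mandarin model emits byte-level output as surrogate escapes) is to sanitize the inferred text before the write in autosub/writeToFile.py shown in the traceback above. The sanitize helper below is hypothetical, not part of AutoSub.

```python
# Hypothetical workaround, not part of AutoSub: clean the inferred text before
# file_handle.write() in autosub/writeToFile.py (line 43 in the traceback above).
def sanitize(text: str) -> str:
    try:
        # If the surrogates are escapes for raw UTF-8 bytes (assumed, not
        # confirmed), turn them back into bytes and decode the real characters.
        return text.encode("utf-8", "surrogateescape").decode("utf-8")
    except UnicodeError:
        # Fallback: replace anything that still cannot be encoded so the write
        # no longer crashes, at the cost of losing those characters.
        return text.encode("utf-8", "replace").decode("utf-8")

# Usage at the failing call site:
# file_handle.write(sanitize(inferred_text) + "\n\n")
```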