Open luquitared opened 7 months ago
I just tried it, but it still seems to output English text for files larger than the samples in the repo; I wanted the output translated to Polish.
./main -m $(pwd)/models/ggml-large-v3-q5_0.bin -f samples/jfk.wav --output-srt -fa -t 8 -ng -l pl
whisper_init_from_file_with_params_no_state: loading model from '/home/xxxxx/whisper.cpp/models/ggml-large-v3-q5_0.bin'
whisper_init_with_params_no_state: use gpu = 0
whisper_init_with_params_no_state: flash attn = 1
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
whisper_model_load: loading model
whisper_model_load: n_vocab = 51866
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 128
whisper_model_load: ftype = 8
whisper_model_load: qntvr = 2
whisper_model_load: type = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs = 100
whisper_model_load: CPU total size = 1080.47 MB
whisper_model_load: model size = 1080.47 MB
whisper_backend_init: using BLAS backend
whisper_init_state: kv self size = 83.89 MB
whisper_init_state: kv cross size = 251.66 MB
whisper_init_state: kv pad size = 7.86 MB
whisper_init_state: compute buffer (conv) = 36.13 MB
whisper_init_state: compute buffer (encode) = 55.33 MB
whisper_init_state: compute buffer (cross) = 9.25 MB
whisper_init_state: compute buffer (decode) = 99.10 MB
system_info: n_threads = 8 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 0
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = pl, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] I tak, moi kościołowie Amerykanie, nie zapytajcie o to, co twoje kraj może zrobić dla ciebie, zapytajcie o to, co ty możesz zrobić dla twojego kraju.
output_srt: saving output to 'samples/jfk.wav.srt'
whisper_print_timings: load time = 805.01 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 24.72 ms
whisper_print_timings: sample time = 1343.19 ms / 261 runs ( 5.15 ms per run)
whisper_print_timings: encode time = 61213.75 ms / 1 runs (61213.75 ms per run)
whisper_print_timings: decode time = 56.54 ms / 1 runs ( 56.54 ms per run)
whisper_print_timings: batchd time = 477101.03 ms / 258 runs ( 1849.23 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 541045.50 ms
./main -m $(pwd)/models/ggml-large-v3-q5_0.bin -f target_file.wav --output-srt -ng -fa -t 8 -l pl
whisper_init_from_file_with_params_no_state: loading model from '/home/xxxxxxxx/whisper.cpp/models/ggml-large-v3-q5_0.bin'
whisper_init_with_params_no_state: use gpu = 0
whisper_init_with_params_no_state: flash attn = 1
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
whisper_model_load: loading model
whisper_model_load: n_vocab = 51866
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 128
whisper_model_load: ftype = 8
whisper_model_load: qntvr = 2
whisper_model_load: type = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs = 100
whisper_model_load: CPU total size = 1080.47 MB
whisper_model_load: model size = 1080.47 MB
whisper_backend_init: using BLAS backend
whisper_init_state: kv self size = 83.89 MB
whisper_init_state: kv cross size = 251.66 MB
whisper_init_state: kv pad size = 7.86 MB
whisper_init_state: compute buffer (conv) = 36.13 MB
whisper_init_state: compute buffer (encode) = 55.33 MB
whisper_init_state: compute buffer (cross) = 9.25 MB
whisper_init_state: compute buffer (decode) = 99.10 MB
system_info: n_threads = 8 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 0
main: processing 'target_file.wav' (116050261 samples, 7253.1 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = pl, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:22.060] Hi everyone. Thanks for joining today. [...]
A cool feature that might be worth exploring would be letting users translate into any target language, rather than just English.
Whisper's translate task was only trained to map the input language to English, but this repo shows that you can force Whisper to decode into a specific target language: https://github.com/Vaibhavs10/translate-with-whisper
It would be fun to test this with whisper.cpp (a rough sketch follows below).
Originally posted by @luquitared in https://github.com/ggerganov/whisper.cpp/issues/1219#issuecomment-1998577753
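For anyone who wants to try, here is a rough, unverified sketch of that test against the whisper.cpp C API, under a few assumptions: the model path is the one from the logs above, audio loading is stubbed with silence (real code would decode 16 kHz mono float PCM from the WAV), and the target language is forced the same way `-l pl` does it, by setting `language` on the full-params struct while keeping the transcribe task:

```c
#include <stdio.h>
#include <stdlib.h>
#include "whisper.h"

int main(void) {
    struct whisper_context_params cparams = whisper_context_default_params();
    cparams.use_gpu = false; // matching the -ng runs above

    struct whisper_context * ctx = whisper_init_from_file_with_params(
        "models/ggml-large-v3-q5_0.bin", cparams);
    if (ctx == NULL) {
        return 1;
    }

    struct whisper_full_params wparams =
        whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);

    // The trick from the linked repo: keep task = transcribe
    // (translate = false) but pin the language token to the desired
    // *target* language, nudging the decoder to emit Polish regardless
    // of the language actually spoken in the audio.
    wparams.language  = "pl";
    wparams.translate = false;

    // Placeholder input: 5 s of silence at 16 kHz so the sketch runs
    // end-to-end; swap in real PCM decoded from the input file.
    const int n_samples = 16000 * 5;
    float * pcm = calloc(n_samples, sizeof(float));

    if (whisper_full(ctx, wparams, pcm, n_samples) == 0) {
        for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
            printf("%s\n", whisper_full_get_segment_text(ctx, i));
        }
    }

    free(pcm);
    whisper_free(ctx);
    return 0;
}
```

If short clips come out in Polish but long files drift back to English (as in the two runs above), that might suggest the forced language token only steers the first decoding window(s).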