SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2

Large_v3 model can't change chunk_length? #624

Open YouWanHee opened 8 months ago

YouWanHee commented 8 months ago

Hello, SYSTRAN. I have a question about the chunk_length of Faster Whisper.

As the title suggests, the large-v3 model often produces errors when chunk_length is changed. The example I'm showing is the result of feeding a Korean audio recording into Faster Whisper. [image: git_chunk_length_30] The example sentence means, "STT is the process of recognizing voice and converting it into text.", and below it is the time taken. I am also attaching a capture of preprocessor_config.json. [image: chunk_length_30] Next is a capture taken after changing chunk_length to 10: the same sentence is repeated several times, and transcription takes several times longer. [image: GIT_chunk_length10] I am also attaching the code used for transcription. [image: code]

My question is whether the modifications required to support the large-v3 model prevent changing chunk_length. If chunk_length is not supposed to affect the output, can you explain why this is happening? Lastly, here are the versions of Faster Whisper, CTranslate2, tokenizers, and transformers I am using:

- Faster Whisper: 0.10.0
- CTranslate2: 3.23.0
- tokenizers: 0.13.3
- transformers: 4.29.2
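
For context, chunk_length is defined in the converted model's preprocessor_config.json. A minimal sketch for inspecting the relevant fields (the directory path is a placeholder; chunk_length and feature_size are the keys used in the Hugging Face Whisper preprocessor config):

```python
import json

# Placeholder path to the converted CTranslate2 model directory.
with open("faster-whisper-large-v3/preprocessor_config.json") as f:
    cfg = json.load(f)

print(cfg["chunk_length"])  # window length in seconds, 30 by default
print(cfg["feature_size"])  # number of mel bins: 128 for large-v3, 80 for earlier models
```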

Purfview commented 8 months ago

I don't remember exactly now, but I think initial_prompt="" is not the same as initial_prompt=None. And I think min_silence_duration_ms should be above ~3000 for better results.

Btw, large-v3 is prone to hallucinations, try large-v2.
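
A minimal sketch of these suggestions, assuming a placeholder audio file and the default model download:

```python
from faster_whisper import WhisperModel

# large-v2 instead of the hallucination-prone large-v3, as suggested above.
model = WhisperModel("large-v2")

segments, info = model.transcribe(
    "recording_ko.wav",                 # placeholder audio file
    language="ko",
    initial_prompt=None,                # explicitly None rather than ""
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=3000),  # over ~3000 as suggested
)
for segment in segments:
    print(segment.text)
```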

YouWanHee commented 8 months ago

> I don't remember exactly now, but I think initial_prompt="" is not the same as initial_prompt=None. And I think min_silence_duration_ms should be above ~3000 for better results.
>
> Btw, large-v3 is prone to hallucinations, try large-v2.

Thank you for your advice on initial_prompt and min_silence_duration_ms. But I was actually planning to use the v3 model because it performs better on Korean than large-v2.

So are you saying there is no good way to eliminate the hallucinations caused by reducing chunk_length in the v3 model? I am curious whether this issue is due to the changes made to the Faster Whisper feature extractor when the number of mel bins was increased from 80 to 128.

Purfview commented 8 months ago

> So are you saying there is no good way to eliminate the hallucinations caused by reducing chunk_length in the v3 model?

Why are you reducing chunk? IMO, smaller chunk = more hallucinations/transcription errors.

> I am curious whether this issue is due to the changes made to the Faster Whisper feature extractor when the number of mel bins was increased from 80 to 128.

That's a feature of the original model.

> The example sentence means, "STT is the process of recognizing voice and converting it into text."

Do you mean that it's a hallucination? Can you share an audio sample of it?

YouWanHee commented 8 months ago

> Why are you reducing chunk? IMO, smaller chunk = more hallucinations/transcription errors.

I would like to reduce the chunk length because I want to minimize padding audio out to 30-second segments. I expect most of the input audio to be under 10 seconds, so I'm trying to achieve the fastest possible response time.

> That's a feature of the original model.

I am aware that it is a feature of the original model, but I'm asking because, in my experience with other models, there was almost no hallucination unless the audio exceeded the chunk length.

> Do you mean that it's a hallucination? Can you share an audio sample of it?

The sentence is correct as it is. I just wanted to tell you what it says in English, since it's originally in Korean. The issue is that when chunk_length was set to 30, it worked well within 3 seconds without any hallucination; after reducing it to 10, the sentence started repeating multiple times. Also, the audio is not my voice but my colleague's; I will ask them and upload it if possible.

I'm sorry, I clicked it by mistake. ↓↓↓↓↓↓↓↓↓

syumi-1 commented 8 months ago

condition_on_previous_text: If True, the previous output of the model is provided as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop, such as repetition looping or timestamps going out of sync.

Try setting it like this:

```python
segments, info = model.transcribe(
    audio="whispertemp.mp3",
    beam_size=5,
    language=language,
    condition_on_previous_text=False,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
)
```
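
Note that segments is a lazy generator in faster-whisper: nothing is actually transcribed until it is consumed, for example:

```python
# Transcription runs as the generator is iterated.
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```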

YouWanHee commented 8 months ago

> condition_on_previous_text: If True, the previous output of the model is provided as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop, such as repetition looping or timestamps going out of sync.

Sorry for the late reply, it was Christmas. I'll give it a try. Thank you for the helpful information. I will also post an update here on the outcome.

Unfortunately, even after trying various audio files, the hallucinations did not disappear. Is there a way to speed up recognition of short sentences, other than reducing the beam size or chunk length, or using a smaller model? I am already reducing the beam size, because I need the large model for its high accuracy on Korean.

blackpolarz commented 8 months ago

Really late to the thread, but I hope the following information helps.

1) As far as I know, the model itself is trained on 30-second audio chunks, so reducing the padding significantly decreases its accuracy (see the first sketch after this list). For audio files under 10 seconds, transcription shouldn't take much time on a decent CPU, or better yet a GPU.

2) I assume the hallucinations you are referring to are the repeated transcriptions. This is a side effect of the audio chunks being too small: the model attempts to "fill" them up. I faced the same issue, and what I did to rectify it was to increase repetition_penalty, keep a memory of the transcription/translation, and post-process by removing repeated sentences (the naive approach, but computationally cheap; see the second sketch after this list). Another method would be to use the timestamps and log-mel features, weigh them, and reconstruct the transcription (more accurate, but really computationally expensive).

3) Reducing the beam size is a good way to speed up recognition. Other than that, you can reduce best_of to 1, and the number of temperatures can also be reduced to slightly speed up inference (see the last sketch below). If you have a GPU, you can also run the model on CUDA.
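
To illustrate point 1, a toy sketch of the 30-second windowing, assuming Whisper's usual 16 kHz input (this mirrors openai-whisper's pad_or_trim, not Faster Whisper internals):

```python
import numpy as np

SAMPLE_RATE = 16000                      # Whisper models expect 16 kHz audio
CHUNK_LENGTH = 30                        # seconds per window the model was trained on
N_SAMPLES = SAMPLE_RATE * CHUNK_LENGTH   # 480,000 samples

def pad_or_trim(audio: np.ndarray) -> np.ndarray:
    """Zero-pad or trim the waveform to exactly one 30-second window."""
    if len(audio) >= N_SAMPLES:
        return audio[:N_SAMPLES]
    return np.pad(audio, (0, N_SAMPLES - len(audio)))
```

For point 2, a naive sketch of the post-processing idea: raise repetition_penalty (a real transcribe parameter) and drop consecutive repeated sentences. The dedup helper and audio file name are hypothetical:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2")

def drop_consecutive_repeats(texts):
    """Hypothetical helper: drop sentences repeated back to back."""
    out = []
    for text in texts:
        if not out or text.strip() != out[-1].strip():
            out.append(text)
    return out

segments, info = model.transcribe(
    "recording_ko.wav",        # placeholder audio file
    repetition_penalty=1.2,    # values > 1.0 penalize repeated tokens
    condition_on_previous_text=False,
)
print(" ".join(drop_consecutive_repeats([s.text for s in segments])))
```

And for point 3, a sketch of the speed-oriented settings mentioned above: a smaller beam, best_of=1, a single temperature, and CUDA with float16 weights:

```python
from faster_whisper import WhisperModel

# Run on GPU with 16-bit weights for faster inference.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "recording_ko.wav",    # placeholder audio file
    beam_size=1,           # smaller beam -> faster decoding
    best_of=1,             # single candidate when sampling fallback triggers
    temperature=0.0,       # one temperature instead of the default fallback ladder
)
for segment in segments:
    print(segment.text)
```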