EtienneAb3d / WhisperHallu

Experimental code: sound file preprocessing to optimize Whisper transcriptions without hallucinated texts

Empty Strings #30

Closed thegamercoder10 closed 3 months ago

thegamercoder10 commented 4 months ago

I use Faster Whisper for Azerbaijani with a beam size of 5, a patience of 2, and a hallucination silence threshold of 0.1, together with a fine-tuned prompt and self-made markers, on audio clips shorter than 10 s. Surprisingly, with the markers I get an empty string (feeding the marked audio to model.transcribe just returns "O.K. Whisper Whisper O.K."), whereas without them it sometimes produces the correct transcript and sometimes the "Thanks for watching" hallucination.

So, what do you suggest?

Thanks in advance!

EtienneAb3d commented 4 months ago

In your prompt, did you put text that could be the same as the spoken content? In that case, Whisper will often drop the spoken transcription because it treats it as an unwanted repetition of the prompt. Try without a prompt.

thegamercoder10 commented 4 months ago

Yes, unfortunately, it didn't help. Interestingly enough, I just fed .MRK.wav.CPS.wav to model.transcribe without any additional arguments and received "O.K. Whisper Whisper O.K." with nothing between them, whereas feeding .SILCUT.wav.VAD.wav resulted in a correct transcription. This behaviour mostly happens on 8-9 s audios rather than on 1-5 s audios, which I find rather strange.

thegamercoder10 commented 4 months ago

And yes, without the markers in short audios (sometimes in long ones as well) "Thanks for watching" appears.

EtienneAb3d commented 4 months ago

Try beamSize 1 or 2. Whisper doesn't really like to increase it.

thegamercoder10 commented 4 months ago

But what about patience? 2 or 0? What is preferred decoding, in your opinion?

thegamercoder10 commented 4 months ago

And should I toggle condition_on_previous_text, compression_ratio_threshold, no_speech_threshold, hallucination_silence_threshold?

thegamercoder10 commented 4 months ago

@EtienneAb3d I tried with no prompt and beam size of 2, it didn't help either. Again, just O.K. Whisper Whisper O.K. is fetched.

EtienneAb3d commented 4 months ago

Often parameters are interesting to play with, in order to view their effects on the result, but simply using Whisper's default values brings better results. Which version of Whisper do you use? Large? Depending on use cases, V2 is often recognized to be better than V3.
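
As a rough illustration of the advice above, the sketch below contrasts a call that relies on faster-whisper's defaults with the tuned settings described at the top of this thread. The helper names `transcribe_text` and `describe`, the `"az"` language code, and the mapping of the reported settings to argument names are assumptions for illustration. The `faster_whisper` import is deferred so the settings can be inspected without loading a model.

```python
# Settings reported at the start of this thread, mapped (as an assumption)
# onto faster-whisper's transcribe() keyword arguments.
TUNED = {
    "beam_size": 5,
    "patience": 2,
    # According to the faster-whisper docstring, this threshold is only
    # applied when word_timestamps=True is also passed.
    "hallucination_silence_threshold": 0.1,
}

# Empty dict: pass nothing and let the library use its own defaults.
DEFAULTS = {}


def transcribe_text(path, model_size="large-v2", language="az", **options):
    """Return the joined segment texts for one audio file.

    Hypothetical helper: needs the faster-whisper package installed and
    downloads the model on first use.
    """
    from faster_whisper import WhisperModel  # deferred import

    model = WhisperModel(model_size)
    # transcribe() returns a lazy generator of segments plus an info object.
    segments, _info = model.transcribe(path, language=language, **options)
    return " ".join(segment.text.strip() for segment in segments)


def describe(options):
    """Readable summary of the keyword arguments that would be passed."""
    if not options:
        return "library defaults"
    return ", ".join(f"{key}={value}" for key, value in sorted(options.items()))
```

Calling `transcribe_text("clip.wav", **DEFAULTS)` and `transcribe_text("clip.wav", **TUNED)` on the same file (the file name is a placeholder) gives a direct comparison of the two configurations.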

thegamercoder10 commented 4 months ago

I use Faster Whisper large-v3. Hmm, let me check v2 then, with markers.

thegamercoder10 commented 4 months ago

@EtienneAb3d Now it works, but the transcription quality is a bit low.

thegamercoder10 commented 4 months ago

@EtienneAb3d BTW, what do you suggest? I have divided the same audio into separate chunks, hence the emphasis on lengths such as <10 s that you may have noticed. My objective is to transcribe each chunk separately for certain purposes. I think providing the previous chunk's transcription for the current one would make it more accurate. But adding it to the prompt that needs to be fine-tuned (namely, OK Whisper) makes the prompt longer, increasing the processing time. So how can I get the most accurate transcription possible for each chunk?

thegamercoder10 commented 4 months ago

@EtienneAb3d After further experimenting with <10s chunks, even large-v2 with a beam_size of 2 started giving no output except "ok whisper whisper ok". It's quite strange, as everything worked perfectly and then suddenly stopped.

EtienneAb3d commented 4 months ago

You will certainly gain in quality by:

thegamercoder10 commented 4 months ago

The issue is that I am trying to build a dataset for TTS, and shorter audio clips should be preferable.

EtienneAb3d commented 4 months ago

Perhaps you may try longer files, extract them with timestamps (of course not very accurate with Whisper), and then cut the files afterward using these timestamps. You will then be able to cut at various lengths for each.
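
The approach suggested above could be sketched as follows, assuming word-level timestamps are already available as `(start, end, text)` tuples (for example from faster-whisper's `word_timestamps=True` option). `group_words` and `cut_wav` are hypothetical helper names, and the 10-second limit simply mirrors the clip length mentioned in this thread. The cutting step uses Python's standard `wave` module, so it only handles uncompressed PCM WAV files.

```python
import wave


def group_words(words, max_len=10.0):
    """Group (start, end, text) word tuples into clips of at most max_len seconds.

    A new clip starts whenever adding the next word would make the
    current clip longer than max_len.
    """
    clips = []
    current = []
    for start, end, text in words:
        if current and end - current[0][0] > max_len:
            clips.append(current)
            current = []
        current.append((start, end, text))
    if current:
        clips.append(current)
    # Each result: (clip_start, clip_end, transcript)
    return [
        (clip[0][0], clip[-1][1], " ".join(word[2].strip() for word in clip))
        for clip in clips
    ]


def cut_wav(src_path, dst_path, start, end):
    """Copy the span from start to end (in seconds) of a PCM WAV file."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        first = int(start * rate)
        count = max(0, int(end * rate) - first)
        # Clamp the seek position so it never goes past the end of the file.
        src.setpos(min(first, src.getnframes()))
        frames = src.readframes(count)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        # The frame count in the header is corrected when the file is closed.
        dst.writeframes(frames)
```

Since Whisper timestamps are only approximate, it may be worth padding each clip boundary slightly or cutting at the longest pause between words rather than at a fixed limit.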

thegamercoder10 commented 4 months ago

@EtienneAb3d Then WhisperHallu with mode = 3 should be used?

BTW, I suspect that the quantization in faster-whisper may have been the culprit for this issue.

EtienneAb3d commented 4 months ago

Yes, for long files, mode = 3

thegamercoder10 commented 4 months ago

@EtienneAb3d It's so peculiar that the same audio with markers, fed to the same large-v2 Whisper, works perfectly fine with a prompt in Colab, yet behaves much more strangely and worse on my computer...

So, I just took .MRK.WAV.CPS.WAV and fed it to model.transcribe on my computer:

1) with prompt: Just Ok, whisper

2) w/o prompt: low quality transcription

On colab:

1) with prompt: high quality transcription

2) w/o prompt: low quality transcription

EtienneAb3d commented 4 months ago

Are you sure all parameters and conditions are strictly identical? That said, Whisper is not stable from one attempt to another. Try transcribing several times with both setups so that you can evaluate them on a larger set of results.
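
One way to follow this suggestion is to run the same file several times and summarise how often the outputs agree. The sketch below is only an illustration: `repeat_transcribe` is a hypothetical helper, and `transcribe` stands for any callable that takes a path and returns the transcript as a string (for instance, a small wrapper around `model.transcribe`).

```python
from collections import Counter


def repeat_transcribe(transcribe, path, runs=5):
    """Call transcribe(path) several times and summarise the outputs.

    `transcribe` is any callable returning a string; `runs` must be >= 1.
    """
    outputs = [transcribe(path).strip() for _ in range(runs)]
    counts = Counter(outputs)
    most_common_text, most_common_count = counts.most_common(1)[0]
    return {
        "runs": runs,
        # Number of runs that produced an empty transcript.
        "empty": sum(1 for text in outputs if not text),
        # Number of different transcripts seen across all runs.
        "distinct": len(counts),
        "most_common": most_common_text,
        # Share of runs that produced the most common transcript.
        "agreement": most_common_count / runs,
    }
```

Running this once on the local machine and once in Colab, with the same file and settings, would show whether the difference persists across runs or is just run-to-run variation.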

thegamercoder10 commented 4 months ago

Yes, I even reinstalled the environment to match that of Colab. On my computer, it just seems to work better with a higher beam size and without a prompt.

thegamercoder10 commented 4 months ago

@EtienneAb3d I tried feeding it a 5-minute excerpt of audio with mode = 3, but the prompt remains. The transcript is of medium-to-high quality, but at the end it started hallucinating with word repetition.

EtienneAb3d commented 4 months ago

Azerbaijani is certainly not the best-supported language in Whisper. Perhaps you could try another technology, like SM4T. See the ReadMe instructions.

thegamercoder10 commented 4 months ago

SM4T is for translation. For transcribing, it still uses Whisper as far as I know

EtienneAb3d commented 4 months ago

As far as I know, SM4T is a model with no link with Whisper: multi-lingual and multi-modal (text+voice input+output).

thegamercoder10 commented 4 months ago

Unfortunately, Seamless Communication cannot be installed on my Windows machine, as its dependency fairseq2 is unsupported there.

EtienneAb3d commented 4 months ago

As explained on their page, you may try WSL: Windows Subsystem for Linux

thegamercoder10 commented 3 months ago

In the end, I used stable-ts, as it includes everything needed for processing longer audio files as you suggested. Indeed, dividing segments into words is much faster and more reliable than processing little chunks.