SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Different output of faster-whisper vs whisper-standalone with faster-whisper #1027

Closed: tjongsma closed this issue 1 month ago

tjongsma commented 1 month ago

Hi there,

I've recently been trying to integrate faster-whisper into a Python application for streaming. I noticed that (especially for short clips) whisper-standalone (which I've been using for a while with faster-whisper) produces significantly better output than faster-whisper itself. I've tried to match the parameters as closely as possible, mostly by filling whisper-standalone's default values into the faster-whisper call (where possible; temperature_increment_on_fallback and vad_window_size_samples don't have direct equivalents as far as I can tell, though see the temperature-list sketch after my code below). Somehow whisper-standalone still massively outperforms it on this one somewhat noisy audio file of ~29 s: faster-whisper hallucinates a lot, while whisper-standalone (which uses faster-whisper) correctly transcribes the text.

Here's my whisper-standalone command: Whisper\whisper-faster.exe "audio_files\temp_audio.wav" --task=transcribe --model=medium --output_dir="C:\Users\test\Desktop\Audio_transcription\audio_transcripts" --output_format=txt --beam_size=5 --temperature=0 --language=nl --word_timestamps=True --initial_prompt="" --compute_type float16 --vad_filter=False

And here's my code for faster-whisper in python:

from faster_whisper import WhisperModel
import os
import sys

base_dir = os.path.dirname(os.path.abspath(sys.argv[0]))

# Load the model from a local snapshot of Systran/faster-whisper-medium
model_size = "medium"
cache_dir = os.path.join(base_dir, "models", "cache")
model_dir = os.path.join(base_dir, "models", "cache", "models--Systran--faster-whisper-medium", "snapshots", "08e178d48790749d25932bbc082711ddcfdfbc4f")
model = WhisperModel(model_dir, device="cuda", compute_type="float16", download_root=cache_dir)

class WordWithTimestamp:
    def __init__(self, word, start, end):
        self.word = word
        self.start = start
        self.end = end

    def __str__(self):
        return self.word
segments, info = model.transcribe("temp_audio.wav", initial_prompt="",
                                  # max_new_tokens=224,
                                  beam_size=5,
                                  temperature=0,
                                  language="nl",
                                  length_penalty=1,
                                  condition_on_previous_text=True,
                                  prompt_reset_on_temperature=0.5,
                                  # temperature_increment_on_fallback=0.2,  # no direct equivalent in faster-whisper
                                  compression_ratio_threshold=2.4,
                                  log_prob_threshold=-1,
                                  no_speech_threshold=0.6,
                                  vad_filter=False,
                                  word_timestamps=True)
text = []
# Iterate over the segments and store words with timestamps
for segment in segments:
    for word in segment.words:
        text.append(WordWithTimestamp(word.word, word.start, word.end))
text_str = "".join(str(word) for word in text)
print(text_str)
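
For what it's worth, the closest analogue of temperature_increment_on_fallback I could find is passing a sequence of temperatures to transcribe(); my reading of the API is that faster-whisper then tries them in order whenever decoding falls back (I haven't confirmed it changes anything for this clip):

# A rough stand-in for --temperature_increment_on_fallback=0.2:
# faster-whisper accepts a list of temperatures and walks through them
# on fallback (its documented default is [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]).
segments, info = model.transcribe("temp_audio.wav",
                                  beam_size=5,
                                  temperature=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
                                  language="nl",
                                  word_timestamps=True)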

Any insights would be appreciated! I'm basically just wondering how whisper-standalone can perform so much better when, as I understand it, it uses the same backend. Enabling VAD gives me the exact same results, so the difference has to lie somewhere else.

Jiltseb commented 1 month ago

It would be great if you could share the audio here; I will have a look. By whisper-standalone, did you mean OpenAI Whisper (the original repository)? Also, have you tried with vad_filter=True?

tjongsma commented 1 month ago

Thank you so much! I couldn't upload a .wav, so I converted it using https://www.freeconvert.com/wav-to-mp4/download. Here's the file: https://github.com/user-attachments/assets/809ce6e2-d0fe-466d-ac17-1576ef30e1d5

It's a mic recording of a YouTube video of some random podcast I found; I use it because it has multiple speakers in a somewhat realistic environment (link: https://www.youtube.com/watch?v=Y-8QyPNyumk). The output I get for faster-whisper is:

Deze hele uitzending is naar strak van het slechte nieuws aangenaam. Deze hele uitzending is naar strak van het slechte nieuws aangenaam. Deze hele uitzending is naar strak van het slechte nieuws aangenaam. Deze hele uitzending is naar strak van het slechte nieuws aangenaam. Deze hele uitzending is naar strak van het slechte nieuws aangenaam. Deze hele uitzending is naar strak van het slechte nieuws aangenaam. Deze hele uitzending is naar strak van het slechte nieuws aangenaam. Deze hele uitzending is naar strak van het slechte nieuws 
aangenaam. Deze hele uitzending is naar strak van het slechte nieuws aangenaam. Bedankt voor het kijken

Whereas for whisper-standalone it is:

[00:03.860 --> 00:14.300]  Deze hele uitzending is naar strak van het slechte nieuws aangenaam.
[00:14.840 --> 00:17.480]  Het is nog nooit echt zo extreem geweest als in deze uitzending.
[00:17.720 --> 00:19.760]  Ik denk dat de relatie van mijn eigen redactie heel slecht gaat.
[00:19.880 --> 00:23.660]  Dat ze de bloem voor ons steeds bevonden om een hele depressie waard zijn om te maken.
[00:23.860 --> 00:26.240]  Oh nee, dat is dus niet... Nee, het gaat goed. Oké, niks aan hand.
[00:26.380 --> 00:27.620]  Oh, ze willen niks over zeggen, zie ik nou.

You might not get 100% the same results, as the file was converted to mp4. And yes, I did try vad_filter=True (for both implementations); I get very similar results.

tjongsma commented 1 month ago

Oh, and sorry, I didn't see this part. By whisper-standalone I mean https://github.com/Purfview/whisper-standalone-win.

Jiltseb commented 1 month ago

I don't see any problem with the model. You are only returning self.word; the words are appended together and joined into a single text string, so the timestamps are definitely not included in the output!

I ran the code and got exactly the same results as your standalone whisper:

[00:03.860 --> 00:14.300]  Deze hele uitzending is naar strak van het slechte nieuws aangenaam.
[00:14.840 --> 00:17.480]  Het is nog nooit echt zo extreem geweest als in deze uitzending.
[00:17.720 --> 00:19.760]  Ik denk dat de relatie van mijn eigen redactie heel slecht gaat.
[00:19.880 --> 00:23.660]  Dat ze de bloem voor ons steeds bevonden om een hele depressie waard zijn om te maken.
[00:23.860 --> 00:26.240]  Oh nee, dat is dus niet... Nee, het gaat goed. Oké, niks aan hand.
[00:26.380 --> 00:27.620]  Oh, ze willen niks over zeggen, zie ik nou. 

After you get the segments, just iterate over them:

for seg in segments:
    print("[%.2fs -> %.2fs] %s" % (seg.start, seg.end, seg.text))

You don't need the class WordWithTimestamp.
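
If you do want the word-level timestamps, the Word objects that faster-whisper yields already carry them; a minimal sketch (keep in mind that segments is a generator, so it can only be iterated once):

for seg in segments:
    for word in seg.words:  # populated because word_timestamps=True
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))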

tjongsma commented 1 month ago

Ah, thank you for checking! I'm aware the timestamps are not included; I was just referring to the text, apologies for the confusion. The fact that you're getting the better output is quite odd. Would you mind sharing your full code? Perhaps it has something to do with your parameters or with how you convert/load the .mp4. When I run with your iterator and everything else as posted before, I again get:

[3.64s -> 14.30s]  Deze hele uitzending is naar strak van het slechte nieuws aangenaam.
[14.84s -> 16.76s]  Deze hele uitzending is naar strak van het slechte nieuws aangenaam.
[16.76s -> 17.44s]  Deze hele uitzending is naar strak van het slechte nieuws aangenaam.
[17.66s -> 18.10s]  Deze hele uitzending is naar strak van het slechte nieuws aangenaam.
[18.10s -> 18.46s]  Deze hele uitzending is naar strak van het slechte nieuws aangenaam.
[18.46s -> 19.56s]  Deze hele uitzending is naar strak van het slechte nieuws aangenaam.
[19.56s -> 21.40s]  Deze hele uitzending is naar strak van het slechte nieuws aangenaam.
[21.40s -> 24.92s]  Deze hele uitzending is naar strak van het slechte nieuws aangenaam.
[25.52s -> 27.68s]  Deze hele uitzending is naar strak van het slechte nieuws aangenaam.
[27.68s -> 27.78s]  Bedankt voor het kijken

The class is there just because I'm using the timestamps in my streaming application, but it's good to know I don't need it.
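
(If I do keep a wrapper, a dataclass is a bit lighter; same behavior, just less boilerplate:)

from dataclasses import dataclass

@dataclass
class WordWithTimestamp:
    word: str
    start: float
    end: float

    def __str__(self):
        return self.word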

Jiltseb commented 1 month ago

Here is the code:

from faster_whisper import WhisperModel
model_name = "medium"
model = WhisperModel(model_name, device="cuda", compute_type="float16")
path ="/path_to_your.mp4"

segments, info = model.transcribe(path, initial_prompt="",
                                        beam_size=5,
                                        temperature=0,
                                        language="nl",
                                        length_penalty=1,
                                        condition_on_previous_text=True,
                                        prompt_reset_on_temperature=0.5,
                                        compression_ratio_threshold=2.4,
                                        log_prob_threshold=-1,
                                        no_speech_threshold=0.6,
                                        vad_filter=False,
                                        word_timestamps=True)

for seg in segments:
    print("[%.2fs -> %.2fs] %s" % (seg.start, seg.end, seg.text))

Actually, the outputs are a bit different (copy-pasting the correct one here), but mine has none of the hallucinated repetition I see in your comment:

[3.88s -> 17.48s]  Deze hele uitzending is daar strak van, dat slechte nieuws aangenaam is er nog nooit echt zo extreem geweest als in deze uitzending.
[17.72s -> 23.66s]  Ik denk dat de relatie van mijn eigen redactie heel slecht gaat, dat ze de bloem voor ons steeds bevonden om een hele, nou, depressie waard zijn het wel te maken.
[23.88s -> 27.62s]  Oh nee, dat is dus niet, nee, het gaat goed, oké, niks aan de hand. Oh, zullen we niks over zeggen, zie ik nou.

tjongsma commented 1 month ago

That's interesting, because as far as I know faster_whisper doesn't support .mp4, right? Using your code I get:

Traceback (most recent call last):
  File "c:\Users\tjong\Desktop\Audio_transcription_dev\whisper_streaming\git_test.py", line 6, in <module>
    segments, info = model.transcribe(path, initial_prompt="",
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tjong\Desktop\Audio_transcription_dev\Resemblyzer-master\.venv\Lib\site-packages\faster_whisper\transcribe.py", line 838, in transcribe
    audio = decode_audio(audio, sampling_rate=sampling_rate)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tjong\Desktop\Audio_transcription_dev\Resemblyzer-master\.venv\Lib\site-packages\faster_whisper\audio.py", line 26, in decode_audio    
    waveform, audio_sf = torchaudio.load(input_file)  # waveform: channels X T
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tjong\Desktop\Audio_transcription_dev\Resemblyzer-master\.venv\Lib\site-packages\torchaudio\_backend\utils.py", line 205, in load      
    return backend.load(uri, frame_offset, num_frames, normalize, channels_first, format, buffer_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tjong\Desktop\Audio_transcription_dev\Resemblyzer-master\.venv\Lib\site-packages\torchaudio\_backend\soundfile.py", line 27, in load
    return soundfile_backend.load(uri, frame_offset, num_frames, normalize, channels_first, format)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tjong\Desktop\Audio_transcription_dev\Resemblyzer-master\.venv\Lib\site-packages\torchaudio\_backend\soundfile_backend.py", line 221, in load
    with soundfile.SoundFile(filepath, "r") as file_:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tjong\Desktop\Audio_transcription_dev\Resemblyzer-master\.venv\Lib\site-packages\soundfile.py", line 658, in __init__
    self._file = self._open(file, mode_int, closefd)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tjong\Desktop\Audio_transcription_dev\Resemblyzer-master\.venv\Lib\site-packages\soundfile.py", line 1216, in _open
    raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening 'temp_audio.mp4': Format not recognised.

If I replace the path with the .wav, I get the same output (the hallucinations) again.

Jiltseb commented 1 month ago

faster_whisper supports video formats such as .mp4. If you are using the main repo, the audio is loaded via torchaudio; if you have ffmpeg installed on your system, it should work.
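
If you want to rule out the loading path entirely, you can also decode the file yourself and hand transcribe() the raw waveform; a minimal sketch using faster_whisper's own decode_audio (it resamples to 16 kHz mono and still needs a working ffmpeg for containers such as .mp4):

from faster_whisper import decode_audio

# decode_audio returns a float32 waveform that transcribe() accepts
# directly in place of a file path.
audio = decode_audio("temp_audio.mp4", sampling_rate=16000)
segments, info = model.transcribe(audio, language="nl", beam_size=5)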

tjongsma commented 1 month ago

Alright, thanks so much! It's a bit of a weird one, but you helped me figure out the problem. I had installed your experimental branch that supports batched inference, and for some reason that changed my results even when not using batched inference. Re-installing faster_whisper from the main branch fixed this, and now I get the expected results. So this is solved for me; maybe it's worth a warning on the batched branch, though?

Jiltseb commented 1 month ago

I am not sure which branch you mean, but experimental branches are prone to changes during review. Anyway, I am glad that it worked!