A part of the beginning of my audio was cut

ubanning commented 1 year ago

Hello, first of all, thank you very much for your work on this project. It really is much faster and uses less RAM and VRAM. While testing, however, I found that a significant part of my audio is missing from the transcription. The audio is in Portuguese and is 13 minutes long, and the problem appears to occur only at the beginning. Is there a way to solve this? I used the following code:

from faster_whisper import WhisperModel

model_path = "whisper-medium-ct2/"

# Run on GPU with FP16
model = WhisperModel(model_path, device="cuda", compute_type="float16")

# or run on GPU with INT8
# model = WhisperModel(model_path, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_path, device="cpu", compute_type="int8")

segments, info = model.transcribe("jota.mp3", beam_size=5)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%ds -> %ds] %s" % (segment.start, segment.end, segment.text))

The result I got running the standard version of Whisper on the same medium model:

[00:00.000 --> 00:03.500]  Afinal de contas, imprimir dinheiro gera ou não gera inflação?
--------------------- CUT ---------------------
[00:03.500 --> 00:05.500]  É isso que a gente vai responder neste vídeo.
[00:05.500 --> 00:10.500]  Música
[00:10.500 --> 00:13.500]  Muito bem, todos aqueles que estão chegando agora aqui no canal, meu nome é Fernando Urch,
[00:13.500 --> 00:16.500]  aqui a gente fala de economia, mercados e investimentos, se vocês gostarem do conteúdo,
[00:16.500 --> 00:21.000]  considerem se inscrever, ativando o sininho aqui embaixo e também compartilhando este vídeo.
[00:21.000 --> 00:24.500]  Pois o assunto de inflação é recorrente aqui no canal pela sua importância,
[00:24.500 --> 00:30.500]  o impacto que tem na nossa vida financeira, profissional, na economia, na vida em sociedade.
--------------------- CUT ---------------------
[00:30.500 --> 00:34.500]  E o debate em torno da relação entre impressão de moeda e inflação,
[00:34.500 --> 00:38.500]  ele ressurge de tempos em tempos, como foi lá no início da pandemia,
[00:38.500 --> 00:42.500]  quando muitos economistas, banqueiros centrais, políticos,
[00:42.500 --> 00:47.500]  de vários espectros ideológicos, esquerda e direita, com raríssimas exceções,
[00:47.500 --> 00:50.500]  afirmavam categoricamente que imprimir dinheiro
[00:50.500 --> 00:54.500]  não geraria inflação naquele momento, naquelas circunstâncias.
[00:54.500 --> 00:57.500]  E a verdade é que não é tão simples responder essa pergunta,
[00:57.500 --> 01:02.500]  porque imprimir dinheiro não necessariamente vai gerar inflação,
[01:02.500 --> 01:05.500]  depende de outros fatores, depende das circunstâncias.
[01:05.500 --> 01:09.500]  Mas sim que imprimir dinheiro é sempre um fator inflacionário.

The result I got running this faster version of Whisper:

Detected language 'pt' with probability 0.996094
[0s -> 3s]  Afinal de contas, imprimir dinheiro gera ou não gera inflação?
[30s -> 36s]  E o debate em torno da relação entre impressão de moeda e inflação resurge de tempos em tempos,
[36s -> 42s]  como foi lá no início da pandemia, quando muitos economistas, banqueiros centrais, políticos,
[42s -> 47s]  de vários espectros ideológicos, esquerda e direita, com raríssimas exceções,
[47s -> 54s]  afirmavam categoricamente que imprimir dinheiro não geraria inflação naquele momento, naquelas circunstâncias.
[54s -> 58s]  E a verdade é que não é tão simples responder essa pergunta, porque
[58s -> 65s]  imprimir dinheiro não necessariamente vai gerar inflação, depende de outros fatores, depende das circunstâncias.
[65s -> 69s]  Mas sim que imprimir dinheiro é sempre um fator inflacionário.
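
A jump like the one from 3s to 30s in this output can also be found programmatically. The following is a minimal sketch, not part of faster-whisper: `find_gaps` and the 2-second threshold are illustrative choices, and the timestamps are copied by hand from the output above.

```python
from typing import Iterable, List, Tuple


def find_gaps(
    spans: Iterable[Tuple[float, float]], min_gap: float = 2.0
) -> List[Tuple[float, float]]:
    """Return (gap_start, gap_end) pairs where consecutive spans are
    separated by at least min_gap seconds (hypothetical helper)."""
    gaps = []
    previous_end = None
    for start, end in spans:
        if previous_end is not None and start - previous_end >= min_gap:
            gaps.append((previous_end, start))
        # Track the furthest end seen so far, in case spans overlap.
        previous_end = end if previous_end is None else max(previous_end, end)
    return gaps


# Timestamps taken from the faster-whisper output above (whole seconds).
faster_whisper_spans = [
    (0, 3), (30, 36), (36, 42), (42, 47), (47, 54), (54, 58), (58, 65), (65, 69),
]
print(find_gaps(faster_whisper_spans))  # prints [(3, 30)]
```

With a real transcription, the same function could presumably be fed `(segment.start, segment.end)` pairs built from the `segments` iterable. A reported gap is only a hint: it can also be a genuine stretch of silence or music.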

As you can see, a part at the beginning of my audio was cut off. In case you want to test with my audio to see whether you get the same results: https://www.dropbox.com/s/m0q30hmzbx6mvt2/jota.mp3?dl=1 Another question: is it possible to get the output as an SRT or VTT file, as with the standard Whisper? Thank you very much.

guillaumekln commented 1 year ago

Hi,

Thank you for sharing the input audio file. I can reproduce the output and will try to understand why there is a difference.

@ItakeLs It looks like you are facing the same issue?

> And another question, is it possible to get the return as an srt or vtt file, like the standard Whisper?

Yes, you can just copy the relevant functions from the original Whisper implementation. For example:

def format_timestamp(seconds, always_include_hours=False, decimal_marker="."):
    assert seconds >= 0, "non-negative timestamp expected"
    milliseconds = round(seconds * 1000.0)

    hours = milliseconds // 3_600_000
    milliseconds -= hours * 3_600_000

    minutes = milliseconds // 60_000
    milliseconds -= minutes * 60_000

    seconds = milliseconds // 1_000
    milliseconds -= seconds * 1_000

    hours_marker = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return f"{hours_marker}{minutes:02d}:{seconds:02d}{decimal_marker}{milliseconds:03d}"

def write_srt(file, segments):
    for i, segment in enumerate(segments, start=1):
        start_time = format_timestamp(segment.start, always_include_hours=True, decimal_marker=",")
        end_time = format_timestamp(segment.end, always_include_hours=True, decimal_marker=",")

        file.write("%d\n" % i)
        file.write("%s --> %s\n" % (start_time, end_time))
        file.write(segment.text.strip().replace("-->", "->"))
        file.write("\n\n")

segments, _ = model.transcribe("jota.mp3", beam_size=5)
# Write as UTF-8 so accented characters are saved correctly on every platform.
with open("audio.srt", "w", encoding="utf-8") as srt_file:
    write_srt(srt_file, segments)
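
The question also mentioned VTT. A minimal sketch along the same lines is shown here, assuming the usual WebVTT layout: a `WEBVTT` header line, then unnumbered cues with a dot before the milliseconds. The names `write_vtt`, `vtt_timestamp`, and the `Segment` stand-in are illustrative and not part of faster-whisper. The block repeats a compact timestamp helper so that it runs on its own.

```python
import io
from collections import namedtuple

# Stand-in for the objects yielded by model.transcribe(); only the
# start, end and text attributes used in this thread are assumed.
Segment = namedtuple("Segment", ["start", "end", "text"])


def vtt_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm (WebVTT uses a dot before the milliseconds)."""
    milliseconds = round(seconds * 1000.0)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{milliseconds:03d}"


def write_vtt(file, segments):
    """Write segments as WebVTT cues to an open text file."""
    file.write("WEBVTT\n\n")
    for segment in segments:
        file.write(f"{vtt_timestamp(segment.start)} --> {vtt_timestamp(segment.end)}\n")
        # A literal "-->" inside the cue text would be read as a timing line.
        file.write(segment.text.strip().replace("-->", "->"))
        file.write("\n\n")


# Small in-memory demonstration with two segments from the transcript in this thread.
demo = [
    Segment(0.0, 3.5, " Afinal de contas, imprimir dinheiro gera ou não gera inflação?"),
    Segment(3.5, 5.5, " É isso que a gente vai responder neste vídeo."),
]
buffer = io.StringIO()
write_vtt(buffer, demo)
print(buffer.getvalue())
```

To write a real file, the `StringIO` buffer would be replaced with a file opened in text mode with UTF-8 encoding, and `demo` with the `segments` returned by `model.transcribe`.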

ItakeLs commented 1 year ago

Yes, this is the same issue I am seeing. I made a Colab notebook to reproduce the error, and I described it in more detail in this comment.

I filed the issue in CTranslate2 because, from my testing, the error does not seem to come from the faster-whisper implementation. I did not examine the audio loading or feature extraction, but it seems to be an issue in the generate() function.

guillaumekln commented 1 year ago

Indeed, the issue is in CTranslate2. I have opened a pull request with the fix: https://github.com/OpenNMT/CTranslate2/pull/1081

With this change I get the expected output for the beginning of your audio file:

[0s -> 3s]  Afinal de contas, imprimir dinheiro gera ou não gera inflação?
[3s -> 5s]  É isso que a gente vai responder neste vídeo.
[5s -> 10s]  Música
[10s -> 13s]  Muito bem, todos aqueles que estão chegando agora aqui no canal, meu nome é Fernando Urch,
[13s -> 16s]  aqui a gente fala de economia, mercados e investimentos, se vocês gostarem do conteúdo,
[16s -> 21s]  considerem se inscrever, ativando o sininho aqui embaixo e também compartilhando este vídeo.
[21s -> 24s]  Pois o assunto de inflação é recorrente aqui no canal pela sua importância,
[24s -> 30s]  o impacto que tem na nossa vida financeira, profissional, na economia, na vida em sociedade.
[30s -> 34s]  E o debate em torno da relação entre impressão de moeda e inflação,
[34s -> 38s]  ele ressurge de tempos em tempos, como foi lá no início da pandemia,
[38s -> 42s]  quando muitos economistas, banqueiros centrais, políticos,
[42s -> 47s]  de vários espectros ideológicos, esquerda e direita, com raríssimas exceções,
[47s -> 50s]  afirmavam categoricamente que imprimir dinheiro
[50s -> 54s]  não geraria inflação naquele momento, naquelas circunstâncias.
[54s -> 57s]  E a verdade é que não é tão simples responder essa pergunta,
[57s -> 62s]  porque imprimir dinheiro não necessariamente vai gerar inflação,
[62s -> 65s]  depende de outros fatores, depende das circunstâncias.

I will release a new version with the fix as soon as possible.

guillaumekln commented 1 year ago

@ubanning Can you update to ctranslate2>=3.5.1 and try again?
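
To confirm which version is actually installed in the current environment, a check along these lines may help. It is a minimal sketch using only the standard library. The helper names are illustrative, and the simple numeric comparison is an assumption: it ignores pre-release suffixes and treats such versions conservatively as older.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    """Turn '3.5.1' into (3, 5, 1); stops at the first non-numeric component."""
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed, minimum):
    """Compare two dotted version strings numerically."""
    return parse_release(installed) >= parse_release(minimum)


def check_package(name, minimum):
    """Return a short status message for the named package."""
    try:
        installed = version(name)
    except PackageNotFoundError:
        return f"{name} is not installed"
    if is_at_least(installed, minimum):
        return f"{name} {installed} is new enough (>= {minimum})"
    return f"{name} {installed} is older than {minimum}; upgrade it"


# 3.5.1 is the minimum version mentioned in this thread.
print(check_package("ctranslate2", "3.5.1"))
```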

emhagman commented 1 year ago

@guillaumekln I had this issue as well. Do I need to run ct2-transformers-converter again on the whisper model or is updating ctranslate2 enough?

guillaumekln commented 1 year ago

Updating ctranslate2 is enough for this issue.

emhagman commented 1 year ago

> Updating ctranslate2 is enough for this issue.

Great, thank you!

SYSTRAN / faster-whisper

A part of the beginning of my audio was cut #3