Closed ubanning closed 1 year ago
Hi,
Thank you for sharing the input audio file. I can reproduce the output and will try to understand why there is a difference.
@ItakeLs It looks like you are facing the same issue?
And another question, is it possible to get the return as an srt or vtt file, like the standard Whisper?
Yes, you can just copy the relevant functions from the original Whisper implementation. For example:
def format_timestamp(seconds, always_include_hours=False, decimal_marker="."):
    assert seconds >= 0, "non-negative timestamp expected"
    milliseconds = round(seconds * 1000.0)

    # Split the total milliseconds into hours, minutes, seconds, and remainder.
    hours = milliseconds // 3_600_000
    milliseconds -= hours * 3_600_000
    minutes = milliseconds // 60_000
    milliseconds -= minutes * 60_000
    seconds = milliseconds // 1_000
    milliseconds -= seconds * 1_000

    hours_marker = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return f"{hours_marker}{minutes:02d}:{seconds:02d}{decimal_marker}{milliseconds:03d}"


def write_srt(file, segments):
    for i, segment in enumerate(segments, start=1):
        # SRT always includes hours and uses a comma as the decimal marker.
        start_time = format_timestamp(segment.start, always_include_hours=True, decimal_marker=",")
        end_time = format_timestamp(segment.end, always_include_hours=True, decimal_marker=",")
        file.write("%d\n" % i)
        file.write("%s --> %s\n" % (start_time, end_time))
        file.write(segment.text.strip().replace("-->", "->"))
        file.write("\n\n")


# `model` is the WhisperModel instance you already created for transcription.
segments, _ = model.transcribe("jota.mp3", beam_size=5)

# Write as UTF-8 so accented characters are saved correctly on every platform.
with open("audio.srt", "w", encoding="utf-8") as srt_file:
    write_srt(srt_file, segments)
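The question also mentioned VTT. Here is a minimal sketch of a WebVTT writer in the same style, modeled on the `write_vtt` helper from the original Whisper implementation. It assumes each segment exposes `start`, `end`, and `text` attributes, as in the `write_srt` function above. The `Segment` namedtuple and the `demo_segments` list are hypothetical stand-ins, used only so the example runs without a model:

```python
import io
from collections import namedtuple


def format_timestamp(seconds, always_include_hours=False, decimal_marker="."):
    assert seconds >= 0, "non-negative timestamp expected"
    milliseconds = round(seconds * 1000.0)

    hours = milliseconds // 3_600_000
    milliseconds -= hours * 3_600_000
    minutes = milliseconds // 60_000
    milliseconds -= minutes * 60_000
    seconds = milliseconds // 1_000
    milliseconds -= seconds * 1_000

    hours_marker = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return f"{hours_marker}{minutes:02d}:{seconds:02d}{decimal_marker}{milliseconds:03d}"


def write_vtt(file, segments):
    # A WebVTT file starts with a "WEBVTT" header followed by a blank line.
    file.write("WEBVTT\n\n")
    for segment in segments:
        # VTT uses "." as the decimal marker and the hours field is optional.
        start_time = format_timestamp(segment.start)
        end_time = format_timestamp(segment.end)
        file.write("%s --> %s\n" % (start_time, end_time))
        file.write(segment.text.strip().replace("-->", "->"))
        file.write("\n\n")


# Hypothetical stand-in for the segment objects returned by model.transcribe().
Segment = namedtuple("Segment", ["start", "end", "text"])

demo_segments = [
    Segment(0.0, 3.0, " Afinal de contas, imprimir dinheiro gera ou não gera inflação?"),
    Segment(3.0, 5.0, " É isso que a gente vai responder neste vídeo."),
]

buffer = io.StringIO()
write_vtt(buffer, demo_segments)
vtt_text = buffer.getvalue()
print(vtt_text)
```

With a real transcription, you would pass the `segments` returned by `model.transcribe(...)` and a file opened with `open("audio.vtt", "w", encoding="utf-8")` instead of the in-memory buffer.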
Yes, this is the same issue I am seeing. I made a Colab notebook to reproduce the error and described it in further detail in this comment.
I filed the issue against CTranslate2 because, from my testing, the error does not seem to come from the faster-whisper implementation. I did not examine the audio loading or feature extraction, but it seems to be an issue in the generate() function.
Indeed the issue is in CTranslate2. I have opened a merge request with the fix: https://github.com/OpenNMT/CTranslate2/pull/1081
With this change I get the expected output on the beginning of your audio file:
[0s -> 3s] Afinal de contas, imprimir dinheiro gera ou não gera inflação?
[3s -> 5s] É isso que a gente vai responder neste vídeo.
[5s -> 10s] Música
[10s -> 13s] Muito bem, todos aqueles que estão chegando agora aqui no canal, meu nome é Fernando Urch,
[13s -> 16s] aqui a gente fala de economia, mercados e investimentos, se vocês gostarem do conteúdo,
[16s -> 21s] considerem se inscrever, ativando o sininho aqui embaixo e também compartilhando este vídeo.
[21s -> 24s] Pois o assunto de inflação é recorrente aqui no canal pela sua importância,
[24s -> 30s] o impacto que tem na nossa vida financeira, profissional, na economia, na vida em sociedade.
[30s -> 34s] E o debate em torno da relação entre impressão de moeda e inflação,
[34s -> 38s] ele ressurge de tempos em tempos, como foi lá no início da pandemia,
[38s -> 42s] quando muitos economistas, banqueiros centrais, políticos,
[42s -> 47s] de vários espectros ideológicos, esquerda e direita, com raríssimas exceções,
[47s -> 50s] afirmavam categoricamente que imprimir dinheiro
[50s -> 54s] não geraria inflação naquele momento, naquelas circunstâncias.
[54s -> 57s] E a verdade é que não é tão simples responder essa pergunta,
[57s -> 62s] porque imprimir dinheiro não necessariamente vai gerar inflação,
[62s -> 65s] depende de outros fatores, depende das circunstâncias.
I will release a new version with the fix as soon as possible.
@ubanning Can you update to ctranslate2>=3.5.1 and try again?
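To confirm which version is installed before retrying, something like the following could be used. This is only a sketch: `parse_release` and `has_fix` are hypothetical helper names, and the simple numeric comparison assumes ordinary release strings such as `3.5.1`. It is not an official CTranslate2 utility.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    # Keep only the leading numeric components, e.g. "3.5.1" -> (3, 5, 1).
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_fix(version_string, minimum=(3, 5, 1)):
    # The fix discussed in this thread is available from ctranslate2 3.5.1.
    return parse_release(version_string) >= minimum


try:
    installed = version("ctranslate2")
except PackageNotFoundError:
    print("ctranslate2 is not installed")
else:
    if has_fix(installed):
        print(f"ctranslate2 {installed} includes the fix")
    else:
        print(f"ctranslate2 {installed} is older than 3.5.1; upgrade needed")
```

If the check reports an older version, upgrading with `pip install --upgrade "ctranslate2>=3.5.1"` should be sufficient.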
@guillaumekln I had this issue as well. Do I need to run ct2-transformers-converter again on the Whisper model, or is updating ctranslate2 enough?
Updating ctranslate2 is enough for this issue.
Great, thank you!
Hello, first of all, thank you very much for your work on this project. It really is much faster and uses less RAM and VRAM. While testing, I found that a significant part of my audio was cut from the transcription. My audio is in Portuguese and is 13 minutes long, and the problem apparently occurs only at the beginning. Is there a way to solve this? I used the following code:
The result I got running the standard version of Whisper on the same medium model:
The result I got running this faster version of Whisper:
As you can see, a part at the beginning of my audio was cut off. If you want to test with my audio to see whether you get the same results: https://www.dropbox.com/s/m0q30hmzbx6mvt2/jota.mp3?dl=1
And another question, is it possible to get the return as an srt or vtt file, like the standard Whisper? Thank you very much.