SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

distil-large-v3 is fast but exports wrong language / large-v3 is slow #874

Closed: Marcophono2 closed this issue 3 months ago

Marcophono2 commented 3 months ago

Hello!

I am impressed by the roughly 400 tokens/s I get with the distil-large-v3 model. Unfortunately, it outputs an English translation instead of the German original. The info object tells me that German was detected with 100% probability. The model obviously understands the German audio, because the English translation is absolutely correct.

Using the standard OpenAI large-v3 model is as slow as whisperX, at around 180 tokens/s. Any suggestions?

Ubuntu 23.04, RTX 4090

from faster_whisper import WhisperModel, BatchedInferencePipeline

device = "cuda:0"
compute_type = "float16"

model = WhisperModel("distil-large-v3", device=device, compute_type=compute_type)

batched_model = BatchedInferencePipeline(model=model)

# Note: this calls transcribe() on the underlying WhisperModel
# (batched_model.model), so the batched pipeline is not actually used here.
segments, info = batched_model.model.transcribe(
    "/home/marc/Desktop/AI/whatsapp/audios/491754572379.mp3",
    beam_size=1,
    language="de",
    condition_on_previous_text=False,
)

# Collect the full transcript into one string.
alles = ""
for segment in segments:
    # print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
    alles = alles + " " + segment.text

print(alles)
trungkienbkhn commented 3 months ago

@Marcophono2, unfortunately the distil-large-v3 model currently only supports English.
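
If you need the German text, you have to stay with a multilingual checkpoint. A minimal sketch, reusing the settings from your snippet (the audio path is a placeholder):

from faster_whisper import WhisperModel

# distil-large-v3 is English-only: even with language="de" it emits English text.
# A multilingual checkpoint such as large-v3 keeps the German output.
model = WhisperModel("large-v3", device="cuda:0", compute_type="float16")

segments, info = model.transcribe("audio.mp3", beam_size=1, language="de")
print(info.language, info.language_probability)
for segment in segments:
    print(segment.text)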

Marcophono2 commented 3 months ago

Thank you, @trungkienbkhn. Still strange, though: it can hear German input but cannot speak German. :-) Anyway, do you have a hint why the standard large-v3 model has no performance advantage over whisperX?

trungkienbkhn commented 3 months ago

From the distil-whisper model docs:

Note: Distil-Whisper is currently only available for English speech recognition. We are working with the community to distill Whisper on other languages. If you are interested in distilling Whisper in your language, check out the provided training code. We will soon update the repository with multilingual checkpoints when ready!

You can refer to this PR for FW acceleration and further performance improvements.
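
Note that your snippet calls batched_model.model.transcribe(), which goes through the plain WhisperModel and skips the batching entirely. A rough sketch of what the batched call could look like (the batch_size parameter follows that PR's API and should be tuned for your GPU):

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3", device="cuda:0", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# Call transcribe() on the pipeline itself so the batched path is used.
segments, info = batched_model.transcribe(
    "audio.mp3",
    batch_size=16,  # assumed from the batching PR; adjust to your VRAM
    language="de",
)
for segment in segments:
    print(segment.text)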

Marcophono2 commented 3 months ago

I already tried it, @trungkienbkhn, but without effect. Meanwhile I found a model that outputs German audio input as German as well, at 530 tokens/second: https://huggingface.co/primeline/distil-whisper-large-v3-german
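
In case it helps someone else: that checkpoint is published in Transformers format, so as far as I can tell it first has to be converted to CTranslate2 format before faster-whisper can load it. A sketch, assuming the stock converter that ships with CTranslate2 (the output directory name is just an example):

# One-time conversion, e.g.:
#   ct2-transformers-converter --model primeline/distil-whisper-large-v3-german \
#       --output_dir distil-whisper-large-v3-german-ct2 --quantization float16
from faster_whisper import WhisperModel

model = WhisperModel(
    "distil-whisper-large-v3-german-ct2",  # local converted directory (example name)
    device="cuda:0",
    compute_type="float16",
)

segments, info = model.transcribe("audio.mp3", beam_size=1, language="de")
print(" ".join(segment.text for segment in segments))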

Thank you for your support! I will keep an eye on this repo.