SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Better chunking/loading #968

Open KTibow opened 2 months ago

KTibow commented 2 months ago

I'd like to transcribe an 8 hour audio file. My generally capable computer literally crashes if I try to load it right now.

This could be fixed by loading bits at a time.
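For a plain WAV input, the "load bits at a time" idea can be sketched with nothing but the standard-library `wave` module (a minimal illustration, not faster-whisper's actual loader; the function name and chunk size are made up):

```python
import wave

import numpy as np


def load_wav_in_chunks(path: str, seconds_per_chunk: int = 1):
    """Yield mono float32 chunks instead of loading the whole file at once."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames_per_chunk = rate * seconds_per_chunk
        while True:
            raw = wav.readframes(frames_per_chunk)
            if not raw:
                break
            # assume 16-bit PCM; scale to [-1, 1] and downmix to mono
            samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
            yield samples.reshape(-1, channels).mean(axis=1)
```

Because each chunk is yielded and can be dropped after processing, peak memory stays at one chunk rather than the whole file.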

edit: Got it working by streaming the file into a tensor chunk by chunk with custom code.

```python
from torchaudio.io import StreamReader
from tqdm import tqdm
import torch


def decode_audio(
    input_file: str,
    sampling_rate: int = 16000,
    max_seconds: int = 30 * 60,  # how many seconds to load at most
) -> torch.Tensor:
    """Stream an audio file into a mono waveform tensor, one second at a time."""
    stream = StreamReader(input_file)

    if stream.num_src_streams == 0:
        raise RuntimeError("No audio stream found in the input.")

    # One chunk per second of audio at the target sample rate
    stream.add_basic_audio_stream(
        frames_per_chunk=sampling_rate,
        sample_rate=sampling_rate,
    )

    # Stream the audio, downmixing each (frames, channels) chunk to mono
    chunks = []
    for (chunk,) in tqdm(stream.stream(), total=max_seconds):
        chunks.append(chunk.mean(1))
        if len(chunks) >= max_seconds:
            break

    # Concatenate the chunks into a single waveform
    return torch.cat(chunks, 0)
```
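Since the loop simply appends consecutive chunks and concatenates them at the end, no samples are lost or reordered. A quick sanity check of that pattern on synthetic data (numpy used here in place of torch for brevity):

```python
import numpy as np

sampling_rate = 16000
full = np.arange(10 * sampling_rate, dtype=np.float32)  # 10 s of synthetic mono "audio"
# 1-second chunks, mirroring frames_per_chunk=sampling_rate above
chunks = np.split(full, 10)
waveform = np.concatenate(chunks, 0)
assert np.array_equal(waveform, full)
```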
MahmoudAshraf97 commented 2 months ago

Can you upload the file so we can reproduce this? All steps are already done in chunks, so which step exactly crashes?

KTibow commented 2 months ago

Turns out this was actually a problem with torchaudio. If I find a way to load it that doesn't crash, I'll send in a PR.

MahmoudAshraf97 commented 2 months ago

No need, we're already replacing torchaudio.