McCloudS / subgen

Autogenerate subtitles using OpenAI Whisper Model via Jellyfin, Plex, Emby, Tautulli, or Bazarr
MIT License

Replicate Bazarr Whisper Functionality/Webhooks #16

Closed · McCloudS closed 10 months ago

McCloudS commented 10 months ago

I've attempted to replicate the existing capability from

https://wiki.bazarr.media/Additional-Configuration/Whisper-Provider/

which uses the code from

https://github.com/ahmetoner/whisper-asr-webservice/blob/51c6eceda0836d145048224693c69c2706d78f46/app/webservice.py#L61-L78

The code block called from Bazarr is:

https://github.com/morpheus65535/bazarr/blob/a09cc34e09407b8a2338d1034de7f8ff8fc91b19/libs/subliminal_patch/providers/whisperai.py#L284-L295

The code block I came up with:

@app.route("/asr", methods=["POST"])
def asr():
    logging.debug("This hook is from asr webhook!")
    logging.debug("Headers: %s", request.headers)
    logging.debug("Raw response: %s", request.data)
    task = request.args.get("task", default="transcribe")
    language = request.args.get("language")
    initial_prompt = request.args.get("initial_prompt")
    encode = request.args.get("encode", type=bool, default=True)
    output = request.args.get("output", default="txt")
    word_timestamps = request.args.get("word_timestamps", type=bool, default=False)
    audio_file = request.files.get("audio_file")

    if audio_file:
        filename = audio_file.filename
        print(f"Transcribing file: {filename}")
        start_time = time.time()
        result = model.transcribe_stable(audio_file)
        result.to_srt_vtt("/tmp/" + filename + ".srt", word_level=word_level_highlight)
        elapsed_time = time.time() - start_time
        minutes, seconds = divmod(int(elapsed_time), 60)
        print(f"Transcription of {filename} is completed, it took {minutes} minutes and {seconds} seconds to complete.")

        # Return the generated subtitle file as a download
        return send_file(
            "/tmp/" + filename + ".srt",
            as_attachment=True,
            download_name=f"{filename}.{output}",
            mimetype="text/plain",
            etag=False,
            conditional=True,
        )

    return "Audio file not provided."

In the Bazarr logs I'm getting:

2023-10-24 22:22:49,787 - retry.api (14dee3942b38) : WARNING (api:40) - HTTPConnectionPool(host='192.168.111', port=8090): Max retries exceeded with url: /asr?task=transcribe&language=en&output=srt&encode=false (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x14dee0988610>, 'Connection to 192.168.111 timed out. (connect timeout=9000)')), retrying in 5 seconds...

I'm at a loss. I can't even get a webhook response from Bazarr when trying to manually generate a subtitle using Whisper as the provider. The machines can ping each other, so I'm sure I'm missing something simple.
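
For reference, the request Bazarr is attempting (per the query string in that log line) can be reproduced manually with something along these lines; the host, port, and sample file are placeholders, not the real values:

# Hypothetical manual test that mimics Bazarr's call to the /asr endpoint.
# Host/port and the sample file path are placeholders.
import requests

params = {"task": "transcribe", "language": "en", "output": "srt", "encode": "false"}
with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:8090/asr",  # placeholder host; port taken from the log above
        params=params,
        files={"audio_file": f},      # multipart field name the endpoint expects
        timeout=600,
    )

print(resp.status_code)
with open("sample.srt", "wb") as out:
    out.write(resp.content)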

dcava commented 10 months ago

Haven't looked at the API in detail, but the host address looks wrong: host='192.168.111' is only three octets.

McCloudS commented 10 months ago

Hah, yeah, thanks. I ended up finding that this morning after looking at it with fresh eyes. I'm moving to FastAPI to mimic the whisper-asr container, which should make it a semi drop-in replacement.

McCloudS commented 10 months ago

Just running into an issue where Whisper doesn't like the SpooledTemporaryFile (audio_file) that Bazarr returns. I'm not sure how whisper-asr-webservice is able to pass it directly into the transcribe function.

@app.post("/asr")
async def asr(
        task: Union[str, None] = Query(default="transcribe", enum=["transcribe", "translate"]),
        language: Union[str, None] = Query(default=None),
        initial_prompt: Union[str, None] = Query(default=None),
        audio_file: UploadFile = File(...),
        encode: bool = Query(default=True, description="Encode audio first through ffmpeg"),
        output: Union[str, None] = Query(default="srt", enum=["txt", "vtt", "srt", "tsv", "json"]),
        word_timestamps: bool = Query(default=False, description="Word level timestamps")
):
    logging.debug("This hook is from Bazarr/ASR webhook!")
    global model
    try:
        print(f"Transcribing file from Bazarr")
        start_time = time.time()
        if model is None:
            logging.debug("Model was purged, need to re-create")
            model = stable_whisper.load_faster_whisper(whisper_model, download_root=model_location, device=transcribe_device, cpu_threads=whisper_threads, num_workers=concurrent_transcriptions)
        # This is where it falls over: stable-ts/faster-whisper won't take the
        # SpooledTemporaryFile (or its raw bytes) from Bazarr's upload directly.
        tempfile = io.BytesIO(audio_file.file.read())  # buffered copy of the upload (currently unused)
        audio_file.file.seek(0)
        result = model.transcribe_stable(audio_file.file.read(), task=transcribe_or_translate)
        result.to_srt_vtt(f"/tmp/{audio_file.filename}.{output}", word_level=word_level_highlight)
        elapsed_time = time.time() - start_time
        minutes, seconds = divmod(int(elapsed_time), 60)
        print(f" {audio_file.filename} is completed, it took {minutes} minutes and {seconds} seconds to complete.")
    except Exception as e:
        print(f"Error processing or transcribing {audio_file.filename}: {e}")
    finally:
        if len(files_to_transcribe) == 0:
            logging.debug("Queue is empty, clearing/releasing VRAM")
            model = None  # reset the global (rather than del) so the "model is None" check above still works
            gc.collect()

    return StreamingResponse(
        io.BytesIO(open(f"/tmp/{audio_file.filename}.{output}", "rb").read()),
        media_type="text/plain",
        headers={
            #'Asr-Engine': ASR_ENGINE,
            'Content-Disposition': f'attachment; filename="/tmp/{audio_file.filename}.{output}"'
        })

dcava commented 10 months ago

It looks like whisper-asr does some pre-packaging/processing for audio:

def load_audio(file: BinaryIO, encode=True, sr: int = SAMPLE_RATE):
    """
    Open an audio file object and read as mono waveform, resampling as necessary.
    Modified from https://github.com/openai/whisper/blob/main/whisper/audio.py to accept a file object
    Parameters
    ----------
    file: BinaryIO
        The audio file like object
    encode: Boolean
        If true, encode audio stream to WAV before sending to whisper
    sr: int
        The sample rate to resample the audio if necessary
    Returns
    -------
    A NumPy array containing the audio waveform, in float32 dtype.
    """

and then passes the numpy object around:

result = transcribe(load_audio(audio_file.file, encode), task, language, initial_prompt, word_timestamps, output)

Whereas you seem to be passing the file path and letting stable-ts do the work. Not sure if that makes any difference!

McCloudS commented 10 months ago

Thanks. Even replicating load_audio, I get similar issues. What's puzzling to me is that faster-whisper (and stable-ts) use PyAV to ingest anything and resample as necessary. Even when I attempt to convert audio_file to bytes (which stable-ts and faster-whisper will accept as an argument), I still get odd errors.

@app.post("/asr")
async def asr(
        task: Union[str, None] = Query(default="transcribe", enum=["transcribe", "translate"]),
        language: Union[str, None] = Query(default=None),
        initial_prompt: Union[str, None] = Query(default=None),
        audio_file: UploadFile = File(...),
        encode: bool = Query(default=True, description="Encode audio first through ffmpeg"),
        output: Union[str, None] = Query(default="srt", enum=["txt", "vtt", "srt", "tsv", "json"]),
        word_timestamps: bool = Query(default=False, description="Word level timestamps")
):
    logging.debug("This hook is from Bazarr/ASR webhook!")
    try:
        print(f"Transcribing file from Bazarr")
        start_time = time.time()
        logging.debug("Model was purged, need to re-create")
        model = stable_whisper.load_faster_whisper(whisper_model, download_root=model_location, device=transcribe_device, cpu_threads=whisper_threads, num_workers=concurrent_transcriptions)
        print(f"File type is: {type(audio_file)}")
        print(f"File name is: {audio_file.filename}")
        print("File content type is: {audio_file.content_type}")
        import shutil
        audio_file.file.seek(0)
        with open("./audio_file.wav", "wb") as new_file:
            shutil.copyfileobj(load_audio(audio_file.file, encode), new_file)
        audio_file.file.seek(0)
        result = model.transcribe_stable(load_audio(audio_file.file, encode), task=transcribe_or_translate)
        result.to_srt_vtt(f"/tmp/{audio_file.filename}.{output}", word_level=word_level_highlight)
        elapsed_time = time.time() - start_time
        minutes, seconds = divmod(int(elapsed_time), 60)
        print(f" {audio_file.filename} is completed, it took {minutes} minutes and {seconds} seconds to complete.")
    except Exception as e:
        print(f"Error processing or transcribing {audio_file.filename}: {e}")

    return StreamingResponse(
        io.BytesIO(open(f"/tmp/{audio_file.filename}.{output}", "rb").read()),
        media_type="text/plain",
        headers={
            #'Asr-Engine': ASR_ENGINE,
            'Content-Disposition': f'attachment; filename="/tmp/{audio_file.filename}.{output}"'
        })

def load_audio(file: BinaryIO, encode=True, sr: int = 16000):
    """
    Open an audio file object and read as mono waveform, resampling as necessary.
    Modified from https://github.com/openai/whisper/blob/main/whisper/audio.py to accept a file object
    Parameters
    ----------
    file: BinaryIO
        The audio file like object
    encode: Boolean
        If true, encode audio stream to WAV before sending to whisper
    sr: int
        The sample rate to resample the audio if necessary
    Returns
    -------
    A NumPy array containing the audio waveform, in float32 dtype.
    """
    if encode:
        try:
            # This launches a subprocess to decode audio while down-mixing and resampling as necessary.
            # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed.
            out, _ = (
                ffmpeg.input("pipe:", threads=0)
                .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
                .run(cmd="ffmpeg", capture_stdout=True, capture_stderr=True, input=file.read())
            )
        except ffmpeg.Error as e:
            raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e
    else:
        out = file.read()

    return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0

Gives:

File type is: <class 'starlette.datastructures.UploadFile'>
File name is: audio_file
File content type is: {audio_file.content_type}
Error processing or transcribing audio_file: [input_sr] is required when [audio] is a PyTorch tensor or NumPy array.

McCloudS commented 10 months ago

Or even trying:

        print(f"File content type convert is: {type(audio_file.file.read())}")
        result = model.transcribe_stable(audio_file.file.read(), task=transcribe_or_translate)

Gives:

File type is: <class 'starlette.datastructures.UploadFile'>
File name is: audio_file
File content type is: None
File content type convert is: <class 'bytes'>
Error processing or transcribing audio_file: [Errno 1094995529] Invalid data found when processing input: '<none>'

McCloudS commented 10 months ago

Getting peculiar data. When Bazarr sends the audio, it appears incomplete. When I ffprobe any file I get from it, it returns "Invalid data found when processing input". When I send my own audio via Postman or Python, it comes through fine and transcribes... I have it dumping the received file in the script folder to look at.

@app.post("/asr")
async def asr(
        task: Union[str, None] = Query(default="transcribe", enum=["transcribe", "translate"]),
        language: Union[str, None] = Query(default=None),
        initial_prompt: Union[str, None] = Query(default=None),
        audio_file: UploadFile = File(...),
        encode: bool = Query(default=True, description="Encode audio first through ffmpeg"),
        output: Union[str, None] = Query(default="srt", enum=["txt", "vtt", "srt", "tsv", "json"]),
        word_timestamps: bool = Query(default=False, description="Word level timestamps")
):
    logging.debug("This hook is from Bazarr/ASR webhook!")
    try:
        print(f"Transcribing file from Bazarr")
        start_time = time.time()
        logging.debug("Model was purged, need to re-create")
        model = stable_whisper.load_faster_whisper(whisper_model, download_root=model_location, device=transcribe_device, cpu_threads=whisper_threads, num_workers=concurrent_transcriptions)
        print(f"File type is: {type(audio_file)}")
        print(f"File name is: {audio_file.filename}")
        print(f"File content type is: {type(audio_file.content_type)}")
        print(f"File content type convert is: {type(audio_file.file.read())}")
        import shutil
        audio_file.file.seek(0)
        with open("./audio_file.wav", "wb") as new_file:
            shutil.copyfileobj(audio_file.file, new_file)
        audio_file.file.seek(0)
        # Still fails here: the raw bytes from Bazarr's upload give "Invalid data found when processing input"
        result = model.transcribe_stable(audio_file.file.read(), task=transcribe_or_translate)
        result.to_srt_vtt(f"/tmp/{audio_file.filename}.{output}", word_level=word_level_highlight)
        elapsed_time = time.time() - start_time
        minutes, seconds = divmod(int(elapsed_time), 60)
        print(f" {audio_file.filename} is completed, it took {minutes} minutes and {seconds} seconds to complete.")
    except Exception as e:
        print(f"Error processing or transcribing {audio_file.filename}: {e}")

    return StreamingResponse(
        io.BytesIO(open(f"/tmp/{audio_file.filename}.{output}", "rb").read()),
        media_type="text/plain",
        headers={
            #'Asr-Engine': ASR_ENGINE,
            'Content-Disposition': f'attachment; filename="/tmp/{audio_file.filename}.{output}"'
        })
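
For reference, a quick way to run the ffprobe check mentioned above against the file the handler dumps (assuming the ffprobe binary is on PATH):

# Probe the dumped upload; a healthy file prints stream info, while the uploads
# coming from Bazarr here return "Invalid data found when processing input".
import subprocess

subprocess.run(["ffprobe", "-hide_banner", "./audio_file.wav"], check=False)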

McCloudS commented 10 months ago

I'm at a standstill if anyone wants to take a crack at it. You'll have to import a couple more libs to run what I have above, but it should be obvious when you run it.

McCloudS commented 10 months ago

After a bit of hair-pulling, I finally traced the error to stable-ts expecting NumPy arrays to come with their sample rate for the model: input_sr=16000. Should have a release up in a bit.
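
A minimal sketch of the resulting call, reusing the load_audio helper quoted earlier (which returns 16 kHz mono float32) and the globals from the snippets above; it assumes transcribe_stable passes input_sr through, as the error message suggests, and is illustrative rather than the exact subgen code:

# Sketch of the fix: decode the upload to a NumPy array with load_audio(), then
# tell stable-ts the array's sample rate via input_sr.
audio = load_audio(audio_file.file, encode=True, sr=16000)

result = model.transcribe_stable(
    audio,
    input_sr=16000,                # required when passing a NumPy array or tensor
    task=transcribe_or_translate,  # "transcribe" or "translate", as in the snippets above
)
result.to_srt_vtt(f"/tmp/{audio_file.filename}.srt", word_level=word_level_highlight)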

McCloudS commented 10 months ago

Done. See https://github.com/McCloudS/subgen/edit/main/README.md#bazarr

Tried a couple runs and it seems to work, let me know. I'll leave this open for comments.

dcava commented 10 months ago

Nice job. Works for me - apart from trying to melt my Synology ds920+ CPU!

Using the medium.en model, it took about 14 minutes to transcribe a 7-minute short film.

Thanks again.