SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Version 1.1.0 has onnxruntime thread affinity crash #1169

Open Appfinity-development opened 6 days ago

Appfinity-development commented 6 days ago

Updated from 1.0.3 to 1.1.0. Now an onnxruntime thread affinity crash occurs every time. Both versions run on an NVIDIA A40 with 4 CPU cores, 48 GB VRAM, and 16 GB RAM (on a private Replicate server), so it shouldn't be a hardware issue. Our model config:

    self.whisper_model = WhisperModel(
        "large-v2",
        device="cuda",
        compute_type="float16",
        cpu_threads=4,
        num_workers=1,
    )

    ...

    options = dict(
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=1000),
        initial_prompt=prompt,
        word_timestamps=True,
        language=language,
        log_progress=True,
        hotwords=prompt,
    )

    segments, transcript_info = self.whisper_model.transcribe(audio=audio_file, **options)

Also tried this:

import os
os.environ["ORT_DISABLE_CPU_AFFINITY"] = "1"
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["OPENBLAS_NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "4"
os.environ["VECLIB_MAXIMUM_THREADS"] = "4"
os.environ["NUMEXPR_NUM_THREADS"] = "4"

But to no avail. Any suggestions? The crash log is below.

Loading large-v2 model...
Done loading large-v2 model, took: 75.503 seconds
Starting transcribing
INFO:faster_whisper:Processing audio with duration 03:25.706
2024-11-22 19:33:53.322733977 [E:onnxruntime:Default, env.cc:234 ThreadMain] pthread_setaffinity_np failed for thread: 785, index: 1, mask: {2, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
INFO:faster_whisper:VAD filter removed 00:19.722 of audio
DEBUG:faster_whisper:VAD filter kept the following audio segments: [00:00.048 -> 01:07.440], [01:07.984 -> 03:06.576]
  0%|          | 0/185.98 [00:00<?, ?seconds/s]
DEBUG:faster_whisper:Processing segment at 00:00.000
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/cog/server/runner.py", line 417, in _handle_done
    f.result()
  File "/root/.pyenv/versions/3.10.15/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/root/.pyenv/versions/3.10.15/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
cog.server.exceptions.FatalWorkerException: Prediction failed for an unknown reason. It might have run out of memory? (exitcode -6)

The cog.yaml with dependencies looks like this:

build:
  gpu: true
  system_packages:
    - "ffmpeg"
    - "libmagic1"
  python_version: "3.10"
  python_packages:
    # Core ML packages
    - "torch==2.3.0"
    - "torchaudio==2.3.0"
    - "faster-whisper==1.1.0"
    - "pyannote-audio==3.3.1"
    - "onnxruntime"

    # API and utility packages
    - "requests==2.31.0"
    - "firebase-admin==6.4.0"
    - "google-generativeai==0.3.2"
    - "babel==2.14.0"
    - "openai==1.12.0"
    - "supabase==2.10.0"
    - "kalyke-apns==1.0.3"
    - "numpy<2.0.0"

  run:
    - "pip install --upgrade pip"
    - "echo env is ready!"

predict: "predict.py:Predictor"

I also tried removing the onnxruntime dependency and pinning it to a specific GPU version, but nothing fixes the issue. Anyone with ideas (@MahmoudAshraf97)?

If cpu is used as the device on WhisperModel, the onnxruntime error still shows in the logs, but there is no crash and transcription finishes successfully.
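
For completeness, that CPU fallback is just the same config with device="cpu" (a sketch; compute_type="int8" is my assumption here, since float16 is a GPU-oriented type):

    self.whisper_model = WhisperModel(
        "large-v2",
        device="cpu",
        compute_type="int8",  # assumption: int8 instead of float16, which targets GPU
        cpu_threads=4,
        num_workers=1,
    )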

MahmoudAshraf97 commented 6 days ago

Can you limit the number of threads here and try again? https://github.com/SYSTRAN/faster-whisper/blob/97a4785fa13d067c300f8b6e40c4381ad0381c02/faster_whisper/vad.py#L263:L264

Appfinity-development commented 6 days ago

Which API is available to set the SileroVADModel SessionOptions parameters?

Purfview commented 6 days ago

Which API is available to set the SileroVADModel SessionOptions parameters?

Just change it in vad.py to:

        opts.inter_op_num_threads = 1
        opts.intra_op_num_threads = 1

Appfinity-development commented 4 days ago

I'm running the code in a Docker environment that just pulls the faster_whisper package from PyPI, so local changes I make to the package in PyCharm won't propagate to the Replicate server. The only two options I see are monkey patching or forking the whole library, neither of which I'm keen on doing.

Or am I missing a third option?

MahmoudAshraf97 commented 4 days ago

No third option currently. I just want you to test the fix first before we actually take any steps to fix it.

Appfinity-development commented 1 day ago

Tried monkey patching; it does remove the onnxruntime error, but the OOM error persisted. It turned out that ctranslate2 4.5.0 was incompatible with Replicate's cog Docker environment. After downgrading to 4.4.0 it worked again. I kept the monkey patch, though, since it stops the logs from being polluted, and the error seems like something that should be addressed in 1.1.1.

I'm now using large-v2 with the BatchedInferencePipeline, which speeds up processing around 2x. Very nice for the same model.
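
Roughly what the batched usage looks like, going by the 1.1.0 README (batch_size=16 is just an example value):

    from faster_whisper import WhisperModel, BatchedInferencePipeline

    model = WhisperModel("large-v2", device="cuda", compute_type="float16")
    batched_model = BatchedInferencePipeline(model=model)
    # transcribe() takes the same options as the regular pipeline, plus batch_size
    segments, info = batched_model.transcribe("audio.mp3", batch_size=16)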

These are my current packages in case someone else runs into the issue:

    - "torch==2.3.0"
    - "torchaudio==2.3.0"
    - "faster-whisper==1.1.0"
    - "pyannote-audio==3.3.2"
    - "ctranslate2==4.4.0"

monkey patch:

import faster_whisper.vad
from faster_whisper.vad import SileroVADModel

# to prevent "Invalid argument. Specify the number of threads explicitly so the affinity is not set" onnxruntime error

class PatchedSileroVADModel(SileroVADModel):
    def __init__(self, encoder_path, decoder_path):
        try:
            import onnxruntime
        except ImportError as e:
            raise RuntimeError(
                "Applying the VAD filter requires the onnxruntime package"
            ) from e

        # Custom modification for SessionOptions
        opts = onnxruntime.SessionOptions()
        opts.inter_op_num_threads = 4
        opts.intra_op_num_threads = 4
        opts.log_severity_level = 3

        # Initialize sessions with modified options
        self.encoder_session = onnxruntime.InferenceSession(
            encoder_path,
            providers=["CPUExecutionProvider"],
            sess_options=opts,
        )
        self.decoder_session = onnxruntime.InferenceSession(
            decoder_path,
            providers=["CPUExecutionProvider"],
            sess_options=opts,
        )

# swap in the patched class so faster_whisper uses it when loading the VAD model
faster_whisper.vad.SileroVADModel = PatchedSileroVADModel

Purfview commented 1 day ago

I think it should be

        opts.inter_op_num_threads = 1
        opts.intra_op_num_threads = 1

MahmoudAshraf97 commented 1 day ago

I think it should be

        opts.inter_op_num_threads = 1
        opts.intra_op_num_threads = 1

The error he's mentioning only occurs when the value is 0, since that means onnxruntime must infer the actual number of threads, and it fails to do so. Any fixed number should fix the error; setting it to 1 is the safest but not the fastest.
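
In onnxruntime terms, a minimal sketch of the distinction:

    import onnxruntime

    opts = onnxruntime.SessionOptions()
    # both thread counts default to 0, which tells onnxruntime to infer the
    # count and set thread affinity itself; that affinity call is what fails
    opts.inter_op_num_threads = 1  # any explicit value avoids the affinity failure
    opts.intra_op_num_threads = 1  # 1 is the safest choice, just not the fastest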

Also, the VAD encoder now benefits from GPU acceleration if anyone needs it.