SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2

Faster Whisper holding memory and not releasing it, killing the Flask server #660

Open naveen-medloop opened 7 months ago

naveen-medloop commented 7 months ago

Issue quick look

I am running faster-whisper in a Flask app to use it for real-time transcription. It keeps occupying memory without releasing it, and eventually the app gets killed. I tried clearing the Torch cache and running the garbage collector, but nothing worked.

Detailed info

Hi, I am running faster-whisper in a Flask app to implement real-time transcription in my application.

I am running the app on the Deep Learning OSS Nvidia Driver AMI GPU PyTorch 1.13.1 (Amazon Linux 2).

NVIDIA driver version: 535.129.03, CUDA version: 11.7, Python 3.10.2

Here is my code:

import tempfile
import os
import gc
from faster_whisper import WhisperModel
#import torch

def transcribe_audio(audio_file):
    audio_file_path = None
    try:
        # Read the contents of the FileStorage object
        audio_data = audio_file.read()

        # Store it in a temporary file
        with tempfile.NamedTemporaryFile(delete=False, suffix=".mp3") as temp_audio_file:
            temp_audio_file.write(audio_data)
            audio_file_path = temp_audio_file.name

        # NOTE: the model is loaded on every call to this function
        # For better performance we can use device="cuda" and compute_type, ref: https://opennmt.net/CTranslate2/quantization.html
        model = WhisperModel("large-v3", device="cuda", compute_type="int8", cpu_threads=8)
        segments, info = model.transcribe(audio=audio_file_path, beam_size=5)

        print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

        output_data = []
        for segment in segments:
            output_entry = {
                'id': segment.id,
                'start': f"{int(segment.start // 3600):02d}:{int((segment.start % 3600) // 60):02d}:{int(segment.start % 60):02d}",
                'end': f"{int(segment.end // 3600):02d}:{int((segment.end % 3600) // 60):02d}:{int(segment.end % 60):02d}",
                'text': segment.text
            }
            output_data.append(output_entry)
        #torch.cuda.empty_cache()
        gc.collect()
        return output_data
    except Exception as ex:
        print("Exception occurred at transcribe_audio", ex)
    finally:
        # Remove the temporary audio file (guarded so cleanup cannot raise a NameError
        # if the exception happened before the temp file was created)
        if audio_file_path is not None and os.path.exists(audio_file_path):
            os.remove(audio_file_path)

When running the server, each call occupies some memory and never releases it, and eventually the app gets killed. Even though I have placed try/except blocks, no exceptions are printed. The entire memory is only released when the app is killed.

server logs

[2024-01-28 18:39:45.677] [ctranslate2] [thread 17471] [info] Loaded model /home/ec2-user/.cache/huggingface/hub/models--Systran--faster-whisper-large-v3/snapshots/edaa852ec7e145841d8ffdb056a99866b5f0a478 on device cuda:0
[2024-01-28 18:39:45.677] [ctranslate2] [thread 17471] [info]  - Binary version: 6
[2024-01-28 18:39:45.677] [ctranslate2] [thread 17471] [info]  - Model specification revision: 3
[2024-01-28 18:39:45.677] [ctranslate2] [thread 17471] [info]  - Selected compute type: int8_float16
INFO:faster_whisper:Processing audio with duration 00:00.960
INFO:faster_whisper:Detected language 'en' with probability 0.99
Detected language 'en' with probability 0.988770
INFO:werkzeug:49.43.234.138 - - [28/Jan/2024 18:39:46] "POST /api/transcribe HTTP/1.1" 200 -
INFO:root:/api/transcribe called
Data received <class 'werkzeug.datastructures.file_storage.FileStorage'>
[2024-01-28 18:39:51.312] [ctranslate2] [thread 17535] [info] Loaded model /home/ec2-user/.cache/huggingface/hub/models--Systran--faster-whisper-large-v3/snapshots/edaa852ec7e145841d8ffdb056a99866b5f0a478 on device cuda:0
[2024-01-28 18:39:51.313] [ctranslate2] [thread 17535] [info]  - Binary version: 6
[2024-01-28 18:39:51.313] [ctranslate2] [thread 17535] [info]  - Model specification revision: 3
[2024-01-28 18:39:51.313] [ctranslate2] [thread 17535] [info]  - Selected compute type: int8_float16
ERROR:libav.matroska,webm:Found unknown-length element with ID 0x18538067 at pos. 0x3ec1 for which no syntax for parsing is available.
INFO:faster_whisper:Processing audio with duration 00:01.920
INFO:faster_whisper:Detected language 'en' with probability 0.99
Detected language 'en' with probability 0.992188
INFO:werkzeug:49.43.234.138 - - [28/Jan/2024 18:39:52] "POST /api/transcribe HTTP/1.1" 200 -
INFO:root:/api/transcribe called
Data received <class 'werkzeug.datastructures.file_storage.FileStorage'>
Killed

Command to run the server: CT2_VERBOSE=1 python3.10 -m flask run --host=0.0.0.0 -p 3000

I am attaching screenshots of server memory usage before and after the crash; the entire memory is released only after the app crashes.

Video: https://drive.google.com/file/d/1SfxOLEnwCVH9AzRS_mtg1V8bmpiZdTfI/view

I have tried the following solutions, but none of them worked:

  1. Clearing the Torch cache.
  2. Running the garbage collector after every call.
  3. Running faster-whisper in a child process so that killing the process releases the memory. However, this approach raises errors, so I am unable to implement it (a rough sketch of the idea is shown right after this list).
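
For reference, a minimal sketch of the child-process idea from item 3 might look like this (function names and queue wiring are illustrative, not my actual code); CUDA usually requires the "spawn" start method, and using "fork" with CUDA is a common source of errors:

import multiprocessing as mp

def _transcribe_worker(audio_path, result_queue):
    # Import and load inside the child so all GPU memory belongs to this process
    from faster_whisper import WhisperModel
    model = WhisperModel("large-v3", device="cuda", compute_type="int8")
    segments, _info = model.transcribe(audio_path, beam_size=5)
    result_queue.put([segment.text for segment in segments])

def transcribe_in_child(audio_path):
    # "spawn" avoids reusing the parent's CUDA context in the child process
    ctx = mp.get_context("spawn")
    result_queue = ctx.Queue()
    worker = ctx.Process(target=_transcribe_worker, args=(audio_path, result_queue))
    worker.start()
    result = result_queue.get()   # read before join() so a full queue cannot block the child
    worker.join()                 # once the child exits, its memory is returned to the OS
    return result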

Please suggest a way to fix this issue.

Sharrnah commented 7 months ago

Same issue here.

I compared it with the official Whisper implementation.

Though I'm not sure if it's related to faster-whisper or if it's something directly in CTranslate2.

BBC-Esq commented 7 months ago

I am running faster-whisper in a Flask app to use it for real-time transcription. It keeps occupying memory without releasing it, and eventually the app gets killed. I tried clearing the Torch cache and running the garbage collector, but nothing worked.

. . .

I don't know if this helps. I'm not an experienced programmer... but here are a few suggestions that ChatGPT gave:

(1)

Repeated Model Loading: Each transcription request seems to load the model anew, as indicated by the logs: [info] Loaded model ... on device cuda:0 This repeated loading can significantly consume memory, especially on a GPU. Consider loading the model once at the start of your application and then reusing it for each transcription request.

. . .

(2)

CUDA Memory Management: CUDA memory can be fragmented over time, especially with operations that allocate and deallocate memory repeatedly. While not directly shown in your script, if there are other parts of your application that manipulate CUDA memory, consider reviewing them as well.

This is just a suggestion, and I haven't tried it myself...

(3)

Model Initialization: Move the model initialization (WhisperModel("large-v3", device="cuda", compute_type="int8", cpu_threads=8)) outside the request handling function, so it is done once. This can prevent the repeated loading and initialization overhead for each request.
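
As a concrete illustration of suggestion (3), here is a minimal sketch of loading the model once at module scope in a Flask app. The form field name and response shape are my assumptions, not taken from your code; only the /api/transcribe route comes from your logs.

from flask import Flask, request, jsonify
from faster_whisper import WhisperModel

app = Flask(__name__)

# Loaded once at startup and reused for every request
model = WhisperModel("large-v3", device="cuda", compute_type="int8", cpu_threads=8)

@app.route("/api/transcribe", methods=["POST"])
def transcribe():
    audio_file = request.files["audio"]   # assumed form field name
    # faster-whisper accepts a path or a file-like object, so the uploaded
    # FileStorage can be passed directly; a temp file would also work.
    segments, info = model.transcribe(audio_file, beam_size=5)
    return jsonify([
        {"start": segment.start, "end": segment.end, "text": segment.text}
        for segment in segments
    ])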

In any case, I can provide you with a function that I put in a utilities script and call throughout my various scripts whenever I sense a memory issue:

import pynvml

def print_cuda_memory_usage():
    '''
    from utilities import print_cuda_memory_usage
    print_cuda_memory_usage()
    '''
    try:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)

        memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"Memory Total: {memory_info.total / 1024**2} MB") 
        print(f"Memory Used: {memory_info.used / 1024**2} MB")
        print(f"Memory Free: {memory_info.free / 1024**2} MB")

    except Exception as e:
        print(f"An error occurred: {e}")

    finally:
        pynvml.nvmlShutdown()

This is the version of the library that I'm using:

pip install nvidia-ml-py==12.535.108

Only works if you're using an NVIDIA GPU, obviously...

If you feel like sharing, I'd be interested in your entire script, if that's not what was posted above. Hope this helps!

BBC-Esq commented 7 months ago

One additional suggestion: if you don't change how the model is created each time, you might consider deleting the "model" object after each call, in the same place where you said you ran garbage collection to troubleshoot.

Here's a snippet from one of my scripts where I delete the model, run garbage collection, clear NVIDIA memory, and throw in as much as I can to free up memory. ;-)

This example is in my voice transcriber script:

    def ReleaseTranscriber(self):
        if hasattr(self.model, 'model'):
            del self.model.model
        if hasattr(self.model, 'feature_extractor'):
            del self.model.feature_extractor
        if hasattr(self.model, 'hf_tokenizer'):
            del self.model.hf_tokenizer
        del self.model
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()
        my_cprint("Whisper model removed from memory.", 'red')

With faster-whisper I haven't noticed stray objects being created that you can't "del", unlike with some other libraries (e.g., certain embedding or vision models). So if you do decide to delete the "model" object after every call, I would think you'd be fine, that is, unless you're a customer call center with non-stop transcription tasks.

With that being said, if you want to check out my full script it's located here:

https://github.com/BBC-Esq/ChromaDB-Plugin-for-LM-Studio/blob/main/src/vision_module.py

I also use faster-whisper for the transcribe file functionality of my program, so feel free to check out those scripts too if you want. Good luck!

lparisi commented 5 months ago

I'm having a similar issue that seems to be related to CTranslate2. I'm running 4x H100 80 GB and running faster-whisper on top of Ray. Roughly every 20k files transcribed I get a random OOM, even though memory consumption stays below 80% on all GPUs the whole time.

ctranslate2::cuda::CudaAsyncAllocator::free() is the culprit here.

I added logic to remove the actors, reinstantiate the models, and carry on.
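
A rough sketch of that recycling logic might look like the following; the actor class, the recycle interval, and the rest of the wiring are assumptions for illustration, not the actual code:

import ray
from faster_whisper import WhisperModel

@ray.remote(num_gpus=1)
class Transcriber:
    def __init__(self):
        self.model = WhisperModel("large-v3", device="cuda", compute_type="float16")

    def transcribe(self, path):
        segments, _info = self.model.transcribe(path, beam_size=5)
        return [segment.text for segment in segments]

def transcribe_all(files, recycle_every=20000):
    actor = Transcriber.remote()
    for i, path in enumerate(files, start=1):
        yield ray.get(actor.transcribe.remote(path))
        if i % recycle_every == 0:
            # Kill the actor to force its GPU memory to be released, then start a fresh one
            ray.kill(actor, no_restart=True)
            actor = Transcriber.remote()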

lparisi commented 5 months ago

(Transcriber pid=3623856) [2024-03-21 21:59:48,340 E 3623856 3632701] logging.cc:104: Stack trace: 
(Transcriber pid=3623856)  /root/miniconda3/envs/ray-env/lib/python3.10/site-packages/ray/_raylet.so(+0xfebc9a) [0x7f9fd4223c9a] ray::operator<<()
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/lib/python3.10/site-packages/ray/_raylet.so(+0xfee3d8) [0x7f9fd42263d8] ray::TerminateHandler()
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/bin/../lib/libstdc++.so.6(+0xb643c) [0x7f9fd30fd43c] __cxxabiv1::__terminate()
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/bin/../lib/libstdc++.so.6(+0xb57ff) [0x7f9fd30fc7ff] __cxa_call_terminate
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/bin/../lib/libstdc++.so.6(__gxx_personality_v0+0x356) [0x7f9fd30fd07f] __gxx_personality_v0
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/bin/../lib/libgcc_s.so.1(+0x12743) [0x7f9fd303e743] _Unwind_RaiseException_Phase2
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/bin/../lib/libgcc_s.so.1(_Unwind_RaiseException+0xf1) [0x7f9fd303eae5] _Unwind_RaiseException
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/bin/../lib/libstdc++.so.6(__cxa_throw+0x46) [0x7f9fd30fd673] __cxa_throw
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/lib/python3.10/site-packages/ctranslate2/../ctranslate2.libs/libctranslate2-db5a9b25.so.4.1.0(_ZN11ctranslate24cuda18CudaAsyncAllocator4freeEPvi+0x111) [0x7f7401b93ce1] ctranslate2::cuda::CudaAsyncAllocator::free()
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/lib/python3.10/site-packages/ctranslate2/../ctranslate2.libs/libctranslate2-db5a9b25.so.4.1.0(_ZN11ctranslate211StorageView7releaseEv+0x1f) [0x7f7401b2e2ef] ctranslate2::StorageView::release()
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/lib/python3.10/site-packages/ctranslate2/../ctranslate2.libs/libctranslate2-db5a9b25.so.4.1.0(_ZN11ctranslate211StorageViewD1Ev+0x9) [0x7f7401b2e379] ctranslate2::StorageView::~StorageView()
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/lib/python3.10/site-packages/ctranslate2/../ctranslate2.libs/libctranslate2-db5a9b25.so.4.1.0(+0x2c7314) [0x7f7401ae4314] ctranslate2::ReplicaPool<>::BatchJob<>::run()
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/lib/python3.10/site-packages/ctranslate2/../ctranslate2.libs/libctranslate2-db5a9b25.so.4.1.0(_ZN11ctranslate26Worker3runERNS_8JobQueueE+0x84) [0x7f7401b39124] ctranslate2::Worker::run()
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/bin/../lib/libstdc++.so.6(+0xd3e95) [0x7f9fd311ae95] execute_native_thread_routine
(Transcriber pid=3623856) /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f9fd4f71ac3]
(Transcriber pid=3623856) /lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7f9fd5002a04] __clone
(Transcriber pid=3623856) 
(Transcriber pid=3623856) *** SIGABRT received at time=1711058388 on cpu 93 ***
(Transcriber pid=3623856) PC: @     0x7f9fd4f739fc  (unknown)  pthread_kill
(Transcriber pid=3623856)     @     0x7f9fd4f1f520  (unknown)  (unknown)
(Transcriber pid=3623856) [2024-03-21 21:59:48,340 E 3623856 3632701] logging.cc:361: *** SIGABRT received at time=1711058388 on cpu 93 ***
(Transcriber pid=3623856) [2024-03-21 21:59:48,340 E 3623856 3632701] logging.cc:361: PC: @     0x7f9fd4f739fc  (unknown)  pthread_kill
(Transcriber pid=3623856) [2024-03-21 21:59:48,340 E 3623856 3632701] logging.cc:361:     @     0x7f9fd4f1f520  (unknown)  (unknown)
(Transcriber pid=3623856) Fatal Python error: Aborted

priyakeerthi commented 3 months ago

@lparisi

Added a logic to remove the actors, reinstantiate the models and carry on.

It would be great if you could let me know the fix. Thanks!

Sharrnah commented 3 months ago

If I understand it correctly, he is just removing the model from memory and reloading it. That's not really a solution but a workaround, as it introduces a huge delay until it can transcribe again.

So maybe something like this here: https://github.com/SYSTRAN/faster-whisper/issues/660#issuecomment-1924938848

And the question is: when do you reload the model? Do you check the memory consumption of the Python script, or just reload every 100 transcriptions?
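
One possible policy, as a minimal sketch under my own assumptions (pynvml is the library mentioned earlier in this thread; the threshold, class name, and model settings are illustrative): check GPU memory before each transcription and rebuild the model only when usage crosses a threshold, rather than on a fixed schedule.

import gc
import pynvml
from faster_whisper import WhisperModel

def gpu_used_fraction(device_index=0):
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return info.used / info.total
    finally:
        pynvml.nvmlShutdown()

class ReloadingTranscriber:
    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.model = WhisperModel("large-v3", device="cuda", compute_type="int8")

    def transcribe(self, path):
        if gpu_used_fraction() > self.threshold:
            # Drop the old model before loading the new one so that both are
            # never resident on the GPU at the same time.
            del self.model
            gc.collect()
            self.model = WhisperModel("large-v3", device="cuda", compute_type="int8")
        segments, _info = self.model.transcribe(path, beam_size=5)
        return [segment.text for segment in segments]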