Open naveen-medloop opened 7 months ago
Same issue here.
I compared it with the official Whisper implementation, though I'm not sure whether it's related to faster-whisper or something directly in CTranslate2.
I am running faster-whisper in a Flask app for real-time transcription. It runs, occupies memory, and does not release it; eventually the app gets killed. I tried clearing the Torch cache and running the garbage collector, but nothing worked.
. . .
I don't know if this helps. I'm not an experienced programmer... but here are a few suggestions that ChatGPT gave:
(1)
Repeated Model Loading: Each transcription request seems to load the model anew, as indicated by the logs: [info] Loaded model ... on device cuda:0 This repeated loading can significantly consume memory, especially on a GPU. Consider loading the model once at the start of your application and then reusing it for each transcription request.
. . .
(2)
CUDA Memory Management: CUDA memory can be fragmented over time, especially with operations that allocate and deallocate memory repeatedly. While not directly shown in your script, if there are other parts of your application that manipulate CUDA memory, consider reviewing them as well.
These are just suggestions; I haven't tried them myself...
(3)
Model Initialization: Move the model initialization (WhisperModel("large-v3", device="cuda", compute_type="int8", cpu_threads=8)) outside the request handling function, so it is done once. This can prevent the repeated loading and initialization overhead for each request.
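To illustrate suggestion (3), here is a minimal sketch of loading the model once at module level instead of inside the request handler. The `_load_model` stub stands in for the real `WhisperModel("large-v3", device="cuda", compute_type="int8", cpu_threads=8)` call, and the double-checked lock is my own addition so Flask's threaded server can't race two loads:

```python
import threading

def _load_model():
    # Hypothetical stand-in for the expensive load; in the real app this
    # would be WhisperModel("large-v3", device="cuda",
    #                       compute_type="int8", cpu_threads=8)
    return object()

_model = None
_model_lock = threading.Lock()

def get_model():
    """Return the shared model, loading it at most once (thread-safe)."""
    global _model
    if _model is None:
        with _model_lock:
            # Re-check under the lock so only one thread ever loads.
            if _model is None:
                _model = _load_model()
    return _model
```

Each Flask view would then call `get_model().transcribe(...)` instead of constructing a new `WhisperModel`, so "Loaded model ... on device cuda:0" appears only once in the logs.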
However, I can provide you with a function that I put in a utilities script, and then call throughout my various scripts if I sense a memory issue...
import pynvml

def print_cuda_memory_usage():
    '''
    from utilities import print_cuda_memory_usage
    print_cuda_memory_usage()
    '''
    try:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"Memory Total: {memory_info.total / 1024**2} MB")
        print(f"Memory Used: {memory_info.used / 1024**2} MB")
        print(f"Memory Free: {memory_info.free / 1024**2} MB")
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        pynvml.nvmlShutdown()
This is the version of the library that I'm using:
pip install nvidia-ml-py==12.535.108
Only works if you're using an NVIDIA GPU, obviously...
If you feel like sharing, I'd be interested in your entire script if that's not what was posted above. Hope this helps!
One additional suggestion... If you don't change how the model is created each time, you might consider deleting the "model" object after each call, like where you said you did garbage collection after each call to troubleshoot.
Here's a snippet from one of my scripts that deletes the model and then does garbage collection, clears NVIDIA memory, as much crap as I could throw in there to free up memory. ;-)
This example is in my voice transcriber script:
import gc
import torch

def ReleaseTranscriber(self):
    if hasattr(self.model, 'model'):
        del self.model.model
    if hasattr(self.model, 'feature_extractor'):
        del self.model.feature_extractor
    if hasattr(self.model, 'hf_tokenizer'):
        del self.model.hf_tokenizer
    del self.model
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    gc.collect()
    # my_cprint is a color-print helper from my utilities script
    my_cprint("Whisper model removed from memory.", 'red')
I haven't noticed with faster-whisper that random objects are created that you can't "del" like with other libraries - e.g. some embedding or vision models. So if you do decide to delete the "model" object after every call I would think you'd be fine, that is, unless you're a customer call center with non-stop transcription tasks.
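If you do go the delete-after-every-call route, a small context manager can make sure the cleanup always runs, even when transcription raises. This is only a sketch; `factory` would be a zero-arg callable such as a hypothetical `lambda: WhisperModel("large-v3", device="cuda")`:

```python
import gc
from contextlib import contextmanager

@contextmanager
def scoped_model(factory):
    # factory is any zero-arg callable that builds the model,
    # e.g. (hypothetically) lambda: WhisperModel("large-v3", device="cuda")
    model = factory()
    try:
        yield model
    finally:
        # Drop our reference and force a collection pass so the
        # model's memory can actually be reclaimed.
        del model
        gc.collect()
```

Usage would be `with scoped_model(...) as model: model.transcribe(path)`; on CUDA setups a `torch.cuda.empty_cache()` could go in the `finally` block as well.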
With that being said, if you want to check out my full script it's located here:
https://github.com/BBC-Esq/ChromaDB-Plugin-for-LM-Studio/blob/main/src/vision_module.py
I also use faster-whisper for the transcribe file functionality of my program, so feel free to check out those scripts too if you want. Good luck!
I'm having a similar issue, and it seems to be related to CTranslate2. I'm running 4x H100 80GB with faster-whisper on top of Ray. Roughly every 20k files transcribed I would get a random OOM, even though memory consumption over time stays below 80% on all GPUs, all the time.
ctranslate2::cuda::CudaAsyncAllocator::free()
is the culprit here.
I added logic to remove the actors, re-instantiate the models, and carry on.
(Transcriber pid=3623856) [2024-03-21 21:59:48,340 E 3623856 3632701] logging.cc:104: Stack trace:
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/lib/python3.10/site-packages/ray/_raylet.so(+0xfebc9a) [0x7f9fd4223c9a] ray::operator<<()
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/lib/python3.10/site-packages/ray/_raylet.so(+0xfee3d8) [0x7f9fd42263d8] ray::TerminateHandler()
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/bin/../lib/libstdc++.so.6(+0xb643c) [0x7f9fd30fd43c] __cxxabiv1::__terminate()
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/bin/../lib/libstdc++.so.6(+0xb57ff) [0x7f9fd30fc7ff] __cxa_call_terminate
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/bin/../lib/libstdc++.so.6(__gxx_personality_v0+0x356) [0x7f9fd30fd07f] __gxx_personality_v0
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/bin/../lib/libgcc_s.so.1(+0x12743) [0x7f9fd303e743] _Unwind_RaiseException_Phase2
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/bin/../lib/libgcc_s.so.1(_Unwind_RaiseException+0xf1) [0x7f9fd303eae5] _Unwind_RaiseException
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/bin/../lib/libstdc++.so.6(__cxa_throw+0x46) [0x7f9fd30fd673] __cxa_throw
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/lib/python3.10/site-packages/ctranslate2/../ctranslate2.libs/libctranslate2-db5a9b25.so.4.1.0(_ZN11ctranslate24cuda18CudaAsyncAllocator4freeEPvi+0x111) [0x7f7401b93ce1] ctranslate2::cuda::CudaAsyncAllocator::free()
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/lib/python3.10/site-packages/ctranslate2/../ctranslate2.libs/libctranslate2-db5a9b25.so.4.1.0(_ZN11ctranslate211StorageView7releaseEv+0x1f) [0x7f7401b2e2ef] ctranslate2::StorageView::release()
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/lib/python3.10/site-packages/ctranslate2/../ctranslate2.libs/libctranslate2-db5a9b25.so.4.1.0(_ZN11ctranslate211StorageViewD1Ev+0x9) [0x7f7401b2e379] ctranslate2::StorageView::~StorageView()
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/lib/python3.10/site-packages/ctranslate2/../ctranslate2.libs/libctranslate2-db5a9b25.so.4.1.0(+0x2c7314) [0x7f7401ae4314] ctranslate2::ReplicaPool<>::BatchJob<>::run()
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/lib/python3.10/site-packages/ctranslate2/../ctranslate2.libs/libctranslate2-db5a9b25.so.4.1.0(_ZN11ctranslate26Worker3runERNS_8JobQueueE+0x84) [0x7f7401b39124] ctranslate2::Worker::run()
(Transcriber pid=3623856) /root/miniconda3/envs/ray-env/bin/../lib/libstdc++.so.6(+0xd3e95) [0x7f9fd311ae95] execute_native_thread_routine
(Transcriber pid=3623856) /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f9fd4f71ac3]
(Transcriber pid=3623856) /lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7f9fd5002a04] __clone
(Transcriber pid=3623856)
(Transcriber pid=3623856) *** SIGABRT received at time=1711058388 on cpu 93 ***
(Transcriber pid=3623856) PC: @ 0x7f9fd4f739fc (unknown) pthread_kill
(Transcriber pid=3623856) @ 0x7f9fd4f1f520 (unknown) (unknown)
(Transcriber pid=3623856) [2024-03-21 21:59:48,340 E 3623856 3632701] logging.cc:361: *** SIGABRT received at time=1711058388 on cpu 93 ***
(Transcriber pid=3623856) [2024-03-21 21:59:48,340 E 3623856 3632701] logging.cc:361: PC: @ 0x7f9fd4f739fc (unknown) pthread_kill
(Transcriber pid=3623856) [2024-03-21 21:59:48,340 E 3623856 3632701] logging.cc:361: @ 0x7f9fd4f1f520 (unknown) (unknown)
(Transcriber pid=3623856) Fatal Python error: Aborted
@lparisi
Added a logic to remove the actors, reinstantiate the models and carry on.
I would be grateful if you could let me know the fix. Thanks!
If I understand it correctly, he is just removing the models from memory and reloading them. That's not really a solution but a workaround, as it introduces a huge delay until it can transcribe again.
So maybe something like this here: https://github.com/SYSTRAN/faster-whisper/issues/660#issuecomment-1924938848
And the question is: when do you reload the model? Do you check the memory consumption of the Python script, or just reload every 100 transcriptions?
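One possible answer to "when do you reload?" is a simple wrapper that tears the model down and rebuilds it after a fixed number of uses. This is only a sketch with hypothetical `load_model`/`release_model` callables (the latter would do the `del`/`gc.collect()`/`empty_cache()` cleanup); a memory-threshold check could replace the counter just as easily:

```python
class ReloadingTranscriber:
    """Reload the underlying model every `max_uses` transcriptions."""

    def __init__(self, load_model, release_model, max_uses=100):
        self._load = load_model          # zero-arg callable building the model
        self._release = release_model    # one-arg callable freeing the model
        self._max_uses = max_uses
        self._uses = 0
        self._model = self._load()

    def transcribe(self, audio):
        # Tear down and rebuild before the call once the budget is spent,
        # so a fresh model (and fresh allocator state) handles the request.
        if self._uses >= self._max_uses:
            self._release(self._model)
            self._model = self._load()
            self._uses = 0
        self._uses += 1
        return self._model(audio)
```

The reload pause then lands at a predictable point instead of as a random OOM mid-batch.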
Issue quick look
I am running faster-whisper in a Flask app for real-time transcription. It runs, occupying memory and not releasing it. Eventually, the app gets killed. I tried clearing the Torch cache and using the garbage collector, but nothing worked.
Detail info
Hi, I am running faster-whisper in a Flask app to implement real-time transcription in my application.
I am running the app on the Deep Learning OSS Nvidia Driver AMI GPU PyTorch 1.13.1 (Amazon Linux 2):
NVIDIA driver version: 535.129.03, CUDA version: 11.7, Python 3.10.2
Here is my code:
When running the server, each call occupies some space in memory and does not release it. Eventually, the app gets killed. Even though I have placed try/except blocks, no exceptions are printed. Only once the app has been killed is the entire memory released.
server logs
CMD to run server
CT2_VERBOSE=1 python3.10 -m flask run --host=0.0.0.0 -p 3000
Here I am attaching images of server memory usage before and after the crash. After the app crashes, the whole memory is released.
Video: https://drive.google.com/file/d/1SfxOLEnwCVH9AzRS_mtg1V8bmpiZdTfI/view
I have tried the following solutions, but none of them worked:
Please suggest a way to fix this issue.