MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper

Deployment options #215

Open cristobal-larach opened 1 week ago

cristobal-larach commented 1 week ago

Hi @MahmoudAshraf97! Awesome job! I am really impressed by the accuracy of the Spanish transcriptions, so I am pretty excited that you shared this with the community.

I have been reading, but I haven't really found what the most cost-efficient way to deploy this kind of service would be for a high volume of audio. Any ideas on this matter? Thank you VERY much!

MahmoudAshraf97 commented 1 week ago

Hi, the cheapest way to deploy such a product is serverless deployment with spot instances if you can handle the latency, or on-demand instances if you can't. You create a serverless function with an HTTP trigger that spins up an instance to process the file, returns the results, and then spins the instance down. That way you pay only for your actual usage.
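
For illustration, here is a minimal sketch of that pattern in Python, assuming a RunPod-style serverless handler signature and an event payload that carries a downloadable audio URL; the payload field name and output handling are illustrative, not part of this repo:

```python
import subprocess
import tempfile
import urllib.request


def handler(event):
    """Download the audio, run whisper-diarization on it, and return the transcript."""
    audio_url = event["input"]["audio_url"]  # hypothetical payload field

    # Save the remote file to a temporary path so the CLI can read it.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        audio_path = tmp.name
    urllib.request.urlretrieve(audio_url, audio_path)

    # The repo's CLI entry point; it writes output files next to the input audio.
    subprocess.run(["python", "diarize.py", "-a", audio_path], check=True)

    # Read back the generated subtitle file (same base name as the audio).
    srt_path = audio_path.rsplit(".", 1)[0] + ".srt"
    with open(srt_path, encoding="utf-8") as f:
        return {"transcript": f.read()}
```

The key point is that the GPU worker only exists for the duration of the call, so idle time costs nothing; model download/load time on cold starts is the main overhead to watch.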

cristobal-larach commented 1 week ago

Great! Does it have to be a serverless platform that supports GPUs? Could I use CPUs and keep the quality at the expense of processing time?

transcriptionstream commented 1 week ago

As an alternative to a cloud/hosted instance, you can achieve substantial savings by running the product locally for a high volume of processing. GPUs are a must in my opinion.

cristobal-larach commented 1 week ago

@transcriptionstream I do not think I have the option/equipment to run it locally :(. My biggest concern is whether to run an API on a "cheap" GPU like an NVIDIA T4 on AWS or another provider, or go for a serverless solution (installing libraries and imports might take too much time, so the accumulated cost could rise a lot). My estimate is that I will be processing around 6,000-7,000 hours a month. The only affordable API at the moment is Groq Cloud, but it does not support diarization (which is a must for me).
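
For rough sizing, a back-of-envelope sketch like the one below might help; the real-time factor and hourly price are placeholder assumptions, not quotes, so plug in measured numbers from your own benchmarks:

```python
# Back-of-envelope cost estimate for the stated monthly audio volume.
# ASSUMPTIONS (illustrative only): the real-time factor and the GPU hourly
# price below are placeholders, not measured or quoted values.
audio_hours_per_month = 6500   # midpoint of the 6,000-7,000 hour estimate
realtime_factor = 10           # assumed: 1 GPU-hour processes ~10 hours of audio
gpu_price_per_hour = 0.55      # assumed USD/hour for a T4-class instance

gpu_hours = audio_hours_per_month / realtime_factor
monthly_cost = gpu_hours * gpu_price_per_hour

print(f"GPU-hours needed per month: {gpu_hours:.0f}")
print(f"Estimated monthly compute cost: ${monthly_cost:,.2f}")
```

Comparing that figure for a dedicated instance against per-second serverless billing (including cold-start overhead) should make the cheaper option for your volume fairly clear.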

transcriptionstream commented 1 week ago

Bummer! If you end up hosting it yourself, based on my experience you're going to want an RTX A6000 to start with for that level of volume, adding more from there depending on how close to "real time" you want to get. That is not an insignificant amount of audio.