ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Whisper.cpp consumes unusually large amounts of system memory when transcribing very long wave files #2310

Open techno156 opened 4 months ago

techno156 commented 4 months ago

Bug Description

I have a pcm_s16le-encoded 16 kHz WAV file (about 3 GiB in size), extracted from a video that's approximately 10 hours long (https://www.twitch.tv/videos/2201134431), which I would like to pass to whisper.cpp for transcription using the tiny model. However, attempts to transcribe this file make the machine run out of memory, and either the process is killed or the machine has to be restarted.
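As a rough back-of-envelope (my own guess at the mechanism, assuming the whole file is decoded to 32-bit float and the mel spectrogram is computed for the entire clip up front): widening 3 GiB of s16le samples to float32 takes about 6 GiB, and a whole-file spectrogram at 80 mel bins with a 10 ms hop costs another 2 bytes per input sample (80 × 4 bytes per 160-sample hop), i.e. roughly another 3 GiB, so the working set would approach 9 GiB before the model even runs.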

Machine Specifications

CPU: Intel Core i3-4130T
Memory: 8 GB + 8 GB swap
OS Version: Arch Linux 6.9.9
whisper.cpp Version: 1.6.2-1

Replication Steps:

  1. Obtain a very long audio file (for reference, I'm using the audio from the video linked above, converted to a WAV file). To avoid corruption, a 64-bit header was used for the WAV file.
  2. Attempt to output an SRT/VTT transcription from the file, using the tiny model.
  3. whisper.cpp eats up all the RAM and swap, and is either killed or effectively crashes the machine.

Alternatives/Workarounds Tried

One workaround that I've tried, and that seems to work, is to break the file up into smaller segments and process the segments individually. However, this requires something more complicated than the otherwise simple one or two lines of scripting: splitting by time risks cutting someone off mid-sentence, and splitting on silence needs more complex scripting, assuming a large enough gap can even be found. Splitting also causes problems when using whisper.cpp to create subtitles for a video, since the segments' timings all have to be adjusted when they are recombined into a single file that can be applied to the original video.

A variant of the above is to split the video file instead of just the extracted audio, which would resolve the timing issue, but that introduces additional overhead and scripting complexity, since each segment needs its subtitles added and the segments then have to be recombined in the same order as the original video (a rough sketch of the chunked approach is below).
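For reference, here's the kind of chunked processing I mean, written against whisper.cpp's C API rather than the CLI. This is an untested sketch, not something from the repo: it assumes the audio has been pre-converted to a headerless 16 kHz mono s16le stream (e.g. with `ffmpeg -i input.mp4 -ar 16000 -ac 1 -f s16le audio.pcm`), reads it in fixed 10-minute chunks so only one chunk's samples are resident at a time, and shifts each segment's timestamps by the running offset so the combined output still lines up with the original video:

```cpp
#include "whisper.h"

#include <cstdint>
#include <cstdio>
#include <vector>

int main(int argc, char ** argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s model.bin audio.pcm\n", argv[0]);
        return 1;
    }

    struct whisper_context * ctx = whisper_init_from_file_with_params(
            argv[1], whisper_context_default_params());
    if (!ctx) {
        return 1;
    }

    FILE * f = fopen(argv[2], "rb");
    if (!f) {
        whisper_free(ctx);
        return 1;
    }

    // 10-minute chunks: only this much PCM (plus its mel) is in memory at once
    const size_t n_chunk = (size_t) 10*60*WHISPER_SAMPLE_RATE;

    std::vector<int16_t> raw(n_chunk);
    std::vector<float>   pcm(n_chunk);

    struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    int64_t offset_cs = 0; // running chunk offset in centiseconds (whisper's timestamp unit)

    size_t n = 0;
    while ((n = fread(raw.data(), sizeof(int16_t), n_chunk, f)) > 0) {
        // widen s16 -> f32 for this chunk only
        for (size_t i = 0; i < n; i++) {
            pcm[i] = raw[i]/32768.0f;
        }

        if (whisper_full(ctx, wparams, pcm.data(), (int) n) != 0) {
            break;
        }

        // shift the chunk-local timestamps by the running offset
        for (int i = 0; i < whisper_full_n_segments(ctx); i++) {
            const int64_t t0 = offset_cs + whisper_full_get_segment_t0(ctx, i);
            const int64_t t1 = offset_cs + whisper_full_get_segment_t1(ctx, i);
            printf("[%lld -> %lld] %s\n", (long long) t0, (long long) t1,
                    whisper_full_get_segment_text(ctx, i));
        }

        offset_cs += (int64_t) n*100/WHISPER_SAMPLE_RATE; // samples -> centiseconds
    }

    fclose(f);
    whisper_free(ctx);
    return 0;
}
```

The obvious caveat is the same one as splitting by time: a chunk boundary can land mid-word, so a real version would want some overlap between chunks or silence-based cut points. But it avoids both the whole-file decode and the per-segment retiming step.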

Coalbus commented 3 months ago

Throwing my experience in as well: I have a similar issue, even when using a GPU (GTX 1070) for transcription. The GPU is properly utilized once the transcription actually starts, but for long audio the pre-transcription phase consumes an immense amount of system memory. On the GPU side it only uses about 4.8 GB of VRAM.

The video the audio comes from is 24 hours long, but it was recorded at 160p, so it's only about 2.5 GB. I kept running out of RAM and finally got the transcription to succeed after quite a bit of trial and error, ending up with a 41 GB swap file in addition to the 8 GB of system memory. Watching in htop, I could see both memory and swap usage balloon to consume just about all of it.

This much memory usage seems excessive, but then again I have no idea what it's doing under the hood, so maybe it's normal.

model: medium (it doesn't seem to matter much which model I use)

bydeus commented 3 months ago

In my case, I worked around it by splitting the audio file.

swswsws583 commented 2 weeks ago

About 30 GB of SSD space gets used whenever I transcribe something with the large-v2 model, and the only way to get that space back is to restart my Mac. If there's any way to reclaim that space without restarting, that would be great.

whisper.cpp version 1.7.1

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v2.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:    Metal total size =  3093.99 MB
whisper_model_load: model size    = 3093.99 MB
whisper_backend_init_gpu: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
whisper_backend_init: using BLAS backend
whisper_mel_init: n_len = 6000, n_len_org = 6000, n_mel = 80
whisper_init_state: kv self size  =  251.66 MB
whisper_init_state: kv cross size =  251.66 MB
whisper_init_state: kv pad  size  =    7.86 MB
whisper_init_state: loading Core ML model from 'models/ggml-large-v2-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded
whisper_init_state: compute buffer (conv)   =   10.21 MB
whisper_init_state: compute buffer (cross)  =   16.93 MB
whisper_init_state: compute buffer (decode) =  215.82 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 1 | OPENVINO = 0