techno156 opened this issue 4 months ago
Throwing my experience in as well. I have a similar issue, even when using a GPU (GTX 1070) for transcription. The GPU is properly utilized once the transcription actually starts, but for long-running audio, the pre-transcription phase consumes an immense amount of system memory. On the GPU itself, it only consumes about 4.8GB of VRAM.
The video from which the audio is being transcribed is 24 hours long, but was recorded in 160p, so it's only about 2.5GB in size. I kept running out of RAM and finally got the transcription to succeed after quite a bit of trial and error, ending up with a 41GB swap file in addition to the 8GB of system memory. Watching in htop, I could see both memory and swap usage balloon to consume just about all of it.
This much memory usage seems excessive, but then again I have no idea what it's doing under the hood, so maybe it's normal.
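For a rough sense of scale: as far as I can tell from the example code, whisper.cpp decodes the whole input to 32-bit float samples and computes the mel spectrogram for the full file before inference begins, so the pre-transcription footprint grows linearly with duration. A back-of-the-envelope sketch, assuming 16kHz mono input and Whisper's standard spectrogram parameters (80 mel bins, as in the log below, and a 160-sample hop):

```python
# Back-of-the-envelope estimate of whisper.cpp's pre-transcription
# memory use for a long recording. Assumptions (hedged): the whole
# file is decoded to float32 in memory, and the full mel spectrogram
# is computed up front with 80 mel bins and a 160-sample hop.

HOURS = 24       # length of the recording
SR = 16_000      # sample rate whisper.cpp expects
N_MELS = 80      # mel bins (matches n_mels = 80 in the log)
HOP = 160        # samples per mel frame (10 ms)
F32 = 4          # bytes per float32

samples = HOURS * 3600 * SR
pcm_bytes = samples * F32                    # decoded audio held in RAM
mel_bytes = (samples // HOP) * N_MELS * F32  # full mel spectrogram

print(f"decoded audio:   {pcm_bytes / 2**30:.1f} GiB")  # ~5.1 GiB
print(f"mel spectrogram: {mel_bytes / 2**30:.1f} GiB")  # ~2.6 GiB
```

That's roughly 7.7 GiB of float buffers for a 24-hour file before any inference happens, enough by itself to push an 8GB machine into swap, though it doesn't explain the full 41GB of swap observed, so other intermediates presumably contribute.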
model: medium (it doesn't seem to matter much which model I use)
In my case, the workaround was to split the audio file.
About 30GB of SSD space is used whenever I transcribe something using the large-v2 model, and the only way to get that space back is to restart my Mac. If there's any way to reclaim that space without restarting, that would be great.
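I don't know where that space actually goes; my guess (unverified) is temporary Core ML compilation artifacts under the per-user temp directory, which macOS clears on reboot. As a purely diagnostic sketch, something like the following could rank the largest entries there — the location is an assumption, not a confirmed whisper.cpp behaviour:

```python
# Diagnostic sketch: rank the largest entries in the per-user
# temp directory, on the unverified assumption that the missing
# ~30GB is temporary Core ML compilation output stored there.
import os
from pathlib import Path

def total_size(path: Path) -> int:
    """Recursive size in bytes of a file or directory tree."""
    if path.is_file():
        return path.stat().st_size
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda e: None):
        for name in files:
            try:
                total += (Path(root) / name).stat().st_size
            except OSError:
                pass  # files can vanish while we scan
    return total

tmp = Path(os.environ.get("TMPDIR", "/tmp"))
sizes = {entry: total_size(entry) for entry in tmp.iterdir()}
for entry, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{size / 2**30:6.2f} GiB  {entry}")
```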
whisper.cpp version 1.7.1
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v2.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 5 (large)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: Metal total size = 3093.99 MB
whisper_model_load: model size = 3093.99 MB
whisper_backend_init_gpu: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 11453.25 MB
whisper_backend_init: using BLAS backend
whisper_mel_init: n_len = 6000, n_len_org = 6000, n_mel = 80
whisper_init_state: kv self size = 251.66 MB
whisper_init_state: kv cross size = 251.66 MB
whisper_init_state: kv pad size = 7.86 MB
whisper_init_state: loading Core ML model from 'models/ggml-large-v2-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded
whisper_init_state: compute buffer (conv) = 10.21 MB
whisper_init_state: compute buffer (cross) = 16.93 MB
whisper_init_state: compute buffer (decode) = 215.82 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 1 | OPENVINO = 0
Bug Description
I have a pcm_s16le-encoded 16kHz WAV file (about 3 GiB in size), extracted from a video that's approximately 10 hours long (https://www.twitch.tv/videos/2201134431), which I would like to pass to whisper for transcription using the tiny model. However, attempts to transcribe this file result in the machine running out of memory, and the process either being killed, or the machine requiring a restart.

Machine Specifications
CPU: Intel Core i3-4130T
Memory: 8GB + 8GB Swap
OS Version: Arch Linux 6.9.9
Whisper.cpp Version: 1.6.2-1
Replication Steps:
Extract the audio from the video and attempt to transcribe it using the tiny model.

Alternatives/Workarounds Tried
One workaround that I've tried, and that seems to work, is to break the file up into smaller segments and process the segments individually. However, this requires something more complicated than the relatively simple one or two lines otherwise needed. Splitting by time risks interrupting someone talking, and splitting by silence requires more complex scripting, assuming a large enough gap to split on can be found at all. Splitting also causes problems when using whisper.cpp to create subtitles for a video, as the segments would all need their timings adjusted when recombined into a single file that can be applied to the original video (see the sketch below).
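The re-timing part, at least, is mechanical. Below is a minimal sketch, assuming each segment was transcribed to its own SRT file and the segment start offsets (in seconds) are known; the file names and offsets are hypothetical, purely for illustration.

```python
# Minimal sketch: shift per-segment SRT timestamps by each
# segment's start offset and concatenate, so the merged subtitles
# line up with the original, unsplit video.
import re

TS = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def shift(match: re.Match, offset_s: float) -> str:
    """Return the matched HH:MM:SS,mmm timestamp shifted by offset_s."""
    h, m, s, ms = (int(g) for g in match.groups())
    total = ((h * 60 + m) * 60 + s) * 1000 + ms + round(offset_s * 1000)
    h, rest = divmod(total, 3_600_000)
    m, rest = divmod(rest, 60_000)
    s, ms = divmod(rest, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def merge(segments: list[tuple[str, float]], out_path: str) -> None:
    """segments: (srt_path, start_offset_seconds) in playback order."""
    cue = 1
    with open(out_path, "w", encoding="utf-8") as out:
        for path, offset in segments:
            text = open(path, encoding="utf-8").read().strip()
            for block in text.split("\n\n"):
                lines = block.splitlines()
                # lines[0] is the cue number, lines[1] the timing line
                timing = TS.sub(lambda m: shift(m, offset), lines[1])
                out.write(f"{cue}\n{timing}\n" + "\n".join(lines[2:]) + "\n\n")
                cue += 1

# Hypothetical segment files and their start offsets in seconds.
merge([("part1.srt", 0.0), ("part2.srt", 3600.0)], "combined.srt")
```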
A variant of the above is to split the video file instead of just the extracted audio, which would resolve the timing issue, but that would introduce additional overhead and coding complexity, as each segment would need its subtitles added before the segments are recombined in the same sequence as the original video.