While processing a job with Whisper on CUDA, the Python process alone consumed over 21GiB of system memory (not VRAM!). On a machine with 16GiB of RAM this made the runner exit immediately with code 0, so it could not process any of the jobs it retrieved. The "fix" was to add 20GiB of swap. On CPU, the Python process running the same job with the same model only consumed about 8GiB of system memory, although htop also reports nearly 20GiB in the VIRT column. That doesn't matter for the CPU runner, but it is still noteworthy: maybe Whisper keeps data on disk and only loads it into memory in chunks when running on CPU, while on CUDA it has to put all of it into memory at once? Another guess is that CUDA unified memory is involved, which would keep data in system memory and only lazy-load it into VRAM when needed. These are just wild guesses, though. In any case, this should not happen.
I have no idea how to even begin fixing this. Maybe let's just do #24 and hope faster-whisper doesn't do that :D
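To narrow this down, it might help to log resident vs. virtual memory from inside the Python process at a few points (e.g. after loading the model and during transcription) and compare the CPU and CUDA runs. Below is a minimal sketch, assuming Linux and its `/proc/<pid>/status` file. The helper names (`parse_status`, `memory_report`, `format_report`) are made up for illustration and are not part of Whisper or this project. `VmRSS` roughly corresponds to htop's RES column and `VmSize` to VIRT.

```python
def parse_status(text):
    """Extract VmRSS, VmSize and VmSwap (in KiB) from /proc/<pid>/status content."""
    wanted = {"VmRSS", "VmSize", "VmSwap"}
    result = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            # Lines look like "VmRSS:\t  8388608 kB"; the kernel reports KiB.
            result[key] = int(value.split()[0])
    return result


def memory_report(pid="self"):
    """Read the memory counters of a process (default: the current one). Linux only."""
    with open(f"/proc/{pid}/status") as f:
        return parse_status(f.read())


def format_report(label, stats):
    """Render the counters as GiB so they can be compared with htop."""
    gib = {key: kib / (1024 * 1024) for key, kib in stats.items()}
    return f"{label}: " + ", ".join(f"{key}={gib[key]:.2f}GiB" for key in sorted(gib))


if __name__ == "__main__":
    # Hypothetical usage: call this before/after loading the model and
    # again while a job is being transcribed.
    print(format_report("current process", memory_report()))
```

If `VmRSS` on the CUDA run climbs to ~21GiB while the CPU run stays around 8GiB with a large `VmSize`, that would at least confirm the difference is in resident memory and not just in reserved address space.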