RyenNelsen opened 3 months ago
Yep, totally unusable with Mixtral 8x7b models: near-instant CUDA OOM with settings that succeed under 1.61.2.
Seems to work fine for me. Is mmap enabled? Do you encounter the same issue with stock llama.cpp?
There seems to be a regression in llama.cpp between versions b2586 and b2589. Any version newer than b2586 causes the same issue when using OpenBLAS or CUDA. Disabling mmap works as a workaround. I opened issue ggerganov/llama.cpp#6652 to get a fix upstream.
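For anyone who wants the workaround in code form, here is a minimal sketch using the llama-cpp-python bindings (the model path and layer count are placeholders, not values from this thread). In koboldcpp itself the equivalent is the `--nommap` launch option; in stock llama.cpp, `--no-mmap`.

```python
# Minimal sketch of the "disable mmap" workaround via llama-cpp-python.
# Model path and n_gpu_layers are placeholder values.
from llama_cpp import Llama

llm = Llama(
    model_path="models/mixtral-8x7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,   # partial CUDA offload, as in the reports in this thread
    use_mmap=False,    # disabling mmap avoids the runaway RAM growth
)
print(llm("Hello", max_tokens=8))
```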
Since this is an issue with llama.cpp, feel free to close this issue @LostRuins. I don't know if you want to keep this open or not.
I'll keep it open and track it. Thanks.
Okay, so I'm not the only one who noticed this! I was using a 4x7B model which regularly used about 5GB of RAM (I think), and now it requests 13GB, more than double my total system memory (6GB).
Upstream says the model needs to be reconverted.
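For reference, a rough sketch of what a reconversion might look like, assuming the converter script and `quantize` binary from a current llama.cpp checkout. All paths, the script name, and the quant type below are placeholders; check the llama.cpp README for the exact flow for your model.

```python
# Rough reconversion sketch; assumes you are inside a llama.cpp checkout
# that contains convert.py and a built quantize binary. Paths and the
# quant type are placeholders.
import subprocess

HF_DIR = "models/Mixtral-8x7B-Instruct-v0.1"   # original HF weights
F16_GGUF = "models/mixtral-8x7b-f16.gguf"      # intermediate full-precision GGUF
Q_GGUF = "models/mixtral-8x7b-Q4_K_M.gguf"     # final quantized GGUF

# 1. Regenerate the GGUF with the current converter so the new tensor
#    layout is written out.
subprocess.run(
    ["python", "convert.py", HF_DIR, "--outfile", F16_GGUF, "--outtype", "f16"],
    check=True,
)

# 2. Re-quantize from the fresh f16 GGUF.
subprocess.run(["./quantize", F16_GGUF, Q_GGUF, "Q4_K_M"], check=True)
```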
Various Mixtral 8x7b models that I have tried to load CPU-only on version 1.62 consume a great deal more RAM than on version 1.61. On 1.61 my system memory hovers around 26GB, but on 1.62 it skyrockets to 48GB+.
This is before any queries are made to the model, and the same settings are used in both versions.
Platform: Windows
Edit: This also happens with the "cuda" version when offloading layers. Similar issue: CPU memory continues to grow while the model is loading.
Version 1.62.2 (before the model ever loads, the system starts paging): [screenshot]
Version 1.61.2 (fully loaded): [screenshot]