RyenNelsen opened 3 months ago
Yep, totally unusable with Mixtral 8x7b models: near-instant CUDA OOM with settings that succeed under 1.61.2.
Seems to work fine for me. Is mmap enabled? Do you encounter the same issue with stock llama.cpp?
There seems to be a regression in llama.cpp between versions b2586 and b2589. Any version newer than b2586 causes the same issue when using OpenBLAS or CUDA. Disabling mmap works as a workaround. I opened issue ggerganov/llama.cpp#6652 to get a fix upstream.
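For anyone who wants the workaround in code form, here is a minimal sketch using the llama-cpp-python bindings (the model path and layer count are placeholders, not values from this thread). In koboldcpp itself the equivalent is the `--nommap` launch option; in stock llama.cpp, `--no-mmap`.

```python
# Minimal sketch of the "disable mmap" workaround via llama-cpp-python.
# Model path and n_gpu_layers are placeholder values.
from llama_cpp import Llama

llm = Llama(
    model_path="models/mixtral-8x7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,   # partial CUDA offload, as in the reports in this thread
    use_mmap=False,    # disabling mmap avoids the runaway RAM growth
)
print(llm("Hello", max_tokens=8))
```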
Since this is an issue with llama.cpp, feel free to close this issue @LostRuins. I don't know if you want to keep this open or not.
I'll keep it open and track it. Thanks.
Okay, so I'm not the only one who noticed this! I was using a 4x7B model which regularly used about 5GB of RAM (I think), and now it requests 13GB, more than double my total system memory (6GB).
Upstream says the model needs to be reconverted.
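For reference, a rough sketch of what a reconversion might look like, assuming the converter script and `quantize` binary from a current llama.cpp checkout. All paths, the script name, and the quant type below are placeholders; check the llama.cpp README for the exact flow for your model.

```python
# Rough reconversion sketch; assumes you are inside a llama.cpp checkout
# that contains convert.py and a built quantize binary. Paths and the
# quant type are placeholders.
import subprocess

HF_DIR = "models/Mixtral-8x7B-Instruct-v0.1"   # original HF weights
F16_GGUF = "models/mixtral-8x7b-f16.gguf"      # intermediate full-precision GGUF
Q_GGUF = "models/mixtral-8x7b-Q4_K_M.gguf"     # final quantized GGUF

# 1. Regenerate the GGUF with the current converter so the new tensor
#    layout is written out.
subprocess.run(
    ["python", "convert.py", HF_DIR, "--outfile", F16_GGUF, "--outtype", "f16"],
    check=True,
)

# 2. Re-quantize from the fresh f16 GGUF.
subprocess.run(["./quantize", F16_GGUF, Q_GGUF, "Q4_K_M"], check=True)
```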
Various Mixtral 8x7b models that I have tried to load CPU-only on version 1.62 consume a great deal more RAM than on version 1.61. On 1.61 my system memory hovers around 26GB, but on 1.62 it skyrockets to 48GB+.
This is before any queries are made to the model, and the same settings are used in both versions.
Platform: Windows
Edit: This also happens with the "cuda" version when offloading layers. Similar issue: CPU memory continues to grow while the model is loading.
Version 1.62.2 (before the model ever loads, the system starts paging): [screenshot]
Version 1.61.2 (fully loaded): [screenshot]