Closed: psyv282j9d closed this 2 months ago
Hi @psyv282j9d!
Hi @psyv282j9d! This is certainly unexpected behavior. I just merged #353, which adds the verbose mode that, as you mentioned, was missing. You can enable it by setting `MISTRALRS_DEBUG=1`.
Aside from enabling the DEBUG filter like your diff does, it also writes a list of all tensor names and shapes to `mistralrs_gguf_tensors.txt`/`mistralrs_ggml_tensors.txt` when loading GGUF/GGML. Can you please `git pull`, re-run, and upload the contents of that file here? Thank you!
Thanks for the quick response!
Here it is: `mistralrs_gguf_tensors.txt`
Thanks for sending me that! It looks like your GGUF file has the experts fused into one qtensor (`ffn_*_exps`) instead of what we currently accept, where each expert is in a separate qtensor. #355 should enable that functionality; can you please try it out there?
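For readers unfamiliar with the two layouts, here is a minimal NumPy sketch of the difference. The tensor names follow llama.cpp's GGUF conventions, but the shapes are shrunk for illustration and the code is not taken from the actual loader:

```python
import numpy as np

# Illustrative sizes only; the model in this thread has
# n_expert=8, ffn=16384, embd=6144 (see the llama.* config below).
n_expert, ffn, embd = 8, 16, 6

# Fused layout: all experts stacked in one qtensor per projection,
# e.g. blk.0.ffn_gate_exps.weight with shape [n_expert, ffn, embd].
fused = np.arange(n_expert * ffn * embd, dtype=np.float32)
fused = fused.reshape(n_expert, ffn, embd)

# Split layout: one qtensor per expert (the previously accepted
# format); chunking the fused tensor along axis 0 recovers it.
split = {f"blk.0.ffn_gate.{i}.weight": fused[i] for i in range(n_expert)}

assert split["blk.0.ffn_gate.3.weight"].shape == (ffn, embd)
```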
Progress! Made it to OOM. Looks like Q8_0 won't fit in 128GB of RAM... Pulling Q6_0 now. Will confirm once it loads and runs.
```
$ MISTRALRS_DEBUG=1 ./target/release/mistralrs-server --serve-ip 127.0.0.1 -p 8888 gguf -t cognitivecomputations/dolphin-2.9-mixtral-8x22b -m cognitivecomputations/dolphin-2.9-mixtral-8x22b -f ~/models/dolphin-2.9-mixtral-8x22b.Q8_0.gguf
2024-05-28T12:14:16.790430Z INFO mistralrs_core::pipeline::gguf: Loading model `cognitivecomputations/dolphin-2.9-mixtral-8x22b` on Metal(MetalDevice(DeviceId(1)))...
2024-05-28T12:14:16.860447Z INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 7
general.name: .
general.quantization_version: 2
general.source.url: https://huggingface.co/cognitivecomputations/dolphin-2.9-mixtral-8x22b
general.url: https://huggingface.co/mradermacher/dolphin-2.9-mixtral-8x22b-GGUF
llama.attention.head_count: 48
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 56
llama.context_length: 65536
llama.embedding_length: 6144
llama.expert_count: 8
llama.expert_used_count: 2
llama.feed_forward_length: 16384
llama.rope.dimension_count: 128
llama.rope.freq_base: 1000000
llama.vocab_size: 32002
mradermacher.quantize_version: 2
mradermacher.quantized_at: 2024-05-03T03:00:02+02:00
mradermacher.quantized_by: mradermacher
mradermacher.quantized_on: backup1
mradermacher.vocab_type: spm
2024-05-28T12:14:16.860883Z INFO mistralrs_core::pipeline::gguf: Debug is enabled, wrote the names and information about each tensor to `mistralrs_gguf_tensors.txt`.
Killed: 9
```
@psyv282j9d, have you had success with this? Unfortunately, to support the format where the experts are stored in one tensor, we need to dequantize, chunk, and quantize again, which requires more memory during loading.
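To give a rough sense of why the dequantize-chunk-requantize step is memory-hungry, here is a back-of-envelope sketch. It only uses the public Q8_0 block layout (32 int8 weights plus one f16 scale per block); it is not a description of the loader's internals:

```python
# A Q8_0 block packs 32 weights as 32 int8 values plus one f16 scale.
bytes_per_weight_q8_0 = (32 * 1 + 2) / 32   # 1.0625 bytes/weight
bytes_per_weight_f32 = 4.0                  # dequantized working copy

# Dequantizing a fused expert tensor to f32 before chunking therefore
# needs roughly this many times its quantized size in extra RAM,
# while the quantized copy is still resident.
blowup = bytes_per_weight_f32 / bytes_per_weight_q8_0
print(f"~{blowup:.2f}x extra")  # ~3.76x extra
```

For the large `ffn_*_exps` tensors in an 8x22B model, that transient f32 copy alone can run to tens of gigabytes per layer group, which is consistent with the OOM behavior reported above.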
@EricLBuehler Not yet. My pipe is a bit shallow, and I had the safetensors already on hand, so I decided to try my hand at quantizing them myself. Then I went further down the rabbit hole and decided to generate an imatrix first. Sadly, I also decided to try out llama.cpp's new GGUF support for bf16... sigh. The format support works great, but some of the dequantization kernels aren't in place yet (CUDA and Metal).
Then, this morning, I see that dolphin-2.9.2-mixtral-8x22b dropped...
End result: I'm pulling 2.9.2, then converting to f32 (fixing the bf16 decision), then running imatrix, then quantizing. Hopefully, imatrix will take less than four days since I'll be able to use CUDA with f32. (Attempting it on Metal with 128GB of RAM caused a forced reboot on macOS 🤣)
Ok, sounds good. Please let me know when you have a chance to run the GGUF file! Have you tried our ISQ feature with your safetensors files?
@psyv282j9d have you had a chance to try it out? I just merged #355, which enables support for this format on `master`.
@EricLBuehler I've tried `Q8_0`, `Q4_K_S`, and `Q2_K` of dolphin-2.9.2-mixtral-8x22b. None of them crash with the original error.
However, I eventually kill every single one of them because they consume over 100GB of RAM and never open the TCP listener.
Here's my command:

```
./target/release/mistralrs-server -p 8080 gguf --tok-model-id cognitivecomputations/dolphin-2.9.2-mixtral-8x22b --quantized-model-id crusoeai/dolphin-2.9.2-mixtral-8x22b-GGUF --quantized-filename ~/models/dolphin-2.9.2-mixtral-8x22b/dolphin-2.9.2-mixtral-8x22b.Q2_K.crusoeai.gguf
```
I've also tried with `-n 27` to load 27 of the 56 layers... That also ate up RAM until I had to kill it.
IIRC, mistralrs might not be using mmap with GGUF files, but it seems to be doing so with safetensors.
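I don't know what the loader actually does internally, but for context, the distinction matters because mmap lets the OS fault pages in lazily and evict them under memory pressure, instead of copying the whole file into the process heap. A minimal stand-alone illustration (not mistral.rs code):

```python
import mmap
import os
import tempfile

# Write a small stand-in "weights" file, then map it read-only.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x2a" * 4096)
    path = f.name

with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Pages are faulted in on access rather than read up front,
    # so resident memory tracks what is actually touched.
    first = mapped[0]
    mapped.close()

os.unlink(path)
print(first)  # 42
```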
@psyv282j9d I think this is probably caused by the fact that we actually dequantize the experts to split them into the format we need, and then quantize them again. I just opened #434, which does a CUDA device synchronization after this to ensure that the copy is complete before we dequantize again, akin to #433. Can you please try that out?
@psyv282j9d closing this to avoid stale issues :). Please feel free to reopen for any reason.
I attempted to run `mistralrs-server` to serve my local copy of `dolphin-2.9-mixtral-8x22b.Q8_0.gguf`. This file isn't available on Hugging Face because it's broken into four parts here. Ideally, I'd like to serve from completely offline files, but that's not critical atm.
Built with:
And attempting to run all flavors of `gguf` resulted in:

Am I doing something wrong? Or is mixtral-8x22b not supported yet? (Seems unlikely for a project named mistral.rs ;-) )
Here's my diff against `HEAD`, since there's no verbose flag yet: