EricLBuehler / mistral.rs

Blazingly fast LLM inference.

dolphin-2.9-mixtral-8x22b.Q8_0.gguf "Error: cannot find tensor info for blk.0.ffn_gate.0.weight"? #352

Closed: psyv282j9d closed this issue 2 months ago

psyv282j9d commented 3 months ago

I attempted to run mistralrs-server to serve my local copy of dolphin-2.9-mixtral-8x22b.Q8_0.gguf. This single file isn't available on Hugging Face, because it's split into four parts there.

Ideally, I'd like to serve from completely offline files, but that's not critical at the moment.

$ git show --oneline
fc02ccebd8b4 (HEAD -> master, origin/master, origin/HEAD) Merge pull request #348 from EricLBuehler/expose_api

Built with

$ cargo build --release --features metal

Attempting to run any flavor of GGUF resulted in:

$ ./target/release/mistralrs-server --serve-ip 127.0.0.1 -p 8888 gguf -t cognitivecomputations/dolphin-2.9-mixtral-8x22b -m cognitivecomputations/dolphin-2.9-mixtral-8x22b -f ~/models/dolphin-2.9-mixtral-8x22b.Q8_0.gguf
2024-05-28T00:09:39.318968Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-05-28T00:09:39.319009Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-28T00:09:39.319029Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-28T00:09:39.319066Z  INFO hf_hub: Token file not found "/Users/psyv/.cache/huggingface/token"
2024-05-28T00:09:39.319080Z  INFO mistralrs_core::utils::tokens: Could not load token at "/Users/psyv/.cache/huggingface/token", using no HF token.
2024-05-28T00:09:39.319221Z  INFO hf_hub: Token file not found "/Users/psyv/.cache/huggingface/token"
2024-05-28T00:09:39.319228Z  INFO mistralrs_core::utils::tokens: Could not load token at "/Users/psyv/.cache/huggingface/token", using no HF token.
2024-05-28T00:09:39.321434Z DEBUG ureq::stream: connecting to huggingface.co:443 at 18.154.227.67:443
2024-05-28T00:09:39.341684Z DEBUG rustls::client::hs: No cached session for DnsName("huggingface.co")
2024-05-28T00:09:39.341758Z DEBUG rustls::client::hs: Not resuming any session
2024-05-28T00:09:39.365238Z DEBUG rustls::client::hs: Using ciphersuite TLS13_AES_128_GCM_SHA256
2024-05-28T00:09:39.365255Z DEBUG rustls::client::tls13: Not resuming
2024-05-28T00:09:39.365339Z DEBUG rustls::client::tls13: TLS1.3 encrypted extensions: [ServerNameAck]
2024-05-28T00:09:39.365345Z DEBUG rustls::client::hs: ALPN protocol is None
2024-05-28T00:09:39.365548Z DEBUG ureq::stream: created stream: Stream(RustlsStream)
2024-05-28T00:09:39.365553Z DEBUG ureq::unit: sending request GET https://huggingface.co/api/models/cognitivecomputations/dolphin-2.9-mixtral-8x22b/revision/main
2024-05-28T00:09:39.365559Z DEBUG ureq::unit: writing prelude: GET /api/models/cognitivecomputations/dolphin-2.9-mixtral-8x22b/revision/main HTTP/1.1
Host: huggingface.co
Accept: */*
User-Agent: unkown/None; hf-hub/0.3.2; rust/unknown
accept-encoding: gzip
2024-05-28T00:09:39.408596Z DEBUG ureq::response: Body entirely buffered (length: 6027)
2024-05-28T00:09:39.408620Z DEBUG ureq::pool: adding stream to pool: https|huggingface.co|443 -> Stream(RustlsStream)
2024-05-28T00:09:39.408627Z DEBUG ureq::unit: response 200 to GET https://huggingface.co/api/models/cognitivecomputations/dolphin-2.9-mixtral-8x22b/revision/main
2024-05-28T00:09:39.408840Z DEBUG ureq::stream: dropping stream: Stream(RustlsStream)
2024-05-28T00:09:39.408861Z  INFO mistralrs_core::pipeline::gguf: Loading model `cognitivecomputations/dolphin-2.9-mixtral-8x22b` on Metal(MetalDevice(DeviceId(1)))...
2024-05-28T00:09:39.472560Z  INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 7
general.name: .
general.quantization_version: 2
general.source.url: https://huggingface.co/cognitivecomputations/dolphin-2.9-mixtral-8x22b
general.url: https://huggingface.co/mradermacher/dolphin-2.9-mixtral-8x22b-GGUF
llama.attention.head_count: 48
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 56
llama.context_length: 65536
llama.embedding_length: 6144
llama.expert_count: 8
llama.expert_used_count: 2
llama.feed_forward_length: 16384
llama.rope.dimension_count: 128
llama.rope.freq_base: 1000000
llama.vocab_size: 32002
mradermacher.quantize_version: 2
mradermacher.quantized_at: 2024-05-03T03:00:02+02:00
mradermacher.quantized_by: mradermacher
mradermacher.quantized_on: backup1
mradermacher.vocab_type: spm
Error: cannot find tensor info for blk.0.ffn_gate.0.weight

Am I doing something wrong, or is Mixtral-8x22b not supported yet? (That seems unlikely for a project named mistral.rs ;-) )

Here's my diff against HEAD, since there's no verbose flag yet:

diff --git a/mistralrs-server/src/main.rs b/mistralrs-server/src/main.rs
index 361a556b53a4..a0e81a14daba 100644
--- a/mistralrs-server/src/main.rs
+++ b/mistralrs-server/src/main.rs
@@ -254,7 +254,7 @@ async fn main() -> Result<()> {
     let device = Device::cuda_if_available(0)?;

     let filter = EnvFilter::builder()
-        .with_default_directive(LevelFilter::INFO.into())
+        .with_default_directive(LevelFilter::DEBUG.into())
         .from_env_lossy();
     tracing_subscriber::fmt().with_env_filter(filter).init();

EricLBuehler commented 3 months ago

Hi @psyv282j9d!

This is certainly unexpected behavior. I just merged #353, which adds the verbose mode that, as you mentioned, was missing. You can enable it by setting MISTRALRS_DEBUG=1 (see here).

Besides enabling the DEBUG log filter like your diff does, it also writes the name and shape of every tensor to mistralrs_gguf_tensors.txt (or mistralrs_ggml_tensors.txt) when loading a GGUF (or GGML) model. Can you please git pull, re-run, and upload the contents of that file here? Thank you!
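
For context, the dump is conceptually just an env-var-gated write of the tensor table. A minimal sketch, with hypothetical types rather than the actual mistral.rs internals:

use std::fs::File;
use std::io::{BufWriter, Write};

// Hypothetical tensor metadata as parsed from a GGUF header.
struct TensorInfo {
    name: String,
    shape: Vec<usize>,
    dtype: String,
}

// Write one line per tensor, but only when MISTRALRS_DEBUG=1 is set.
fn maybe_dump_tensor_infos(tensors: &[TensorInfo]) -> std::io::Result<()> {
    if std::env::var("MISTRALRS_DEBUG").ok().as_deref() != Some("1") {
        return Ok(());
    }
    let mut out = BufWriter::new(File::create("mistralrs_gguf_tensors.txt")?);
    for t in tensors {
        writeln!(out, "{} {:?} {}", t.name, t.shape, t.dtype)?;
    }
    Ok(())
}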

psyv282j9d commented 3 months ago

Thanks for the quick response!

Here it is: mistralrs_gguf_tensors.txt

EricLBuehler commented 3 months ago

Thanks for sending me that! It looks like your GGUF file stores all the experts in a single qtensor (ffn_*_exps), whereas we currently only accept the layout where each expert is a separate qtensor. #355 should enable that functionality; can you please try it out over there?
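
To illustrate the two layouts, here is a hypothetical lookup sketch (the tensor names match the GGUF dump, but this is not the actual mistral.rs loader):

use std::collections::HashMap;

// Hypothetical store: tensor name -> raw quantized bytes.
type TensorMap = HashMap<String, Vec<u8>>;

// Returns the gate weight for (layer, expert) plus a flag saying whether it
// came from the stacked layout and still needs to be sliced apart.
fn expert_gate_tensor<'a>(
    tensors: &'a TensorMap,
    layer: usize,
    expert: usize,
) -> Option<(&'a [u8], bool)> {
    // Layout 1: one qtensor per expert (what was accepted before #355).
    let per_expert = format!("blk.{layer}.ffn_gate.{expert}.weight");
    if let Some(t) = tensors.get(&per_expert) {
        return Some((t.as_slice(), false));
    }
    // Layout 2: all experts stacked in a single qtensor (this GGUF file).
    let stacked = format!("blk.{layer}.ffn_gate_exps.weight");
    tensors.get(&stacked).map(|t| (t.as_slice(), true))
}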

psyv282j9d commented 3 months ago

Progress! Made it to an OOM. Looks like Q8_0 won't fit in 128 GB of RAM... Pulling Q6_0 now. Will confirm once it loads and runs.

$ MISTRALRS_DEBUG=1 ./target/release/mistralrs-server --serve-ip 127.0.0.1 -p 8888 gguf -t cognitivecomputations/dolphin-2.9-mixtral-8x22b -m cognitivecomputations/dolphin-2.9-mixtral-8x22b -f ~/models/dolphin-2.9-mixtral-8x22b.Q8_0.gguf
2024-05-28T12:14:16.790430Z  INFO mistralrs_core::pipeline::gguf: Loading model `cognitivecomputations/dolphin-2.9-mixtral-8x22b` on Metal(MetalDevice(DeviceId(1)))...
2024-05-28T12:14:16.860447Z  INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 7
general.name: .
general.quantization_version: 2
general.source.url: https://huggingface.co/cognitivecomputations/dolphin-2.9-mixtral-8x22b
general.url: https://huggingface.co/mradermacher/dolphin-2.9-mixtral-8x22b-GGUF
llama.attention.head_count: 48
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 56
llama.context_length: 65536
llama.embedding_length: 6144
llama.expert_count: 8
llama.expert_used_count: 2
llama.feed_forward_length: 16384
llama.rope.dimension_count: 128
llama.rope.freq_base: 1000000
llama.vocab_size: 32002
mradermacher.quantize_version: 2
mradermacher.quantized_at: 2024-05-03T03:00:02+02:00
mradermacher.quantized_by: mradermacher
mradermacher.quantized_on: backup1
mradermacher.vocab_type: spm
2024-05-28T12:14:16.860883Z  INFO mistralrs_core::pipeline::gguf: Debug is enabled, wrote the names and information about each tensor to `mistralrs_gguf_tensors.txt`.
Killed: 9

EricLBuehler commented 3 months ago

@psyv282j9d, have you had success with this? Unfortunately, to support the format where the experts are stored in one tensor, we need to dequantize, chunk, and re-quantize, which requires more memory during loading.
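
Roughly, the loading pass looks like the sketch below (hypothetical helper functions, not the real candle/mistral.rs code); the temporary f32 buffer is what drives the peak memory above the size of the GGUF file itself:

// Hypothetical stand-ins for the real (de)quantization routines.
fn dequantize_to_f32(stacked: &[u8]) -> Vec<f32> {
    // Stand-in: the real code expands quantized blocks into f32 values.
    stacked.iter().map(|&b| b as f32).collect()
}

fn quantize_q8_0(chunk: &[f32]) -> Vec<u8> {
    // Stand-in: the real code quantizes one expert's weights back to Q8_0.
    chunk.iter().map(|&v| v as u8).collect()
}

// Split a stacked ffn_*_exps qtensor into one qtensor per expert. The full
// f32 copy is several times larger than the quantized data, hence the extra
// memory during loading.
fn split_experts(stacked: &[u8], num_experts: usize) -> Vec<Vec<u8>> {
    let full = dequantize_to_f32(stacked);
    let chunk_len = full.len() / num_experts;
    full.chunks(chunk_len).map(quantize_q8_0).collect()
}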

psyv282j9d commented 3 months ago

@EricLBuehler Not yet. My internet pipe is a bit shallow, and I already had the safetensors on hand, so I decided to try my hand at quantizing them myself. Then I went further down the rabbit hole and decided to generate an imatrix first. Sadly, I also decided to try out llama.cpp's new GGUF support for bf16... sigh. The format support works great, but some of the dequantization kernels aren't in place yet (CUDA and Metal).

Then, this morning, I see that dolphin-2.9.2-mixtral-8x22b dropped...

End result: I'm pulling 2.9.2, then converting to f32 (fixing the bf16 decision), then generating an imatrix, then quantizing. Hopefully the imatrix run will take less than four days, since I'll be able to use CUDA with f32. (Attempting it on Metal with 128 GB of RAM caused a forced reboot on macOS 🤣)

EricLBuehler commented 3 months ago

Ok, sounds good. Please let me know when you have a chance to run the GGUF file! Have you tried our ISQ feature with your safetensors files?
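
In case it helps, an ISQ run over the safetensors weights is invoked roughly like this; the exact flags are an assumption from memory, so please check --help on your build:

$ ./target/release/mistralrs-server -p 8080 --isq Q4K plain -m cognitivecomputations/dolphin-2.9.2-mixtral-8x22b -a mixtral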

EricLBuehler commented 3 months ago

@psyv282j9d have you had a chance to try it out? I just merged #355, which enables support for this format, into master.

psyv282j9d commented 3 months ago

@EricLBuehler I've tried Q8_0, Q4_K_S, and Q2_K of dolphin-2.9.2-mixtral-8x22b. None of them hits the original error anymore.

However, I eventually had to kill every single one of them, because they consumed over 100 GB of RAM and never opened the TCP listener.

Here's my command:

./target/release/mistralrs-server -p 8080 gguf --tok-model-id cognitivecomputations/dolphin-2.9.2-mixtral-8x22b --quantized-model-id crusoeai/dolphin-2.9.2-mixtral-8x22b-GGUF --quantized-filename ~/models/dolphin-2.9.2-mixtral-8x22b/dolphin-2.9.2-mixtral-8x22b.Q2_K.crusoeai.gguf

I've also tried with -n 27 to load 27 of the 56 layers... That also ate up RAM until I had to kill it.

IIRC, mistral.rs might not be using mmap for GGUF files, but it seems to be doing so for safetensors.
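
For reference, this is the kind of mmap-based read I have in mind; just a sketch using the memmap2 crate, not a claim about what mistral.rs currently does:

use memmap2::Mmap;
use std::fs::File;

// Map the GGUF file instead of reading it all into memory up front. The
// kernel pages data in lazily, so resident memory stays well below the
// file size until tensors are actually touched.
fn open_gguf_mapped(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // SAFETY: the file must not be truncated or modified while mapped.
    unsafe { Mmap::map(&file) }
}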

EricLBuehler commented 3 months ago

@psyv282j9d I think this is probably because we dequantize the experts to split them into the format we need, and then they are quantized again. I just opened #434, which does a CUDA device synchronization after this step to ensure the copy is complete before we dequantize again, akin to #433. Can you please try that out?
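
Conceptually, the ordering #434 enforces looks like the sketch below; every type and method here is a hypothetical stand-in (the real change lives in the PR):

// Hypothetical stand-ins; only the call ordering matters: copy the expert
// chunk, synchronize the device, then dequantize the result.
struct Dev;
struct QChunk(Vec<u8>);

impl Dev {
    fn copy_expert_chunk(&self, src: &[u8]) -> QChunk {
        QChunk(src.to_vec()) // stands in for an asynchronous device copy
    }
    fn synchronize(&self) {
        // stands in for a CUDA device synchronization
    }
}

impl QChunk {
    fn dequantize(&self) -> Vec<f32> {
        self.0.iter().map(|&b| b as f32).collect() // stand-in
    }
}

fn repack_expert(dev: &Dev, stacked: &[u8]) -> Vec<f32> {
    let chunk = dev.copy_expert_chunk(stacked);
    dev.synchronize(); // wait for the copy to land before reading it back
    chunk.dequantize()
}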

EricLBuehler commented 2 months ago

@psyv282j9d closing this to avoid stale issues :). Please feel free to reopen for any reason.