Mistral-Nemo-Instruct-2407 Q8_0 GGUF: Model failed with error: ShapeMismatchBinaryOp { lhs: [1, 26, 4096], rhs: [26, 32, 160], op: "reshape" }

EricLBuehler / mistral.rs

Blazingly fast LLM inference.

MIT License

3.63k stars 260 forks source link

Describe the bug

Hello, I found in PR #595 that Mistral Nemo Instruct 2407 was supported. It is working really well (using ISQ on HF safetensors).

Are GGUF supported too?

Using the Q8_0 from https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF/tree/main:

Server:

./target/release/./mistralrs-server --port 1234 --throughput gguf --quantized-model-id $D/models/Mistral-Nemo-Instruct-2407 --quantized-filename Mistral-Nemo-Instruct-2407.Q8_0.gguf
2024-07-28T18:13:38.820102Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-07-28T18:13:38.820123Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-07-28T18:13:38.820136Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-07-28T18:13:38.820259Z  INFO mistralrs_core::pipeline::paths: Loading `Mistral-Nemo-Instruct-2407.Q8_0.gguf` locally at `$D/models/Mistral-Nemo-Instruct-2407/Mistral-Nemo-Instruct-2407.Q8_0.gguf`
2024-07-28T18:13:38.820396Z  INFO mistralrs_core::pipeline::gguf: Loading `generation_config.json` at `$D/models/Mistral-Nemo-Instruct-2407`
2024-07-28T18:13:38.820404Z  INFO mistralrs_core::pipeline::gguf: Loading `generation_config.json` locally at `$D/models/Mistral-Nemo-Instruct-2407/generation_config.json`
2024-07-28T18:13:38.820445Z  INFO mistralrs_core::pipeline::gguf: Loading model `$D/models/Mistral-Nemo-Instruct-2407` on cpu.
2024-07-28T18:13:39.392118Z  INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.basename: models-mistralai-Mistral-Nemo
general.file_type: 7
general.finetune: Instruct
general.languages: en, fr, de, es, it, pt, ru, zh, ja
general.license: apache-2.0
general.name: Models Mistralai Mistral Nemo Instruct 2407
general.quantization_version: 2
general.size_label: 12B
general.type: model
general.version: 2407
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.key_length: 128
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.attention.value_length: 128
llama.block_count: 40
llama.context_length: 1024000
llama.embedding_length: 5120
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 1000000
llama.vocab_size: 131072
quantize.imatrix.chunks_count: 73
quantize.imatrix.dataset: group_40.txt
quantize.imatrix.entries_count: 280
quantize.imatrix.file: ./Mistral-Nemo-Instruct-2407-GGUF_imatrix.dat
2024-07-28T18:13:39.393241Z  INFO mistralrs_core::pipeline::gguf: Debug is enabled, wrote the names and information about each tensor to `mistralrs_gguf_tensors.txt`.
2024-07-28T18:13:39.698852Z  INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is `gpt2`, kind: `Bpe`, num tokens: 131072, num added tokens: 0, num merges: 269443, num scores: 0
2024-07-28T18:13:39.698873Z  INFO mistralrs_core::gguf::gguf_tokenizer: Tokenizer: Tokenizer(TokenizerImpl { normalizer: None, pre_tokenizer: Some(ByteLevel(ByteLevel { add_prefix_space: false, trim_offsets: true, use_regex: true })), model: BPE(BPE { dropout: None, unk_token: Some("<unk>"), continuing_subword_prefix: None, end_of_word_suffix: None, fuse_unk: false, byte_fallback: false, vocab: 131072, merges: 269443, ignore_merges: false }), post_processor: Some(Template(TemplateProcessing { single: Template([SpecialToken { id: "<s>", type_id: 0 }, Sequence { id: A, type_id: 0 }]), pair: Template([SpecialToken { id: "<s>", type_id: 0 }, Sequence { id: A, type_id: 0 }, Sequence { id: B, type_id: 1 }]), added_single: 1, added_pair: 1, special_tokens: Tokens({"<s>": SpecialToken { id: "<s>", ids: [1], tokens: ["<s>"] }}) })), decoder: Some(ByteLevel(ByteLevel { add_prefix_space: true, trim_offsets: true, use_regex: true })), added_vocabulary: AddedVocabulary { added_tokens_map: {"</s>": 2, "<unk>": 0, "<s>": 1}, added_tokens_map_r: {2: AddedToken { content: "</s>", single_word: false, lstrip: false, rstrip: false, normalized: false, special: true }, 1: AddedToken { content: "<s>", single_word: false, lstrip: false, rstrip: false, normalized: false, special: true }, 0: AddedToken { content: "<unk>", single_word: false, lstrip: false, rstrip: false, normalized: false, special: true }}, added_tokens: [], special_tokens: [AddedToken { content: "<s>", single_word: false, lstrip: false, rstrip: false, normalized: false, special: true }, AddedToken { content: "</s>", single_word: false, lstrip: false, rstrip: false, normalized: false, special: true }, AddedToken { content: "<unk>", single_word: false, lstrip: false, rstrip: false, normalized: false, special: true }], special_tokens_set: {"<unk>", "</s>", "<s>"}, split_trie: (AhoCorasick(dfa::DFA(
D 000000: \x00-\x0E => 0
F 000016:
* 000032: \x00-\x0E => 0
 matches: 1
* 000048: \x00-\x0E => 0
 matches: 2
* 000064: \x00-\x0E => 0
 matches: 0
 >000080: \x00-\x02 => 80, \x03 => 208, \x04-\x0E => 80
  000096: \x00-\x02 => 0, \x03 => 208, \x04-\x0E => 0
  000112: \x00-\x02 => 80, \x03 => 208, \x04-\n => 80, \x0B => 128, \x0C-\x0E => 80
  000128: \x00-\x02 => 80, \x03 => 208, \x04 => 80, \x05 => 32, \x06-\x0E => 80
  000144: \x00-\x02 => 80, \x03 => 208, \x04 => 80, \x05 => 64, \x06-\x0E => 80
  000160: \x00-\x02 => 80, \x03 => 208, \x04-\x08 => 80, \t => 176, \n-\x0E => 80
  000176: \x00-\x02 => 80, \x03 => 208, \x04-\x06 => 80, \x07 => 192, \x08-\x0E => 80
  000192: \x00-\x02 => 80, \x03 => 208, \x04 => 80, \x05 => 48, \x06-\x0E => 80
  000208: \x00 => 80, \x01 => 112, \x02 => 80, \x03 => 208, \x04-\n => 80, \x0B => 144, \x0C => 80, \r => 160, \x0E => 80
match kind: LeftmostLongest
prefilter: true
state length: 14
pattern length: 3
shortest pattern length: 3
longest pattern length: 5
alphabet length: 15
stride: 16
byte classes: ByteClasses(0 => [0-46], 1 => [47], 2 => [48-59], 3 => [60], 4 => [61], 5 => [62], 6 => [63-106], 7 => [107], 8 => [108-109], 9 => [110], 10 => [111-114], 11 => [115], 12 => [116], 13 => [117], 14 => [118-255])
memory usage: 992
)
), [1, 2, 0]), split_normalized_trie: (AhoCorasick(dfa::DFA(
D 000000: \x00 => 0
F 000001:
 >000002: \x00 => 2
  000003: \x00 => 0
match kind: LeftmostLongest
prefilter: false
state length: 4
pattern length: 0
shortest pattern length: 18446744073709551615
longest pattern length: 0
alphabet length: 1
stride: 1
byte classes: ByteClasses(0 => [0-255])
memory usage: 16
)
), []), encode_special_tokens: false }, truncation: None, padding: None })
2024-07-28T18:13:39.718452Z  INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: `{%- if messages[0]['role'] == 'system' %}\n    {%- set system_message = messages[0]['content'] %}\n    {%- set loop_messages = messages[1:] %}\n{%- else %}\n    {%- set loop_messages = messages %}\n{%- endif %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n    {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}\n        {{- raise_exception('After the optional system message, conversation roles must alternate user/assistant/user/assistant/...') }}\n    {%- endif %}\n    {%- if message['role'] == 'user' %}\n        {%- if loop.last and system_message is defined %}\n            {{- '[INST] ' + system_message + '\n\n' + message['content'] + '[/INST]' }}\n        {%- else %}\n            {{- '[INST] ' + message['content'] + '[/INST]' }}\n        {%- endif %}\n    {%- elif message['role'] == 'assistant' %}\n        {{- ' ' + message['content'] + eos_token}}\n    {%- else %}\n        {{- raise_exception('Only user and assistant roles are supported, with the exception of an initial optional system message!') }}\n    {%- endif %}\n{%- endfor %}\n`
2024-07-28T18:14:42.854153Z  INFO mistralrs_core::pipeline::paths: Using literal chat template.
2024-07-28T18:14:43.281455Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<s>", eos_toks = "</s>", unk_tok = <unk>
2024-07-28T18:14:43.300792Z  INFO mistralrs_server: Model loaded.
2024-07-28T18:14:43.302137Z  INFO mistralrs_server: Serving on http://0.0.0.0:1234.
2024-07-28T18:17:36.124399Z ERROR mistralrs_core::engine: prompt step - Model failed with error: ShapeMismatchBinaryOp { lhs: [1, 26, 4096], rhs: [26, 32, 160], op: "reshape" }

Client:

python3 examples/server/chat.py
Enter system prompt >>>                            
>>> Is 22.20 greater than 22.6?
Traceback (most recent call last):
  File "mistral.rs/examples/server/chat.py", line 47, in <module>
    completion = openai.chat.completions.create(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "mistral.rs/venv/lib/python3.12/site-packages/openai/_utils/_utils.py", line 277, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "mistral.rs/venv/lib/python3.12/site-packages/openai/resources/chat/completions.py", line 643, in create
    return self._post(
           ^^^^^^^^^^^
  File "mistral.rs/venv/lib/python3.12/site-packages/openai/_base_client.py", line 1266, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "mistral.rs/venv/lib/python3.12/site-packages/openai/_base_client.py", line 942, in request
    return self._request(
           ^^^^^^^^^^^^^^
  File "mistral.rs/venv/lib/python3.12/site-packages/openai/_base_client.py", line 1031, in _request
    return self._retry_request(
           ^^^^^^^^^^^^^^^^^^^^
  File "mistral.rs/venv/lib/python3.12/site-packages/openai/_base_client.py", line 1079, in _retry_request
    return self._request(
           ^^^^^^^^^^^^^^
  File "mistral.rs/venv/lib/python3.12/site-packages/openai/_base_client.py", line 1031, in _request
    return self._retry_request(
           ^^^^^^^^^^^^^^^^^^^^
  File "mistral.rs/venv/lib/python3.12/site-packages/openai/_base_client.py", line 1079, in _retry_request
    return self._request(
           ^^^^^^^^^^^^^^
  File "mistral.rs/venv/lib/python3.12/site-packages/openai/_base_client.py", line 1046, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.InternalServerError: Error code: 500 - {'message': 'shape mismatch in reshape, lhs: [1, 26, 4096], rhs: [26, 32, 160]', 'partial_response': {'id': '2', 'choices': [{'finish_reason': 'error', 'index': 0, 'message': {'content': '', 'role': 'assistant'}, 'logprobs': None}], 'created': 1722190658, 'model': '$D/models/Mistral-Nemo-Instruct-2407', 'system_fingerprint': 'local', 'object': 'chat.completion', 'usage': {'completion_tokens': 0, 'prompt_tokens': 26, 'total_tokens': 26, 'avg_tok_per_sec': 1733.3334, 'avg_prompt_tok_per_sec': None, 'avg_compl_tok_per_sec': None, 'total_time_sec': 0.015, 'total_prompt_time_sec': 0.0, 'total_completion_time_sec': 0.0}}}

Latest commit or version

Latest commit 38fb9423cb30a996dbc991a2294ace97443008ae

RUST_BACKTRACE=1 ./target/release/./mistralrs-server -i gguf --quantized-model-id $D/models/Mistral-Nemo-Instruct-2407-GGUF --quantized-filename Mistral-Nemo-Instruct-2407-Q4_K_L.gguf 2024-08-02T02:19:34.960621Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true 2024-08-02T02:19:34.960657Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial 2024-08-02T02:19:34.960674Z INFO mistralrs_server: Model kind is: quantized from gguf (no adapters) 2024-08-02T02:19:34.960804Z INFO mistralrs_core::pipeline::paths: Loading `Mistral-Nemo-Instruct-2407-Q4_K_L.gguf` locally at `$D/models/Mistral-Nemo-Instruct-2407-GGUF/Mistral-Nemo-Instruct-2407-Q4_K_L.gguf` 2024-08-02T02:19:34.960895Z INFO mistralrs_core::pipeline::gguf: Loading model `$D/models/Mistral-Nemo-Instruct-2407-GGUF` on cpu. 2024-08-02T02:19:35.540106Z INFO mistralrs_core::pipeline::gguf: Model config: general.architecture: llama general.basename: Mistral-Nemo general.file_type: 15 general.finetune: Instruct general.languages: en, fr, de, es, it, pt, ru, zh, ja general.license: apache-2.0 general.name: Mistral Nemo Instruct 2407 general.quantization_version: 2 general.size_label: 12B general.type: model general.version: 2407 llama.attention.head_count: 32 llama.attention.head_count_kv: 8 llama.attention.key_length: 128 llama.attention.layer_norm_rms_epsilon: 0.00001 llama.attention.value_length: 128 llama.block_count: 40 llama.context_length: 1024000 llama.embedding_length: 5120 llama.feed_forward_length: 14336 llama.rope.dimension_count: 128 llama.rope.freq_base: 1000000 llama.vocab_size: 131072 quantize.imatrix.chunks_count: 128 quantize.imatrix.dataset: /training_dir/calibration_datav3.txt quantize.imatrix.entries_count: 280 quantize.imatrix.file: /models_out/Mistral-Nemo-Instruct-2407-GGUF/Mistral-Nemo-Instruct-2407.imatrix 2024-08-02T02:19:35.540602Z INFO mistralrs_core::pipeline::gguf: Debug is enabled, wrote the names and information about each tensor to `mistralrs_gguf_tensors.txt`. 2024-08-02T02:19:35.874385Z INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is `gpt2`, kind: `Bpe`, num tokens: 131072, num added tokens: 0, num merges: 269443, num scores: 0 2024-08-02T02:19:35.874402Z INFO mistralrs_core::gguf::gguf_tokenizer: Tokenizer: Tokenizer(TokenizerImpl { normalizer: None, pre_tokenizer: Some(ByteLevel(ByteLevel { add_prefix_space: false, trim_offsets: true, use_regex: true })), model: BPE(BPE { dropout: None, unk_token: Some("<unk>"), continuing_subword_prefix: None, end_of_word_suffix: None, fuse_unk: false, byte_fallback: false, vocab: 131072, merges: 269443, ignore_merges: false }), post_processor: Some(Template(TemplateProcessing { single: Template([SpecialToken { id: "<s>", type_id: 0 }, Sequence { id: A, type_id: 0 }]), pair: Template([SpecialToken { id: "<s>", type_id: 0 }, Sequence { id: A, type_id: 0 }, Sequence { id: B, type_id: 1 }]), added_single: 1, added_pair: 1, special_tokens: Tokens({"<s>": SpecialToken { id: "<s>", ids: [1], tokens: ["<s>"] }}) })), decoder: Some(ByteLevel(ByteLevel { add_prefix_space: true, trim_offsets: true, use_regex: true })), added_vocabulary: AddedVocabulary { added_tokens_map: {"<s>": 1, "</s>": 2, "<unk>": 0}, added_tokens_map_r: {0: AddedToken { content: "<unk>", single_word: false, lstrip: false, rstrip: false, normalized: false, special: true }, 2: AddedToken { content: "</s>", single_word: false, lstrip: false, rstrip: false, normalized: false, special: true }, 1: AddedToken { content: "<s>", single_word: false, lstrip: false, rstrip: false, normalized: false, special: true }}, added_tokens: [], special_tokens: [AddedToken { content: "<s>", single_word: false, lstrip: false, rstrip: false, normalized: false, special: true }, AddedToken { content: "</s>", single_word: false, lstrip: false, rstrip: false, normalized: false, special: true }, AddedToken { content: "<unk>", single_word: false, lstrip: false, rstrip: false, normalized: false, special: true }], special_tokens_set: {"</s>", "<s>", "<unk>"}, split_trie: (AhoCorasick(dfa::DFA( D 000000: \x00-\x0E => 0 F 000016: * 000032: \x00-\x0E => 0 matches: 1 * 000048: \x00-\x0E => 0 matches: 2 * 000064: \x00-\x0E => 0 matches: 0 >000080: \x00-\x02 => 80, \x03 => 208, \x04-\x0E => 80 000096: \x00-\x02 => 0, \x03 => 208, \x04-\x0E => 0 000112: \x00-\x02 => 80, \x03 => 208, \x04-\n => 80, \x0B => 128, \x0C-\x0E => 80 000128: \x00-\x02 => 80, \x03 => 208, \x04 => 80, \x05 => 32, \x06-\x0E => 80 000144: \x00-\x02 => 80, \x03 => 208, \x04 => 80, \x05 => 64, \x06-\x0E => 80 000160: \x00-\x02 => 80, \x03 => 208, \x04-\x08 => 80, \t => 176, \n-\x0E => 80 000176: \x00-\x02 => 80, \x03 => 208, \x04-\x06 => 80, \x07 => 192, \x08-\x0E => 80 000192: \x00-\x02 => 80, \x03 => 208, \x04 => 80, \x05 => 48, \x06-\x0E => 80 000208: \x00 => 80, \x01 => 112, \x02 => 80, \x03 => 208, \x04-\n => 80, \x0B => 144, \x0C => 80, \r => 160, \x0E => 80 match kind: LeftmostLongest prefilter: true state length: 14 pattern length: 3 shortest pattern length: 3 longest pattern length: 5 alphabet length: 15 stride: 16 byte classes: ByteClasses(0 => [0-46], 1 => [47], 2 => [48-59], 3 => [60], 4 => [61], 5 => [62], 6 => [63-106], 7 => [107], 8 => [108-109], 9 => [110], 10 => [111-114], 11 => [115], 12 => [116], 13 => [117], 14 => [118-255]) memory usage: 992 ) ), [1, 2, 0]), split_normalized_trie: (AhoCorasick(dfa::DFA( D 000000: \x00 => 0 F 000001: >000002: \x00 => 2 000003: \x00 => 0 match kind: LeftmostLongest prefilter: false state length: 4 pattern length: 0 shortest pattern length: 18446744073709551615 longest pattern length: 0 alphabet length: 1 stride: 1 byte classes: ByteClasses(0 => [0-255]) memory usage: 16 ) ), []), encode_special_tokens: false }, truncation: None, padding: None }) 2024-08-02T02:19:35.897140Z INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: `{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content'] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}\n {{- raise_exception('After the optional system message, conversation roles must alternate user/assistant/user/assistant/...') }}\n {%- endif %}\n {%- if message['role'] == 'user' %}\n {%- if loop.last and system_message is defined %}\n {{- '[INST] ' + system_message + '\n\n' + message['content'] + '[/INST]' }}\n {%- else %}\n {{- '[INST] ' + message['content'] + '[/INST]' }}\n {%- endif %}\n {%- elif message['role'] == 'assistant' %}\n {{- ' ' + message['content'] + eos_token}}\n {%- else %}\n {{- raise_exception('Only user and assistant roles are supported, with the exception of an initial optional system message!') }}\n {%- endif %}\n{%- endfor %}\n` 2024-08-02T02:20:29.804507Z INFO mistralrs_core::pipeline::paths: Using literal chat template. 2024-08-02T02:20:30.167965Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<s>", eos_toks = "</s>", unk_tok = <unk> 2024-08-02T02:20:30.194591Z INFO mistralrs_server: Model loaded. 2024-08-02T02:20:30.194725Z INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1 } > Dags. Do you like dags? 2024-08-02T02:20:44.803643Z ERROR mistralrs_core::engine: prompt step - Model failed with error: WithBacktrace { inner: ShapeMismatchBinaryOp { lhs: [1, 71, 32, 128], rhs: [1, 71, 5120], op: "reshape" }, backtrace: Backtrace [{ fn: "candle_core::error::Error::bt" }, { fn: "candle_core::tensor::Tensor::reshape" }, { fn: "mistralrs_core::models::quantized_llama::ModelWeights::forward" }, { fn: "<mistralrs_core::pipeline::gguf::GGUFPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs" }, { fn: "mistralrs_core::pipeline::Pipeline::step::{{closure}}" }, { fn: "mistralrs_core::engine::Engine::run::{{closure}}" }, { fn: "std::sys_common::backtrace::__rust_begin_short_backtrace" }, { fn: "core::ops::function::FnOnce::call_once{{vtable.shim}}" }, { fn: "std::sys::pal::unix::thread::Thread::new::thread_start" }, { fn: "start_thread" }, { fn: "__GI___clone3" }] } 2024-08-02T02:20:44.804331Z ERROR mistralrs_server::interactive_mode: Got a model error: "shape mismatch in reshape, lhs: [1, 71, 32, 128], rhs: [1, 71, 5120] 0: candle_core::error::Error::bt 1: candle_core::tensor::Tensor::reshape 2: mistralrs_core::models::quantized_llama::ModelWeights::forward 3: <mistralrs_core::pipeline::gguf::GGUFPipeline as mistralrs_core::pipeline::Pipeline>::forward_inputs 4: mistralrs_core::pipeline::Pipeline::step::{{closure}} 5: mistralrs_core::engine::Engine::run::{{closure}} 6: std::sys_common::backtrace::__rust_begin_short_backtrace 7: core::ops::function::FnOnce::call_once{{vtable.shim}} 8: std::sys::pal::unix::thread::Thread::new::thread_start 9: start_thread 10: __GI___clone3", response: ChatCompletionResponse { id: "0", choices: [Choice { finish_reason: "error", index: 0, message: ResponseMessage { content: Some(""), role: "assistant", tool_calls: [] }, logprobs: None }], created: 1722565244, model: "$D/models/Mistral-Nemo-Instruct-2407-GGUF", system_fingerprint: "local", object: "chat.completion", usage: Usage { completion_tokens: 0, prompt_tokens: 71, total_tokens: 71, avg_tok_per_sec: 1203.3899, avg_prompt_tok_per_sec: inf, avg_compl_tok_per_sec: NaN, total_time_sec: 0.059, total_prompt_time_sec: 0.0, total_completion_time_sec: 0.0 } }

EricLBuehler / mistral.rs

Mistral-Nemo-Instruct-2407 Q8_0 GGUF: Model failed with error: ShapeMismatchBinaryOp { lhs: [1, 26, 4096], rhs: [26, 32, 160], op: "reshape" } #643

Describe the bug

Latest commit or version