Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Bug: unknown pre-tokenizer type: 'mistral-bpe' when running the new Mistral-Nemo model #493

Open wingenlit opened 4 months ago

wingenlit commented 4 months ago

Contact Details

No response

What happened?

Hi there, I have just attempted to run the new Mistral-Nemo with llamafile, using a gguf file quantized with llama.cpp b3405. It failed with the error unknown pre-tokenizer type: 'mistral-bpe' (logs shown below). Is there a replacement string I can pass with --override-kv tokenizer.ggml.pre=str:{some_tokenizer_type_here}, or should I just wait for a future version?

./llamafile-0.8.9 --cli -m /mnt/Mistral-Nemo-Instruct-2407-Q4_K_M.gguf --temp 0.2 -p "write something here:" -ngl 999 --no-display-prompt
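
For illustration only, a hypothetical override would look like the line below, where 'llama-bpe' is just a guess at a pre-tokenizer type this build already recognizes, and it may well tokenize Nemo incorrectly:

./llamafile-0.8.9 --cli -m /mnt/Mistral-Nemo-Instruct-2407-Q4_K_M.gguf --override-kv tokenizer.ggml.pre=str:llama-bpe --temp 0.2 -p "write something here:" -ngl 999 --no-display-prompt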

thanks in advance.

Version

llamafile v0.8.9

What operating system are you seeing the problem on?

Linux, Windows

Relevant log output

llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'mistral-bpe'
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/mnt/Mistral-Nemo-Instruct-2407-Q4_K_M.gguf'
raymondllu commented 4 months ago

Got exactly the same issue when loading the Mistral-Nemo-2407 model using LM Studio, which is also based on llama.cpp. Waiting for the fix!

I don't know if it's relevant, but a similar issue was reported in https://github.com/ggerganov/llama.cpp/issues/8577, btw.

jart commented 4 months ago

We're excited about Nemo too. Once support is implemented upstream, we naturally intend to find a way to incorporate it here.

wingenlit commented 3 months ago

UPDATE

llama.cpp has added support for Mistral-Nemo as of version b3436, so llamafile should be updated soon.

For information only: as a result, some earlier gguf checkpoints made with forked versions of llama.cpp might not work with the latest llama.cpp. The gguf I am using (thanks to bartowski) is tested and working; repos from others might be updated to work soon. P.S. the default context size for Mistral-Nemo is huge at 128k, which tricked me into thinking there was a memory leak the first time. I'd advise starting with a smaller context, e.g. --ctx-size 10000, and then raising it until VRAM is adequately used; see the example below.
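
For example, a first run capped at a small context could look like this (the llama-cli binary from llama.cpp b3436+ and the model filename are just my local setup):

./llama-cli -m Mistral-Nemo-Instruct-2407-Q4_K_M.gguf --ctx-size 10000 -ngl 999 --temp 0.2 -p "write something here:"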

jart commented 3 months ago

I can't cherry-pick ggerganov/llama.cpp#50e05353e88d50b644688caa91f5955e8bdb9eb9 because the code it touches has had a considerable amount of churn upstream recently. It'll have to wait until the next full synchronization with upstream. Right now I'm focused primarily on developing a new server. Contributions are welcome on backporting Nemo support. I know this feature is important too so @stlhood should probably chime in on where our priorities should be. Upstream has also been making problematic changes to ggml-cuda lately that prevent us from using it the way it's written, since upstream refused our request to add #ifdef statements that would make sync simpler by disabling features that significantly increase code size.
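
For the curious, the attempted backport amounts to something like the sketch below (the remote name is arbitrary); as noted above, it conflicts heavily because the vendored llama.cpp tree here is synced to an older upstream state:

git remote add upstream https://github.com/ggerganov/llama.cpp
git fetch upstream
git cherry-pick 50e05353e88d50b644688caa91f5955e8bdb9eb9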

wingenlit commented 3 months ago

Sorry about closing the issue earlier without knowing the details behind it. I'll wait for the problem to be resolved.

wingenlit commented 1 month ago

UPDATE: recent test results here. llamafile-0.8.13 now works with Mistral-Nemo; great! Unfortunately, it is noticeably slower than llama.cpp (my build is b3949). What am I missing here?

LLAMAFILE (compile flagged, fully offloaded)

> llamafile-0.8.13 -m Mistral-Nemo-Instruct-2407-IQ4_NL.gguf --temp 0.222 -p "write a story as long as possible" -b 512 -c 512 -n 500 -ngl 999
...
llama_print_timings:        load time =    1138.73 ms
llama_print_timings:      sample time =      32.21 ms /   500 runs   (    0.06 ms per token, 15523.61 tokens per second)
llama_print_timings: prompt eval time =      36.57 ms /     8 tokens (    4.57 ms per token,   218.78 tokens per second)
llama_print_timings:        eval time =   15008.03 ms /   499 runs   (   30.08 ms per token,    33.25 tokens per second)
llama_print_timings:       total time =   15135.38 ms /   507 tokens

LLAMA-CPP (compiled CUDA, fully offloaded)

> ./llama-cpp-b3949/llama-cli -m Mistral-Nemo-Instruct-2407--IQ4_NL.gguf --temp 0.222 -p "write a story as long as possible" -b 512 -c 512 -n 500 -ngl 999
...
llama_perf_sampler_print:    sampling time =      22.08 ms /   508 runs   (    0.04 ms per token, 23005.16 tokens per second)
llama_perf_context_print:        load time =    2006.12 ms
llama_perf_context_print: prompt eval time =      24.86 ms /     8 tokens (    3.11 ms per token,   321.84 tokens per second)
llama_perf_context_print:        eval time =    8395.15 ms /   499 runs   (   16.82 ms per token,    59.44 tokens per second)
llama_perf_context_print:       total time =    8471.78 ms /   507 tokens

Platform: Ubuntu-22.04 CUDA 12.4 CPU: 7800x3D GPU: RTX 4070 FE