Open wingenlit opened 4 months ago
Got exactly the same issue when loading the Mistral-Nemo-2407 model using LMStudio, which is also based on llama.cpp. Waiting for the fix!
I don't know if the issue reported in https://github.com/ggerganov/llama.cpp/issues/8577 is related, btw.
We're excited about Nemo too. Once support is implemented upstream, we naturally intend to find a way to incorporate it here.
UPDATE
llama.cpp has added support for Mistral-Nemo as of version b3436. Therefore, llamafile will be updated soon.
For information only: as a result, some earlier GGUF checkpoints produced with forked versions of llama.cpp might not work with the latest llama.cpp. The GGUF I am using (thanks to bartowski) is tested and working; repos from others will likely be updated soon.
p.s. the default context size for Mistral-Nemo is huge at 128k, which tricked me into thinking a memory leak had happened the first time. It is advisable to start with a smaller context size, e.g. --ctx-size 10000, and then raise it until VRAM is adequately used.
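For example, a conservative first run could look like this (same quant as in the benchmarks below; the flag values are only a starting point, not a recommendation):
> llamafile -m Mistral-Nemo-Instruct-2407-IQ4_NL.gguf -ngl 999 --ctx-size 10000
Then increase --ctx-size on subsequent runs while watching VRAM usage.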
I can't cherry-pick ggerganov/llama.cpp#50e05353e88d50b644688caa91f5955e8bdb9eb9 because the code it touches has had a considerable amount of churn upstream recently. It'll have to wait until the next full synchronization with upstream. Right now I'm focused primarily on developing a new server. Contributions are welcome on backporting Nemo support. I know this feature is important too so @stlhood should probably chime in on where our priorities should be. Upstream has also been making problematic changes to ggml-cuda lately that prevent us from using it the way it's written, since upstream refused our request to add #ifdef
statements that would make sync simpler by disabling features that significantly increase code size.
Sorry about closing the issue without the inside knowledge. I will wait for the problem to be resolved.
UPDATE: recent testing results here. llamafile-0.8.13 works with Mistral-Nemo now; great!
Unfortunately, it is distinctly slower than llama.cpp (my build is b3949). What am I missing here?
LLAMAFILE (compile flagged, fully offloaded)
> llamafile-0.8.13 -m Mistral-Nemo-Instruct-2407-IQ4_NL.gguf --temp 0.222 -p "write a story as long as possible" -b 512 -c 512 -n 500 -ngl 999
...
llama_print_timings: load time = 1138.73 ms
llama_print_timings: sample time = 32.21 ms / 500 runs ( 0.06 ms per token, 15523.61 tokens per second)
llama_print_timings: prompt eval time = 36.57 ms / 8 tokens ( 4.57 ms per token, 218.78 tokens per second)
llama_print_timings: eval time = 15008.03 ms / 499 runs ( 30.08 ms per token, 33.25 tokens per second)
llama_print_timings: total time = 15135.38 ms / 507 tokens
LLAMA-CPP (compiled CUDA, fully offloaded)
> ./llama-cpp-b3949/llama-cli -m Mistral-Nemo-Instruct-2407-IQ4_NL.gguf --temp 0.222 -p "write a story as long as possible" -b 512 -c 512 -n 500 -ngl 999
...
llama_perf_sampler_print: sampling time = 22.08 ms / 508 runs ( 0.04 ms per token, 23005.16 tokens per second)
llama_perf_context_print: load time = 2006.12 ms
llama_perf_context_print: prompt eval time = 24.86 ms / 8 tokens ( 3.11 ms per token, 321.84 tokens per second)
llama_perf_context_print: eval time = 8395.15 ms / 499 runs ( 16.82 ms per token, 59.44 tokens per second)
llama_perf_context_print: total time = 8471.78 ms / 507 tokens
Platform: Ubuntu 22.04, CUDA 12.4, CPU: 7800X3D, GPU: RTX 4070 FE
What happened?
Hi there, I have just attempted to run the new Mistral-Nemo with llamafile, using a GGUF file quantized with llama.cpp b3405. It failed with the error
unknown pre-tokenizer type: 'mistral-bpe'
(logs shown below). Is there a replacement string type to use with --override-kv tokenizer.ggml.pre=str:{some_tokenizer_type_here}, or should I just wait for future versions? Thanks in advance.
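For reference, I assume the general shape of such an override would be something like the line below, where the model path is a placeholder and the value after str: is purely illustrative (llama-bpe is an existing pre-tokenizer type in llama.cpp, not a known fix for Nemo):
> llamafile -m <nemo-model>.gguf --override-kv tokenizer.ggml.pre=str:llama-bpe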
Version
llamafile v0.8.9
What operating system are you seeing the problem on?
Linux, Windows
Relevant log output