brittlewis12 closed this issue 8 months ago
Whoops, it was indeed my mistake in the conversion!
Turns out that, while the base instruct model uses a fast tokenizer, this model instead uses the regular llama tokenizer, which means I should've converted with BPE!
Reconverted & quantized, and what do you know, it runs great.
I doubt the crash is worth investigating on its own, given the incorrectly produced model file.
But maybe there could be a way to detect this sort of mistake at conversion time and short-circuit the process? Automatic vocab-type detection would be beneficial, but that's out of scope for this issue.
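A conversion-time check along those lines could be quite simple. This is only a sketch of the idea, not llama.cpp's actual conversion logic: `guess_vocab_type` is a hypothetical helper that infers the vocab type from which tokenizer files the model directory ships (`tokenizer.model` for SentencePiece/SPM, `tokenizer.json` for an HF fast/BPE tokenizer), so the converter could warn or abort when the requested vocab type disagrees with the guess.

```python
# Sketch of a conversion-time vocab-type sanity check. guess_vocab_type is a
# hypothetical helper, not part of llama.cpp's convert script.
from pathlib import Path

def guess_vocab_type(model_dir: str) -> str:
    """Infer the likely vocab type from the tokenizer files present."""
    d = Path(model_dir)
    if (d / "tokenizer.model").exists():   # SentencePiece model file -> SPM
        return "spm"
    if (d / "tokenizer.json").exists():    # HF fast tokenizer file -> BPE
        return "bpe"
    raise ValueError("no recognizable tokenizer files found")

# Usage: compare the guess against the vocab type the user asked for.
import tempfile, os
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "tokenizer.json"), "w").close()
    print(guess_vocab_type(tmp))  # -> bpe
```

A real implementation would also need to handle repos that ship both files (where either conversion can be valid), which is why full auto-detection is harder than this sketch suggests.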
> reconverted & quantized and what do you know, it runs great.
Huh, that's surprising. There is a long-pending PR that I thought needed to be merged to support DeepSeek models: #5464. AFAICT it should fix some tokenization problems and add conversion support.
I'm surprised that it worked for you.
The updated fp16 conversion and quants just finished uploading: hf link
It does seem to work fine though! I haven't tested it too extensively, but:
that script just calls `main` with `--in-prefix`/`--in-suffix`, `-ngl`, `--temp`, etc.
Model: OpenCodeInterpreter-DS-6.7B (GGUFs)
This is a DeepSeek Coder instruct-based model, llama arch, but maybe there's something distinct about it that requires special handling?
Or maybe I did something wrong in converting these files from the original safetensors (I used the same build, b2249, for converting, quantizing, and running).
Both `-ngl 999` and `-ngl 0` produce the same exception.

llama.cpp build info: b2249 (rev: `15499eb94227401bdc8875da6eb85c15d37068f7`), `LLAMA_METAL=1`
lldb stacktrace
full lldb output from `./main`:
```
(lldb) target create "./main"
Current executable set to '/Users/tito/code/llama.cpp/main' (arm64).
(lldb) settings set -- target.run-args "-m" "/Users/tito/code/autogguf/OpenCodeInterpreter-DS-6.7B/opencodeinterpreter-ds-6.7b.Q4_K_M.gguf" "-t" "7" "--color" "--ctx_size" "4096" "--keep" "4" "--in-prefix" "<|User|>\\n" "--in-suffix" "\\n<|Assistant|>\\n" "-r" "<|User|>" "-r" "<|Assistant|>" "-r" "<|EOT|>" "-ins" "-b" "512" "-n" "-1" "--temp" "0.7" "--repeat_penalty" "1.1" "-ngl" "0"
(lldb) breakpoint set -E C++
Breakpoint 1: no locations (pending).
(lldb) run
Process 25487 launched: '/Users/tito/code/llama.cpp/main' (arm64)
2 locations added to breakpoint 1
Log start
main: build = 2249 (15499eb9)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin23.3.0
main: seed  = 1708707124
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from /Users/tito/code/autogguf/OpenCodeInterpreter-DS-6.7B/opencodeinterpreter-ds-6.7b.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture                     str = llama
llama_model_loader: - kv   1: general.name                             str = .
llama_model_loader: - kv   2: llama.context_length                     u32 = 16384
llama_model_loader: - kv   3: llama.embedding_length                   u32 = 4096
llama_model_loader: - kv   4: llama.block_count                        u32 = 32
llama_model_loader: - kv   5: llama.feed_forward_length                u32 = 11008
llama_model_loader: - kv   6: llama.rope.dimension_count               u32 = 128
llama_model_loader: - kv   7: llama.attention.head_count               u32 = 32
llama_model_loader: - kv   8: llama.attention.head_count_kv            u32 = 32
llama_model_loader: - kv   9: llama.attention.layer_norm_rms_epsilon   f32 = 0.000001
llama_model_loader: - kv  10: llama.rope.freq_base                     f32 = 100000.000000
llama_model_loader: - kv  11: llama.rope.scaling.type                  str = linear
llama_model_loader: - kv  12: llama.rope.scaling.factor                f32 = 4.000000
llama_model_loader: - kv  13: general.file_type                        u32 = 15
llama_model_loader: - kv  14: tokenizer.ggml.model                     str = llama
llama_model_loader: - kv  15: tokenizer.ggml.tokens                    arr[str,32256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16: tokenizer.ggml.scores                    arr[f32,32256] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17: tokenizer.ggml.token_type                arr[i32,32256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18: tokenizer.ggml.bos_token_id              u32 = 32013
llama_model_loader: - kv  19: tokenizer.ggml.eos_token_id              u32 = 32021
llama_model_loader: - kv  20: tokenizer.ggml.padding_token_id          u32 = 32014
llama_model_loader: - kv  21: tokenizer.chat_template                  str = {%- set found_item = false -%}\n{%- fo...
llama_model_loader: - kv  22: general.quantization_version             u32 = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
Process 25487 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x0000000188223330 libc++abi.dylib`__cxa_throw
libc++abi.dylib`__cxa_throw:
->  0x188223330 <+0>:  pacibsp
    0x188223334 <+4>:  stp    x22, x21, [sp, #-0x30]!
    0x188223338 <+8>:  stp    x20, x19, [sp, #0x10]
    0x18822333c <+12>: stp    x29, x30, [sp, #0x20]
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x0000000188223330 libc++abi.dylib`__cxa_throw
    frame #1: 0x00000001000684c0 main`std::__1::__throw_out_of_range[abi:v160006](char const*) + 60
    frame #2: 0x000000010006a790 main`llama_byte_to_token(llama_vocab const&, unsigned char) + 472
    frame #3: 0x000000010003d270 main`llama_model_load(std::__1::basic_string
```
conversion info
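The backtrace above ends in `llama_byte_to_token` throwing `std::out_of_range`, which is consistent with the vocab-type mismatch: with `tokenizer.ggml.model = llama` (SPM), byte fallback looks up byte tokens spelled like `<0x0A>` in the vocab, and a vocab converted with the wrong vocab type typically has no such entries, so the lookup throws. The snippet below is only a rough Python illustration of that failure mode under these assumptions; the names are illustrative, not llama.cpp's actual identifiers.

```python
# Rough illustration of the byte-fallback lookup that crashes above.
# Identifiers and vocab layouts here are illustrative assumptions, not
# llama.cpp's actual code.

# SPM-style vocab: carries explicit byte tokens "<0x00>".."<0xFF>".
spm_vocab = {f"<0x{b:02X}>": 3 + b for b in range(256)}
# BPE-shaped vocab: plain strings only, no "<0x..>" entries.
bpe_vocab = {"!": 0, '"': 1, "#": 2}

def byte_to_token(vocab: dict, byte: int) -> int:
    """Mimics the SPM byte-fallback lookup; a KeyError here plays the
    role of the std::out_of_range throw in the C++ backtrace."""
    return vocab[f"<0x{byte:02X}>"]

print(byte_to_token(spm_vocab, 0x0A))   # -> 13: present in an SPM vocab
try:
    byte_to_token(bpe_vocab, 0x0A)      # missing in a BPE-shaped vocab
except KeyError as e:
    print("lookup failed:", e)
```

This matches the fix described in the comments: reconverting with the vocab type that actually corresponds to the model's tokenizer makes the lookup succeed.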