ggerganov / llama.cpp

LLM inference in C/C++

convert-hf-to-gguf.py breaks on phi-2 #7219

Open CrispStrobe opened 6 months ago

CrispStrobe commented 6 months ago

This was possible earlier, before the BPE pre-tokenizer fixes. Now it fails with:

      File "/kaggle/working/llama.cpp/./convert-hf-to-gguf.py", line 432, in get_vocab_base_pre
        raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
    NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

I thought this would easily be solved by updating the hashes, but I cannot get past "llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'phi2'", seemingly not without code changes.

How is this supposed to be done? Like so? And why does the script break when there is no match for the pre-tokenizer string, instead of just falling back to a default, as illustrated in this diff?
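For context, the check that raises this error works roughly like the sketch below: the tokenizer encodes a fixed probe string, the resulting token ids are hashed, and the hash is matched against a list of known pre-tokenizers. The probe text and hash value here are placeholders, and the commented-out fallback is the hypothetical "just default" behavior I'm asking about, not what the script does.

```python
# Simplified sketch of get_vocab_base_pre() in convert-hf-to-gguf.py.
# Placeholder probe text and hash; the real script has a long probe string
# and one branch per known tokenizer.
from hashlib import sha256

def get_vocab_base_pre(tokenizer) -> str:
    chktxt = "..."  # fixed probe text shared with the update script
    chktok = tokenizer.encode(chktxt)
    chkhsh = sha256(str(chktok).encode()).hexdigest()

    res = None
    if chkhsh == "<known-hash-for-llama-bpe>":  # placeholder value
        res = "llama-bpe"
    # ... one branch per known pre-tokenizer ...

    if res is None:
        # Hypothetical fallback: return a default such as "gpt-2" here instead
        # of raising. The script deliberately raises so a wrong pre-tokenizer
        # is not silently applied.
        raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")

    return res
```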

linpan commented 6 months ago

BPE pre-tokenizer was not recognized

teleprint-me commented 6 months ago

Trying to fix it. Keep tabs on my PR #7117. If Phi-1 works, then Phi-1.5 and Phi-2 should work as well; they all use the same vocab. Phi-1 is registering as Phi-2, but Phi-1.5 and Phi-2 do not. What makes this super weird is that the Phi-2 vocab registers as its own instead of as the gpt-2 vocab like it's supposed to. @mofosyne This is definitely a bug.

23:14:05 | /mnt/valerie/forked/ggerganov/llama.cpp
(.venv) git:(add-stablelm-hash | Δ) λ ./main --color -e -s 1337 -c 256 -n 256 -p "Create a function that returns a list of a prime numbers based on a given input in Python" -m /mnt/valerie/models/microsoft/phi-2/ggml-model-f16.gguf
Log start
main: build = 2893 (dc020985)
main: built with cc (GCC) 14.1.1 20240507 for x86_64-pc-linux-gnu
main: seed  = 1337
llama_model_loader: loaded meta data with 21 key-value pairs and 453 tensors from /mnt/valerie/models/microsoft/phi-2/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = Phi2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2560
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 10240
llama_model_loader: - kv   5:                           phi2.block_count u32              = 32
llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:          phi2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  phi2.rope.dimension_count u32              = 32
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                         tokenizer.ggml.pre str              = phi-2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 50256
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  259 tensors
llama_model_loader: - type  f16:  194 tensors
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'phi-2'
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/mnt/valerie/models/microsoft/phi-2/ggml-model-f16.gguf'
main: error: unable to load model

The pre-tokenizer is registered as phi-2 instead of gpt-2.

llama_model_loader: - kv  13:                         tokenizer.ggml.pre str              = phi-2
teleprint-me commented 6 months ago

Okay, yeah. This is definitely a bug. I was able to fix it.

23:29:52 | /mnt/valerie/forked/ggerganov/llama.cpp
(.venv) git:(add-stablelm-hash | Δ) λ ./main --color -e -s 1337 -c 256 -n 256 -p "Create a function that returns a list of a prime numbers based on a given input in Python" -m /mnt/valerie/models/microsoft/phi-2/ggml-model-f16.gguf
Log start
main: build = 2893 (dc020985)
main: built with cc (GCC) 14.1.1 20240507 for x86_64-pc-linux-gnu
main: seed  = 1337
llama_model_loader: loaded meta data with 21 key-value pairs and 453 tensors from /mnt/valerie/models/microsoft/phi-2/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = Phi2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2560
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 10240
llama_model_loader: - kv   5:                           phi2.block_count u32              = 32
llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:          phi2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  phi2.rope.dimension_count u32              = 32
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                         tokenizer.ggml.pre str              = gpt-2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 50256
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  259 tensors
llama_model_loader: - type  f16:  194 tensors
llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 51200
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2560
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_embd_head_k    = 80
llm_load_print_meta: n_embd_head_v    = 80
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2560
llm_load_print_meta: n_embd_v_gqa     = 2560
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 10240
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 2.78 B
llm_load_print_meta: model size       = 5.18 GiB (16.01 BPW) 
llm_load_print_meta: general.name     = Phi2
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_tensors: ggml ctx size =    0.21 MiB
llm_load_tensors:        CPU buffer size =  5303.65 MiB
.............................................................................................
llama_new_context_with_model: n_ctx      = 256
llama_new_context_with_model: n_batch    = 256
llama_new_context_with_model: n_ubatch   = 256
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    80.00 MiB
llama_new_context_with_model: KV self size  =   80.00 MiB, K (f16):   40.00 MiB, V (f16):   40.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.20 MiB
llama_new_context_with_model:        CPU compute buffer size =    52.50 MiB
llama_new_context_with_model: graph nodes  = 1161
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 256, n_batch = 2048, n_predict = 256, n_keep = 0

Create a function that returns a list of a prime numbers based on a given input in Python.

```python
def prime_numbers(n):
    primes = []
    for i in range(2, n):
        is_prime = True
        for j in range(2, i):
            if i % j == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(i)
    return primes

print(prime_numbers(20)) # [2, 3, 5, 7, 11, 13, 17, 19]

Exercise 3:

Create a function that takes two arguments and returns their product using recursion.

def recursive_product(a, b):
    if b == 0:
        return 1
    else:
        return a + recursive_product(a, b - 1)

print(recursive_product(5, 4)) # 20

Exercise 4:

Create a function that takes a list of integers and returns a new list with only even numbers using list comprehension.

def even_numbers(a):
    return [n for
llama_print_timings:        load time =     363.29 ms
llama_print_timings:      sample time =       5.65 ms /   256 runs   (    0.02 ms per token, 45309.73 tokens per second)
llama_print_timings: prompt eval time =     223.51 ms /    18 tokens (   12.42 ms per token,    80.53 tokens per second)
llama_print_timings:        eval time =   32266.33 ms /   255 runs   (  126.53 ms per token,     7.90 tokens per second)
llama_print_timings:       total time =   32525.11 ms /   273 tokens
Log end
CrispStrobe commented 6 months ago

Thanks. But now we seemingly have a mismatch as in https://github.com/ggerganov/llama.cpp/issues/4622#issuecomment-1868732668

teleprint-me commented 6 months ago

@CrispStrobe No. This is a different issue. The tokenizers are hashed and then identified by that hash; the model configuration is registered into a factory and then processed. The vocabulary metadata isn't being identified the right way. The issue you linked is related to the llama.cpp runtime, whereas this issue is about conversion metadata, so the vocab mismatch is out of scope here.

Edit: Now you have me wondering whether the vocab mismatch is related, 😅.

teleprint-me commented 6 months ago

Yeah, I think I found it.

@Model.register("PhiForCausalLM")
class Phi2Model(Model):
    # omitting for brevity

    def set_gguf_parameters(self):
        # omitting for brevity

        self.gguf_writer.add_name("Phi2")
        self.gguf_writer.add_tokenizer_pre("gpt-2")  # <- Need this
        self.gguf_writer.add_context_length(self.find_hparam(["n_positions", "max_position_embeddings"]))

        # omitting for brevity

Need feedback.
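As a quick way to confirm what the converter actually wrote, without going through `main`, something like this should work (a minimal sketch using the `gguf` package from `gguf-py`; the string-field extraction is an assumption about `GGUFReader`'s current layout):

```python
# Rough sanity check of the converted file's tokenizer metadata using the
# gguf Python package from gguf-py. Pulling the string value out of a
# ReaderField via its last part is an assumption about the reader's layout.
from gguf import GGUFReader

reader = GGUFReader("ggml-model-f16.gguf")  # path to the converted model

for key in ("tokenizer.ggml.model", "tokenizer.ggml.pre"):
    field = reader.fields.get(key)
    if field is None:
        print(f"{key}: <missing>")
        continue
    # For simple string fields, the last part holds the value bytes.
    print(f"{key}: {bytes(field.parts[-1]).decode('utf-8')}")
```

With the patch applied, `tokenizer.ggml.pre` should come back as `gpt-2` rather than `phi-2`.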

CrispStrobe commented 6 months ago

Doesn't this lead to the same thing as in my above diff, once we arrive in llama.cpp at the switch (vocab.type) in struct llm_tokenizer_bpe? But you seem to be much more familiar with the codebase. I was also wondering about set_vocab.

teleprint-me commented 6 months ago

Yes, you're right. You're probably using the master branch.

First, the hash needs to be included for the vocab.

Then the line for adding the pre-tokenizer needs to be added as well.

Then voilà! It should work. ✨

You can do this by pulling in my forked branch or doing it manually with the branch you're using and referencing the changes I made in my fork. Doing it manually is really involved if you're unfamiliar with the code base.

The update script would need the line for generating the hash.

    {"name": "phi",            "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/microsoft/phi-1", },

The convert script would need get_vocab_base_pre to match the checksum.

        if chkhsh == "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085":
            # ref: https://huggingface.co/microsoft/phi-1
            res = "phi"

The convert script would need the pre-tokenizer metadata in set_gguf_parameters.

        self.gguf_writer.add_tokenizer_pre("gpt-2")

And that's it.
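For reference, the checksum used in that check is derived roughly as follows (a minimal sketch mirroring what the update script does: encode a fixed probe string and hash the resulting token ids; the probe text below is a placeholder for the real, much longer test string):

```python
# Sketch of how the pre-tokenizer checksum is computed. The probe text is a
# placeholder; the real scripts use a long, fixed test string, and changing
# it changes every hash.
from hashlib import sha256
from transformers import AutoTokenizer

chktxt = "..."  # placeholder for the fixed probe text used by the scripts

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1")
chktok = tokenizer.encode(chktxt)
chkhsh = sha256(str(chktok).encode()).hexdigest()
print(chkhsh)  # the value get_vocab_base_pre() compares against
```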

CrispStrobe commented 6 months ago

Thanks. Your example above seems to mix Phi-1 and Phi-2, though? My workaround was either to use an older llama.cpp version for Phi-2 or to use the fix linked above, though I was unsure whether it was consistent with the overall logic. I did not hardcode the hashes but generated them on the fly via the update script, like here, in the hope that when training runs or model merges change some tokenizer setup, everything would still work. But maybe there are flaws in this approach?

teleprint-me commented 6 months ago

My patch works for Phi-1, Phi-1.5, and Phi-2. They all use the same vocab.

teleprint-me commented 6 months ago

I took a look at your notebook and noticed you're using a custom model, 'phi-2-spaetzle-v4'. Make a mental note that if the vocab changes, so does the hash. You might want to familiarize yourself with doing it manually in case you ever need to modify it for some reason or another.