THUDM / LongWriter

LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs

Error making gguf: KeyError: '<|user|>' #14

Open arch-btw opened 3 months ago

arch-btw commented 3 months ago

System Info

transformers: 4.44.0
llama.cpp: latest

Hi, when I try to convert the model to GGUF I get this error:

    Traceback (most recent call last):
      File "/home/david/llm/llama.cpp/convert_hf_to_gguf.py", line 4074, in <module>
        main()
      File "/home/david/llm/llama.cpp/convert_hf_to_gguf.py", line 4068, in main
        model_instance.write()
      File "/home/david/llm/llama.cpp/convert_hf_to_gguf.py", line 388, in write
        self.prepare_metadata(vocab_only=False)
      File "/home/david/llm/llama.cpp/convert_hf_to_gguf.py", line 381, in prepare_metadata
        self.set_vocab()
      File "/home/david/llm/llama.cpp/convert_hf_to_gguf.py", line 3713, in set_vocab
        special_vocab._set_special_token("eot", tokenizer.get_added_vocab()["<|user|>"])
    KeyError: '<|user|>'

Do you know how to fix this?

On Hugging Face, someone else has the same problem:

https://huggingface.co/THUDM/LongWriter-glm4-9b/discussions/1#66bc33eccd16fda66e7caa1f

But I don't know how to apply this solution:

Hi! You can get the token id by tokenizer.get_command("<|user|>").

Is the EOT token even needed?

Thank you!

Who can help?

No response

Information

Reproduction

With llama.cpp:

python convert_hf_to_gguf.py /home/david/llm/LongWriter-glm4-9b --outtype f32

Here is the relevant code from convert_hf_to_gguf.py:

    special_vocab = gguf.SpecialVocab(dir_model, load_merges=False)
    special_vocab.merges = merges
    # only add special tokens when they were not already loaded from config.json
    special_vocab._set_special_token("eos", tokenizer.get_added_vocab()["<|endoftext|>"])
    special_vocab._set_special_token("eot", tokenizer.get_added_vocab()["<|user|>"])
    # this one is usually not in config.json anyway
    special_vocab._set_special_token("unk", tokenizer.get_added_vocab()["<|endoftext|>"])
    special_vocab.add_to_gguf(self.gguf_writer)

Expected behavior

For the conversion to succeed so that the model can be quantized.
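For example, the follow-up quantization step would be something like this (a sketch; the file names here are just illustrative, and recent llama.cpp builds name the tool llama-quantize):

    ./llama-quantize LongWriter-glm4-9b-F32.gguf LongWriter-glm4-9b-Q4_K_M.gguf Q4_K_M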

bys0318 commented 3 months ago

Hi! You can get the token id by tokenizer.get_command("<|user|>").
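For example, something like this should print the id (the tokenizer ships custom code, so trust_remote_code=True is needed; you can also point from_pretrained at your local checkout instead of the Hub id):

    from transformers import AutoTokenizer

    # Load the LongWriter-glm4-9b tokenizer; its custom code provides get_command()
    tokenizer = AutoTokenizer.from_pretrained(
        "THUDM/LongWriter-glm4-9b", trust_remote_code=True
    )

    # Look up the special-token id for <|user|>
    print(tokenizer.get_command("<|user|>"))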

echnio commented 3 months ago

Hi! You can get the token id by tokenizer.get_command("<|user|>").

Hi, how do I fix it? Thanks!

bys0318 commented 3 months ago

Have you updated to our most recent model files? Also, please use transformers>=4.43.0.
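For example:

    pip install --upgrade "transformers>=4.43.0"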

arch-btw commented 3 months ago

@bys0318 thank you, it appears that the token id is:

151336

Is this correct?

in llama.cpp:

llm_load_print_meta: general.name     = LongWriter Glm4 9b
llm_load_print_meta: EOS token        = 151329 '<|endoftext|>'
llm_load_print_meta: UNK token        = 151329 '<|endoftext|>'
llm_load_print_meta: PAD token        = 151329 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 151336 '[PAD151336]'
llm_load_print_meta: max token length = 1024

@echnio

I did this: find the lines starting at 3711 in convert_hf_to_gguf.py and replace them with:

    # only add special tokens when they were not already loaded from config.json
    special_vocab._set_special_token("eos", tokenizer.get_added_vocab()["<|endoftext|>"])
    token_id = tokenizer.get_command("<|user|>")
    print(token_id)
    special_vocab._set_special_token("eot", token_id)
    # this one is usually not in config.json anyway
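
The rest of set_vocab stays unchanged. After this replacement, rerunning the convert command prints 151336 and the conversion completes.
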
bys0318 commented 3 months ago

This is correct. Thanks for sharing!

echnio commented 3 months ago

@arch-btw Thank you very much, the format conversion was successful.

aashish-1904 commented 3 months ago

@arch-btw Thanks for the detailed steps! I was able to convert the model. Please find the quants at QuantFactory/LongWriter-glm4-9b-GGUF.