b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
MIT License
1.02k stars · 68 forks

feat: add to tokenizer chat configuration. #76

Closed b4rtaz closed 1 month ago

b4rtaz commented 1 month ago

This PR extends the tokenizer file format: it is now possible to embed the chat configuration in the tokenizer file.

...
 seqLen: 8192
💔 nSlices: 1
💔 ropeTheta: 500000.0
📄 chatTemplate[0]: 
📄 chatTemplate[1]: <|start_header_id|>
📄 chatTemplate[2]: <|end_header_id|>

📄 chatTemplate[3]: <|eot_id|>
📄 chatTemplate[4]: <|start_header_id|>assistant<|end_header_id|>

📄 bosId: 128000
📄 eosId: 128001
📄 chatEosId: 128009
🕒 ropeCache: 131072 kB
⏩ Loaded 6175568 kB
DifferentialityDevelopment commented 1 month ago

Do you maybe know how I'd do the tokenizer conversion for models that don't have a tokenizer.model file?

b4rtaz commented 1 month ago

@DifferentialityDevelopment I think there is always a tokenizer somewhere, but the format is not always obvious.

I'm trying to convert the tokenizer of the hermes model that you linked. I created a new converter that uses tokenizer_config.json and tokenizer.json files.

How to convert the tokenizer:

python3 convert-tokenizer-hf.py /Users/b4rtaz/Downloads/Hermes-2-Theta-Llama-3-8B hermes
ā­ Found chat template:

{{bos_token}}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}

ā­ To create the tokenizer file you need to manually specify chat template values. Enter \n for new line.
ā© Enter value for chat template key "chat_message_start":

ā© Enter value for chat template key "chat_role_start":
<|im_start|>
ā© Enter value for chat template key "chat_role_end":
\n
ā© Enter value for chat template key "chat_message_end":
<|im_end|>\n
ā© Enter value for chat template key "chat_generation_prompt":
<|im_start|>assistant\n
{'bos_id': 128000, 'eos_id': 128003, 'chat_eos_id': 128003, 'version': 0, 'vocab_size': 128256, 'max_token_length': 256, 'chat_template': 5}
{'chat_message_start': '', 'chat_role_start': '<|im_start|>', 'chat_role_end': '\n', 'chat_message_end': '<|im_end|>\n', 'chat_generation_prompt': '<|im_start|>assistant\n'}
āœ… Created dllama_tokenizer_hermes.t
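The five chat-template values entered above compose the prompt the same way the Jinja template does. A minimal sketch of that composition (hypothetical helper, not the actual distributed-llama code):

```python
def render_chat(messages, tpl, add_generation_prompt=True):
    # Compose a chat prompt from the template parts entered during
    # conversion. Hypothetical reimplementation of the template keys;
    # the runtime side of distributed-llama may assemble it differently.
    out = []
    for m in messages:
        out.append(tpl["chat_message_start"])
        out.append(tpl["chat_role_start"] + m["role"] + tpl["chat_role_end"])
        out.append(m["content"])
        out.append(tpl["chat_message_end"])
    if add_generation_prompt:
        # The assistant header that cues the model to start generating.
        out.append(tpl["chat_generation_prompt"])
    return "".join(out)
```

With the Hermes values above, a single user message renders as `<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n`, which matches what the Jinja template produces.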

So far I have:

b4rtaz@b4rtazs-MacBook-Pro examples % node chat-api-client.js
> system: You are an excellent math teacher.
> user: What is 1 + 2?
{ completion_tokens: 128, prompt_tokens: 54, total_tokens: 182 }
ĠD1Ġ+ĠD2ĠisĠtheĠsumĠofĠtwoĠdistances,ĠD1ĠandĠD2.ĠItĠisĠaĠconceptĠusedĠinĠgeometryĠandĠtrigonometryĠtoĠrelateĠtheĠlengthsĠofĠtwoĠsidesĠofĠaĠtriangle.ĠTheĠformulaĠforĠD1Ġ+ĠD2Ġis:ĠD1Ġ+ĠD2Ġ=Ġsqrt((x2Ġ-Ġx1)^2Ġ+Ġ(y2Ġ-Ġy1)^2),ĠwhereĠ(x1,Ġy1)ĠandĠ(x2,Ġy2)ĠareĠtheĠcoordinatesĠofĠtheĠtwoĠpoints.ĠThisĠformulaĠisĠusedĠtoĠfindĠtheĠdistanceĠbetweenĠtwoĠpointsĠinĠaĠtwo-dimensionalĠspace.ĠDoĠyouĠhaveĠanyĠspecificĠquestionsĠaboutĠthisĠconcept?Ġ<|im_end

If I manually replace `Ġ` => ` ` (space):

 D1 + D2 is the sum of two distances, D1 and D2. It is a concept used in geometry and trigonometry to relate the lengths of two sides of a triangle. The formula for D1 + D2 is: D1 + D2 = sqrt((x2 - x1)^2 + (y2 - y1)^2), where (x1, y1) and (x2, y2) are the coordinates of the two points. This formula is used to find the distance between two points in a two-dimensional space. Do you have any specific questions about this concept? <|im_end

The tokenizer is not the easy part here. :)
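The `Ġ` characters (which can also show up as `Ä ` when mis-decoded as Latin-1) come from byte-level BPE: the vocabulary stores raw bytes remapped to printable unicode characters, and byte 0x20 (space) maps to `Ġ` (U+0120). A sketch of the standard GPT-2-style byte mapping and its inverse, which would turn raw token strings like the output above back into plain text:

```python
def gpt2_bytes_to_unicode():
    # Standard GPT-2 byte-level BPE table: every byte 0..255 gets a
    # printable unicode character; printable ASCII and Latin-1 map to
    # themselves, everything else is shifted past 256.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # 0x20 (space) ends up at U+0120, 'Ġ'
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

def decode_byte_level(token_str):
    # Invert the mapping to recover the real bytes of a vocab entry,
    # then decode them as UTF-8.
    inv = {v: k for k, v in gpt2_bytes_to_unicode().items()}
    return bytes(inv[ch] for ch in token_str).decode("utf-8")
```

So `decode_byte_level("ĠD1")` yields `" D1"`, which suggests the raw vocab strings just need to pass through this inverse mapping when the tokenizer is converted or when tokens are detokenized.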

DifferentialityDevelopment commented 1 month ago

> (quoting @b4rtaz's comment above)

You're definitely closer than I got; mine flat out crashed when trying to use the converted tokenizer.

I'll see what I can do to help.

b4rtaz commented 1 month ago

Ok, now after I manually replaced all `Ġ` => ` ` (space) in `tokenizer.config` and executed the converter:

python3 convert-tokenizer-hf.py /Users/b4rtaz/Downloads/Hermes-2-Theta-Llama-3-8B hermes
ā­ Found chat template:

{{bos_token}}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}

ā­ To create the tokenizer file you need to manually specify chat template values. Enter \n for new line.
ā© Enter value for chat template key "chat_message_start":

ā© Enter value for chat template key "chat_role_start":
<|im_start|>
ā© Enter value for chat template key "chat_role_end":
\n
ā© Enter value for chat template key "chat_message_end":
<|im_end|>\n
ā© Enter value for chat template key "chat_generation_prompt":
<|im_start|>assistant\n
ā© Enter value for chat template key "chat_extra_stop":
<|im_start|>
{'bos_id': 128000, 'eos_id': 128003, 'chat_eos_id': 128003, 'version': 0, 'vocab_size': 128256, 'max_token_length': 192, 'chat_template': 6}
{'chat_message_start': '', 'chat_role_start': '<|im_start|>', 'chat_role_end': '\n', 'chat_message_end': '<|im_end|>\n', 'chat_generation_prompt': '<|im_start|>assistant\n', 'chat_extra_stop': '<|im_start|>'}
āœ… Created dllama_tokenizer_hermes.t
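With the new `chat_extra_stop` key, generation presumably stops not only on `chat_eos_id` but also when the extra stop string appears in the decoded output. A hypothetical helper showing that kind of check (assumed behaviour, not the project's actual implementation):

```python
def truncate_at_stop(text, stop_sequences):
    # Cut the generated text at the earliest occurrence of any stop
    # sequence; return the text unchanged if none is found.
    cut = len(text)
    for stop in stop_sequences:
        i = text.find(stop)
        if i != -1 and i < cut:
            cut = i
    return text[:cut]
```

For Hermes, the stop list would be the `<|im_end|>` message terminator plus the extra `<|im_start|>` stop, so the model cannot run on into a fabricated next turn.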

It seems Hermes 2 works quite well.

[image: screenshot of the Hermes 2 chat output]
DifferentialityDevelopment commented 1 month ago

Awesome stuff @b4rtaz!