Closed · b4rtaz closed 1 month ago
Do you maybe know how I'd do the tokenizer conversion for models that don't have a tokenizer.model file?
@DifferentialityDevelopment I think there is always a tokenizer somewhere, but the format is not always obvious.
I'm trying to convert the tokenizer of the Hermes model that you linked. I created a new converter that uses the `tokenizer_config.json` and `tokenizer.json` files.
How to convert the tokenizer:
python3 convert-tokenizer-hf.py /Users/b4rtaz/Downloads/Hermes-2-Theta-Llama-3-8B hermes
⭐ Found chat template:
{{bos_token}}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
⭐ To create the tokenizer file you need to manually specify chat template values. Enter \n for new line.
⏩ Enter value for chat template key "chat_message_start":
⏩ Enter value for chat template key "chat_role_start":
<|im_start|>
⏩ Enter value for chat template key "chat_role_end":
\n
⏩ Enter value for chat template key "chat_message_end":
<|im_end|>\n
⏩ Enter value for chat template key "chat_generation_prompt":
<|im_start|>assistant\n
{'bos_id': 128000, 'eos_id': 128003, 'chat_eos_id': 128003, 'version': 0, 'vocab_size': 128256, 'max_token_length': 256, 'chat_template': 5}
{'chat_message_start': '', 'chat_role_start': '<|im_start|>', 'chat_role_end': '\n', 'chat_message_end': '<|im_end|>\n', 'chat_generation_prompt': '<|im_start|>assistant\n'}
⭐ Created dllama_tokenizer_hermes.t
So far I have:
b4rtaz@b4rtazs-MacBook-Pro examples % node chat-api-client.js
> system: You are an excellent math teacher.
> user: What is 1 + 2?
{ completion_tokens: 128, prompt_tokens: 54, total_tokens: 182 }
ĠD1Ġ+ĠD2ĠisĠtheĠsumĠofĠtwoĠdistances,ĠD1ĠandĠD2.ĠItĠisĠaĠconceptĠusedĠinĠgeometryĠandĠtrigonometryĠtoĠrelateĠtheĠlengthsĠofĠtwoĠsidesĠofĠaĠtriangle.ĠTheĠformulaĠforĠD1Ġ+ĠD2Ġis:ĠD1Ġ+ĠD2Ġ=Ġsqrt((x2Ġ-Ġx1)^2Ġ+Ġ(y2Ġ-Ġy1)^2),ĠwhereĠ(x1,Ġy1)ĠandĠ(x2,Ġy2)ĠareĠtheĠcoordinatesĠofĠtheĠtwoĠpoints.ĠThisĠformulaĠisĠusedĠtoĠfindĠtheĠdistanceĠbetweenĠtwoĠpointsĠinĠaĠtwo-dimensionalĠspace.ĠDoĠyouĠhaveĠanyĠspecificĠquestionsĠaboutĠthisĠconcept?Ġ<|im_end
If I manually replace `Ġ` with a space, I get:
D1 + D2 is the sum of two distances, D1 and D2. It is a concept used in geometry and trigonometry to relate the lengths of two sides of a triangle. The formula for D1 + D2 is: D1 + D2 = sqrt((x2 - x1)^2 + (y2 - y1)^2), where (x1, y1) and (x2, y2) are the coordinates of the two points. This formula is used to find the distance between two points in a two-dimensional space. Do you have any specific questions about this concept? <|im_end
The tokenizer is not the easy part here. :)
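The `Ġ` characters above are not corruption in the model itself: in GPT-2-style byte-level BPE vocabularies, `Ġ` (U+0120) encodes a leading space, and raw vocab strings must be mapped back through the byte-to-unicode table before display. A sketch of that decoding (assuming the standard GPT-2 byte-level mapping; `decode_token` is a hypothetical helper, not part of the converter):

```python
# Sketch, assuming a GPT-2-style byte-level BPE vocab, where each raw
# byte is represented by a printable unicode character ('Ġ' = space,
# 'Ċ' = newline, etc.). This rebuilds the standard byte<->unicode
# mapping and decodes a raw vocab string back to plain text.
def bytes_to_unicode():
    # printable bytes map to themselves; the rest are shifted past U+0100
    bs = list(range(ord('!'), ord('~') + 1)) + \
         list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

UNICODE_TO_BYTE = {c: b for b, c in bytes_to_unicode().items()}

def decode_token(s):
    return bytes(UNICODE_TO_BYTE[ch] for ch in s).decode('utf-8')

print(repr(decode_token('ĠD1')))  # ' D1' -- the Ġ decodes to a plain space
```

So the garbled client output suggests the tokens were emitted as raw vocab strings without this byte-level decoding step.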
You're definitely closer than I got; mine flat out crashed when trying to use the converted tokenizer.
I'll see what I can do to help.
Ok, now after I manually replaced all `Ġ` => ` ` in `tokenizer.config` and executed the converter:
python3 convert-tokenizer-hf.py /Users/b4rtaz/Downloads/Hermes-2-Theta-Llama-3-8B hermes
⭐ Found chat template:
{{bos_token}}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
⭐ To create the tokenizer file you need to manually specify chat template values. Enter \n for new line.
⏩ Enter value for chat template key "chat_message_start":
⏩ Enter value for chat template key "chat_role_start":
<|im_start|>
⏩ Enter value for chat template key "chat_role_end":
\n
⏩ Enter value for chat template key "chat_message_end":
<|im_end|>\n
⏩ Enter value for chat template key "chat_generation_prompt":
<|im_start|>assistant\n
⏩ Enter value for chat template key "chat_extra_stop":
<|im_start|>
{'bos_id': 128000, 'eos_id': 128003, 'chat_eos_id': 128003, 'version': 0, 'vocab_size': 128256, 'max_token_length': 192, 'chat_template': 6}
{'chat_message_start': '', 'chat_role_start': '<|im_start|>', 'chat_role_end': '\n', 'chat_message_end': '<|im_end|>\n', 'chat_generation_prompt': '<|im_start|>assistant\n', 'chat_extra_stop': '<|im_start|>'}
⭐ Created dllama_tokenizer_hermes.t
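The new `chat_extra_stop` key suggests generation should also halt when the model starts emitting `<|im_start|>` again, not only on the chat EOS token. A minimal sketch of such a stop check (`should_stop` is a hypothetical helper, not distributed-llama's actual code; the ids are taken from the converter output above):

```python
# Hypothetical sketch of a stop check using the values printed above:
# generation ends on the chat EOS id (128003, i.e. <|im_end|>) or when
# the decoded tail of the output contains the extra stop string.
CHAT_EOS_ID = 128003
CHAT_EXTRA_STOP = '<|im_start|>'

def should_stop(token_id, decoded_tail):
    return token_id == CHAT_EOS_ID or CHAT_EXTRA_STOP in decoded_tail

print(should_stop(128003, ''))             # True  (chat EOS token)
print(should_stop(42, '...<|im_start|>'))  # True  (extra stop string)
print(should_stop(42, 'hello'))            # False (keep generating)
```

A string-based extra stop like this also explains the truncated `<|im_end` tail in the earlier output: stopping on a string rather than a token id requires buffering enough decoded text to match it.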
It seems Hermes 2 works quite well.
Awesome stuff @b4rtaz!
This PR extends the tokenizer file format. Now it's possible to add the chat configuration to the tokenizer file.