Openchat has finetunes for both llama2 and llama3, so I've just added openchat3, which should work with their llama 3 finetune, whereas openchat should work with their llama 2 finetune.
I was able to successfully test openchat-3.6-8b, and it worked correctly with the openchat3 chat template: https://huggingface.co/openchat/openchat-3.6-8b-20240522. I converted the model using convert-hf.py and the tokenizer with convert-tokenizer-llama3.py.
```
./dllama-api.exe --model D:\openchat-3.6-8b-20240522-distributed\dllama_model_openchat-3.6-8b-20240522_q40.m --tokenizer D:\openchat-3.6-8b-20240522-distributed\dllama_tokenizer_llama3.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --chat-template openchat3 --port 10111
arch: llama
hiddenAct: silu
dim: 4096
hiddenDim: 14336
nLayers: 32
nHeads: 32
nKvHeads: 8
vocabSize: 128256
seqLen: 8192
nSlices: 1
ropeTheta: 500000.0
bosId: 128000
eosId: 128001
ropeCache: 131072 kB
Loaded 1981264 kB
Listening on 0.0.0.0:10111...
Server URL: http://127.0.0.1:10111/v1/
POST /v1/chat/completions
In gardens of beauty, roses stand tall, Their vibrant hues, a sight to behold. With petals of passion, they charm all, And whispers of love, they
```
(The Windows terminal can't render the emoji in dllama's log output correctly, hence the odd characters.)
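The endpoint can then be exercised like any OpenAI-compatible chat API. A minimal sketch in Python, assuming dllama-api follows the usual chat-completions schema (the model name and max_tokens value here are placeholders):

```python
import requests

# POST a chat request; the server applies the chat template (openchat3 here)
# on its side, so we only send plain role/content messages.
resp = requests.post(
    'http://127.0.0.1:10111/v1/chat/completions',
    json={
        'model': 'dllama',  # placeholder; the server runs a single model
        'messages': [
            {'role': 'user', 'content': 'Write a short poem about roses.'},
        ],
        'max_tokens': 128,
    },
)
print(resp.json()['choices'][0]['message']['content'])
```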
I was not able to test openchat-3.5: although I could convert the model using convert-hf.py, I could not convert the tokenizer. I don't think we have support for Mistral yet, but I will try to test Mixtral with the chatml template, or use a finetune of llama3 that uses the chatml template.
https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-8B uses the chatml template, so I will test with that.
I am having the weirdest issue: if I run Hermes-2-Theta-Llama-3-8B using the converted llama 3 tokenizer, it works fine, although it is missing some tokens since the tokenizer comes from a different model of the same architecture. Hermes-2-Theta-Llama-3-8B doesn't ship a tokenizer.model file, so I was in a bit of a jam as to what to do.
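To see exactly which tokens the reused llama 3 tokenizer is missing, comparing the added-token sets of the two checkpoints should work. A sketch, assuming meta-llama/Meta-Llama-3-8B as the base (that repo is gated, so it requires access):

```python
from transformers import AutoTokenizer

# Compare the special/added tokens of the base Llama 3 tokenizer with the
# Hermes finetune to see what reusing the base tokenizer misses.
base = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B')
hermes = AutoTokenizer.from_pretrained('NousResearch/Hermes-2-Theta-Llama-3-8B')

base_special = {str(t) for t in base.added_tokens_decoder.values()}
hermes_special = {str(t) for t in hermes.added_tokens_decoder.values()}

print('Only in Hermes:', sorted(hermes_special - base_special))
print('Vocab sizes:', base.vocab_size, hermes.vocab_size)
```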
I put together a convert-tokenizer-hf.py script that's meant to do the same as convert-tokenizer-llama3.py, except it uses transformers' AutoTokenizer to pull all the necessary data to build the tokenizer.t file.
But I think I am doing something wrong, as when I run dllama with the generated tokenizer.t file, it crashes when encoding text.
convert-tokenizer-hf.py
```python
import sys
import struct
from transformers import AutoTokenizer

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('Invalid usage')
        exit(1)
    tokenizer_path = sys.argv[1]
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

    bos_token = tokenizer.bos_token
    eos_token = tokenizer.eos_token
    bosId = tokenizer.convert_tokens_to_ids(bos_token)
    eosId = tokenizer.convert_tokens_to_ids(eos_token)

    tokens = []
    scores = []

    # Base vocabulary: the token id doubles as the score
    for token_id in range(tokenizer.vocab_size):
        token = tokenizer.convert_ids_to_tokens(token_id)
        token_bytes = token.encode('utf-8')
        tokens.append(token_bytes)
        scores.append(float(token_id))

    # Special (added) tokens, appended after the base vocabulary
    special_tokens = tokenizer.added_tokens_decoder
    special_token_index = tokenizer.vocab_size
    for token_id in special_tokens:
        token = tokenizer.convert_ids_to_tokens(token_id)
        token_bytes = token.encode('utf-8')
        tokens.append(token_bytes)
        scores.append(float(special_token_index))
        special_token_index += 1

    vocab_size = len(tokens)
    max_token_length = max(len(t) for t in tokens)

    with open('dllama_tokenizer_llama3.t', 'wb') as outputFile:
        # Header: magic|n_words|max_token_length|bos_id|eos_id|pad_id
        outputFile.write(struct.pack('IIIiii',
            0x567123,
            vocab_size,
            max_token_length,
            bosId,
            eosId,
            -1))
        # Dictionary: score and byte length, followed by the token bytes
        for i in range(vocab_size):
            outputFile.write(struct.pack('fI', scores[i], len(tokens[i])))
            outputFile.write(tokens[i])

    print(f'maxTokenLength={max_token_length}')
    print(f'bosId={bosId}')
    print(f'eosId={eosId}')
    print(f'vocabSize={vocab_size}')
```
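As a quick sanity check that the file itself is well-formed (a debugging sketch of mine, not something in the repo), the header and first few entries can be read back with the same struct layout the writer uses:

```python
import struct

# Read back the header: magic|n_words|max_token_length|bos_id|eos_id|pad_id
with open('dllama_tokenizer_llama3.t', 'rb') as f:
    header = struct.unpack('IIIiii', f.read(struct.calcsize('IIIiii')))
    magic, vocab_size, max_token_length, bos_id, eos_id, pad_id = header
    assert magic == 0x567123, 'bad magic'
    print(f'vocabSize={vocab_size} maxTokenLength={max_token_length} '
          f'bosId={bos_id} eosId={eos_id} padId={pad_id}')
    # Dump the first few dictionary entries (score, token bytes)
    for _ in range(5):
        score, length = struct.unpack('fI', f.read(struct.calcsize('fI')))
        print(score, f.read(length))
```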
I'm wondering if this is a good direction. For sure, the source code should not include all possible templates; maybe this is something that should be moved to the tokenizer file.
Basically, the tokenizer now contains: `magic|n_words|max_token_length|bos_id|eos_id|pad_id|<dictionary>`. But we could add new optional fields like:
- `chat_role_start` (llama3: `<|start_header_id|>`)
- `chat_role_end` (llama3: `<|end_header_id|>`)
- `chat_eos` (llama3: `<|eot_id|>`)
So this design assumes there may be differences between chat modes.
In the end, the converter would be responsible for setting the correct values, so this would not be a responsibility of DL.
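For example, the converter could append the chat strings right after the dictionary. A sketch of one possible layout (a field count followed by length-prefixed UTF-8 strings; nothing here is a decided format):

```python
import struct

# Hypothetical extension: after the <dictionary>, append the number of chat
# fields, then each field as a length-prefixed UTF-8 string.
chat_fields = [
    '<|start_header_id|>',  # chat_role_start (llama3)
    '<|end_header_id|>',    # chat_role_end (llama3)
    '<|eot_id|>',           # chat_eos (llama3)
]
with open('dllama_tokenizer_llama3.t', 'ab') as f:
    f.write(struct.pack('I', len(chat_fields)))
    for field in chat_fields:
        data = field.encode('utf-8')
        f.write(struct.pack('I', len(data)))
        f.write(data)
```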
WDYT?
I've tried to type a reply twice but keep getting a blue screen just as I'm about to send :/
Converting the tokenizer is very quick, so in the long run it's probably good to use that route. I just wanted to add a few of the common chat templates, i.e. llama 2, llama 3 and chatml, as that already covers the majority of models.
The bigger issue I have is with the script I showed above: I cannot create tokenizers for some models because they do not have a tokenizer.model file. I tried creating something to convert using AutoTokenizer, but the converted tokenizer doesn't work for some reason; dllama errors out at tokenizer.cpp line 202, for instance with this model: https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-8B
@DifferentialityDevelopment please check this PR. This may solve the problem for different models.
@DifferentialityDevelopment this would probably require updating the tokenizer in your repository on Hugging Face. Please don't do this until the PR is merged. Later, I want to test a different model.
@DifferentialityDevelopment can you update the tokenizer file in your HF repository to the new format?
I've added a few of the most common chat templates, namely llama2, llama3, chatml and openchat. This should make a lot more models compatible with distributed-llama's API.
I've also added an argument to AppArgs that lets you specify the model's chat template at startup instead of on a per-request basis.
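For reference, the chatml template wraps each message in the standard ChatML tags, roughly like this (the exact whitespace handling in distributed-llama's implementation may differ):

```python
def apply_chatml(messages):
    # Standard ChatML framing: <|im_start|>role\ncontent<|im_end|>
    prompt = ''
    for m in messages:
        prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    # A trailing assistant header cues the model to generate its reply.
    prompt += '<|im_start|>assistant\n'
    return prompt

print(apply_chatml([
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Hello!'},
]))
```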