b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
MIT License

Add additional chat templates to dllama-api #73

Closed DifferentialityDevelopment closed 1 month ago

DifferentialityDevelopment commented 1 month ago

I've added a few of the most common chat templates, namely llama2, llama3, chatml and openchat. This should make many more models compatible with distributed-llama's API.

I also added an argument to AppArgs that lets you specify the chat template used by the model at startup, instead of on a per-request basis.
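For illustration, here is roughly how two of these templates render a conversation; this is just a Python sketch and the exact strings in the implementation may differ slightly:

# Rough sketch of the llama3 and chatml chat formats; the exact strings used
# by dllama-api may differ slightly. The BOS token is normally prepended by
# the tokenizer, so it is omitted here.

def render_llama3(messages):
    out = ''
    for m in messages:
        out += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

def render_chatml(messages):
    out = ''
    for m in messages:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    out += "<|im_start|>assistant\n"
    return out

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Write a short poem about roses.'},
]
print(render_llama3(messages))
print(render_chatml(messages))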

DifferentialityDevelopment commented 1 month ago

Openchat has finetunes for llama2 and llama3, so I've just added openchat3, which should work with their llama 3 finetune, whereas openchat should work with their llama 2 finetune.

DifferentialityDevelopment commented 1 month ago

I was able to successfully test openchat-3.6-8b and it worked correctly with the openchat3 chat template: https://huggingface.co/openchat/openchat-3.6-8b-20240522. I converted the model using convert-hf.py and the tokenizer with convert-tokenizer-llama3.py.

./dllama-api.exe --model D:\openchat-3.6-8b-20240522-distributed\dllama_model_openchat-3.6-8b-20240522_q40.m --tokenizer D:\openchat-3.6-8b-20240522-distributed\dllama_tokenizer_llama3.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --chat-template openchat3 --port 10111
💡 arch: llama
💡 hiddenAct: silu
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 8192
💡 nSlices: 1
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
🕒 ropeCache: 131072 kB
⏩ Loaded 1981264 kB
Listening on 0.0.0.0:10111...
Server URL: http://127.0.0.1:10111/v1/
🔷 POST /v1/chat/completions
🔸In gardens of beauty, roses stand tall, Their vibrant hues, a sight to behold. With petals of passion, they charm all, And whispers of love, they🔶

(Excuse the garbled characters in the original paste; the Windows terminal can't render those symbols correctly.)

I was not able to test openchat-3.5: although I could convert the model using convert-hf.py, I could not convert the tokenizer. I don't think we have support for Mistral yet, but I will try to test Mixtral with the chatml template, or use a finetune of llama3 that uses the chatml template.

https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-8B uses the chatml template, so I will test with that.

DifferentialityDevelopment commented 1 month ago

I am having the weirdest issue: if I run Hermes-2-Theta-Llama-3-8B using the converted llama 3 tokenizer, it works fine, although it is missing some tokens since that tokenizer comes from a different model of the same architecture. Hermes-2-Theta-Llama-3-8B doesn't have a tokenizer.model file, so I was a bit in a jam as to what to do.

I put together a convert-tokenizer-hf.py script that's meant to do the same as convert-tokenizer-llama3.py, except it uses the transformers AutoTokenizer to pull all the data necessary to build the tokenizer.t file.

But I think I am doing something wrong, as when I run dllama with the generated tokenizer.t file it crashes while encoding text.

convert-tokenizer-hf.py

import sys
import struct
from transformers import AutoTokenizer

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('Invalid usage')
        exit(1)

    tokenizer_path = sys.argv[1]
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

    bos_token = tokenizer.bos_token
    eos_token = tokenizer.eos_token

    bosId = tokenizer.convert_tokens_to_ids(bos_token)
    eosId = tokenizer.convert_tokens_to_ids(eos_token)

    tokens = []
    scores = []

    # Regular vocabulary: the token id doubles as the score
    for token_id in range(tokenizer.vocab_size):
        token = tokenizer.convert_ids_to_tokens(token_id)
        token_bytes = token.encode('utf-8')
        tokens.append(token_bytes)
        scores.append(float(token_id))

    # Added/special tokens are appended after the regular vocabulary,
    # continuing the score sequence from vocab_size upwards
    special_tokens = tokenizer.added_tokens_decoder
    special_token_index = tokenizer.vocab_size
    for token_id in special_tokens:
        token = tokenizer.convert_ids_to_tokens(token_id)
        token_bytes = token.encode('utf-8')
        tokens.append(token_bytes)
        scores.append(float(special_token_index))
        special_token_index += 1

    vocab_size = len(tokens)
    max_token_length = max(len(t) for t in tokens)

    # Header: magic | vocab_size | max_token_length | bos_id | eos_id | pad_id (-1 = none)
    with open('dllama_tokenizer_llama3.t', 'wb') as outputFile:
        outputFile.write(struct.pack('IIIiii',
                                     0x567123,
                                     vocab_size,
                                     max_token_length,
                                     bosId,
                                     eosId,
                                     -1))

        # Dictionary: (float score, uint32 byte length) followed by the raw token bytes
        for i in range(vocab_size):
            outputFile.write(struct.pack('fI', scores[i], len(tokens[i])))
            outputFile.write(tokens[i])

        print(f'maxTokenLength={max_token_length}')
        print(f'bosId={bosId}')
        print(f'eosId={eosId}')
        print(f'vocabSize={vocab_size}')
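To debug a crash like the one above, one option is to read the generated file back and dump the header plus the first few dictionary entries. This is just a quick sanity-check sketch that mirrors the layout the script above writes; it is not the official dllama reader.

import struct

# Read back the header and the first few dictionary entries of a tokenizer
# file written by the script above:
# magic | vocab_size | max_token_length | bos_id | eos_id | pad_id, then
# (score: float, length: uint32, token bytes) per vocabulary entry.
with open('dllama_tokenizer_llama3.t', 'rb') as f:
    header = struct.unpack('IIIiii', f.read(struct.calcsize('IIIiii')))
    magic, vocab_size, max_token_length, bos_id, eos_id, pad_id = header
    print(f'magic=0x{magic:x} vocabSize={vocab_size} maxTokenLength={max_token_length}')
    print(f'bosId={bos_id} eosId={eos_id} padId={pad_id}')

    for i in range(5):
        score, length = struct.unpack('fI', f.read(struct.calcsize('fI')))
        token_bytes = f.read(length)
        print(i, score, token_bytes)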
b4rtaz commented 1 month ago

I'm wondering if this is a good direction. I mean for sure the source code should not include all possible templates. Maybe this is something that should be moved to the tokenizer file.

Basically, the tokenizer currently contains: magic|n_words|max_token_length|bos_id|eos_id|pad_id|<dictionary>. But we could add new optional fields, for example chat-template related ones.

So this design assumes there may be differences in the chat mode.

In the end, the converter would be responsible for setting the correct values, so this would not be a responsibility of DL itself.
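Just to illustrate the idea (the field names and layout below are hypothetical, not a final design): the converter could append optional, length-prefixed chat-related entries after the existing header, and dllama would skip any it doesn't recognize.

import struct

# Hypothetical sketch only: one possible way a converter could append optional
# chat-related fields (e.g. a chat EOS id and a chat template string) after
# the existing fixed header. Field names and layout are purely illustrative.
def write_optional_chat_fields(output_file, chat_eos_id, chat_template):
    template_bytes = chat_template.encode('utf-8')
    # an extra eos id used in chat mode, then a length-prefixed template string
    output_file.write(struct.pack('iI', chat_eos_id, len(template_bytes)))
    output_file.write(template_bytes)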

WDYT?

DifferentialityDevelopment commented 1 month ago

I've tried to type a reply twice but keep getting a blue screen just as I'm about to send :/

Converting the tokenizer is very quick, so in the long run it's probably good to go that route. I just wanted to add a few of the common chat templates, i.e. llama 2, llama 3 and chatml, as that already covers the majority of models.

The bigger issue I have is with the script I showed above. I cannot create tokenizers for some models because they do not have a tokenizer.model file, so I tried creating something to convert them using AutoTokenizer, but the converted tokenizer doesn't work for some reason; dllama errors out at tokenizer.cpp line 202, for instance with this model: https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-8B

b4rtaz commented 1 month ago

@DifferentialityDevelopment please check this PR. This may solve the problem for different models.

b4rtaz commented 1 month ago

@DifferentialityDevelopment this will probably require updating the tokenizer in your repository on HuggingFace. Please don't do this until the PR is merged. Later, I want to test a different model.

b4rtaz commented 1 month ago

@DifferentialityDevelopment can you update the tokenizer file in your HF repository to the new format?