Finally was able to solve the issue. Here are the steps to run the new Llama3 model.
Get the model from here: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct. Within the llama3 files there's a folder named `original`. Go into it and copy `params.json` and `tokenizer.model` into the parent folder. Forget about copying `consolidated.00.pth`.
Convert the model to the ctranslate2 format. You can do this one of two ways:
If you correctly install PySide6 and run the script, you should see an easy-to-understand GUI that will convert any compatible model. You only need to select the folder containing the model files.
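Alternatively, CTranslate2's stock converter does the same job; here's a sketch using the Python API (the output directory name and `int8` quantization are my own choices, and Llama 3 support requires a recent enough ctranslate2):

```python
import ctranslate2

# Point the converter at the folder prepared above and write the CTranslate2 model.
converter = ctranslate2.converters.TransformersConverter("Meta-Llama-3-8B-Instruct")
converter.convert("llama3-ct2", quantization="int8")
```

The same thing is available on the command line as `ct2-transformers-converter --model Meta-Llama-3-8B-Instruct --output_dir llama3-ct2 --quantization int8`.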
Either way, the `config.json` file created during the conversion process will have a key/value pair that will BORK your script. Specifically, it will specify `null` for the `"unk_token"`. THIS IS IMPROPER. Change it to anything else, but preferably something descriptive.
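For example, a small patch script (the `"<unk>"` value here is arbitrary; any non-null string will do, and the model directory name is assumed from the conversion step above):

```python
import json
from pathlib import Path

config_path = Path("llama3-ct2") / "config.json"
config = json.loads(config_path.read_text())

# Replace the null unk_token with any non-null placeholder string.
config["unk_token"] = "<unk>"
config_path.write_text(json.dumps(config, indent=2))
```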
The llama3 prompt format royally blows, but once you get it, you can finally hardcode it and forget about it: `prompt = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>\n{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"`
Note, this does NOT entail a multi-turn conversation with memory. For that, you'll need to consult the llama3 github repository for how to continue using the prompt format. More importantly, you'll have to construct a ctranslate2 script that can utilize the prompting format as well as manage "memory..." @guillikam created a basic script with "memory," albeit for llama2, which is in the docs. To reiterate, MY EXAMPLE IS ONLY for a single-turn question, i.e. for RAG-based applications.
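For the curious, extending the single-turn format to multiple turns might look something like this hypothetical helper (not the official llama3 recipe; check their repository for the canonical format):

```python
def build_prompt(system_message, turns):
    """Build a llama3-style prompt from prior (user, assistant) turns.

    `turns` is a list of (user_message, assistant_reply) tuples; pass None
    as the reply for the final, unanswered user message.
    """
    prompt = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n{system_message}<|eot_id|>"
    for user_message, assistant_reply in turns:
        prompt += f"<|start_header_id|>user<|end_header_id|>\n{user_message}<|eot_id|>"
        if assistant_reply is not None:
            prompt += f"<|start_header_id|>assistant<|end_header_id|>\n{assistant_reply}<|eot_id|>"
    return prompt + "<|start_header_id|>assistant<|end_header_id|>\n"
```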
"system_message"
can be anything, but I like "You are a helpful assistant who answers questions in a succinct fashion based on the contexts given to you"
"user_message"
can be anything.I've complained before about the dearth of examples on how to use stereotypical "chat" models. The only helpful example from the "Docs" was for Falcon, which I adapted. Anyhow, I'll provide my full script below for the benefit of the community, but before that, it's helpful to understand a few things:
`tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")` relies on the `AutoTokenizer` class within the `transformers` library to determine the correct tokenizer to use. Once the tokenizer is instantiated, `transformers` does have a method named `apply_chat_template` that allows you to apply the prompt formatting to one or more messages to/from llama3. However, for my purposes I like the hardcoded prompt format so I can see it, but primarily because I only need a single answer. Just be aware that `transformers` offers that method if you want to build a chat session with memory.
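In case it helps, here's roughly what that looks like (the messages here are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What does CTranslate2 do?"},
]

# tokenize=False returns the formatted prompt string instead of token ids;
# add_generation_prompt=True appends the assistant header so the model answers next.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```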
`AutoTokenizer` is also different from using `sentencepiece`, which is what's used in the llama2 example in the "Docs" for this repository.

The line `tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))` is critical because the "generator" within `ctranslate2` requires a different format than if you use the `transformers` library directly. In other words, `AutoTokenizer` automatically selects the tokenizer to use, which, in turn, prepares the data in the format that a model run with `transformers` needs. However, since we converted the model to the `ctranslate2` format, this line is absolutely necessary. If we used `sentencepiece`, for example, this wouldn't be necessary, which is why a lot of the older `ctranslate2` examples use `sentencepiece`, I presume. Anyhow, just understand this.
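Concretely, the round trip looks like this (the token strings in the comment are illustrative):

```python
# transformers works with integer token ids...
ids = tokenizer.encode(prompt)

# ...but CTranslate2's Generator expects the corresponding token *strings*,
# e.g. ['<|begin_of_text|>', '<|start_header_id|>', 'system', ...].
tokens = tokenizer.convert_ids_to_tokens(ids)
```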
I used the `generate_batch` method instead of `generate_tokens`. You might want to use the latter if, for example, you are building a chatbot and/or simply want tokens streamed back to you and displayed as they arrive. I did this because (1) `generate_tokens` does not allow for the `beam_size` parameter, which I wanted to test, and (2) my RAG use case involves relatively short responses. For longer responses you might consider `generate_tokens` to improve the user experience.
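A sketch of what that streaming loop might look like (it reuses the `generator`, `tokenizer`, and `tokens` set up as in the full script below; `max_length` is arbitrary):

```python
# generate_tokens yields one step result at a time; sampling_topk=1 keeps it greedy.
for step in generator.generate_tokens(tokens, max_length=512, sampling_topk=1):
    if step.token == "<|eot_id|>":
        break
    print(tokenizer.decode([step.token_id]), end="", flush=True)
```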
When using the `generate_batch` method, you MUST, MUST, MUST (and did I forget to mention... "MUST") use the `end_token` parameter and set it to `"<|eot_id|>"`, and then use `return_end_token=False`. If you don't, the LLM will talk to itself until it reaches the token limit.
Since `ctranslate2` doesn't have a `do_sample` parameter like `transformers` that can be set to true or false, my other parameters try to mimic this greedy/deterministic approach, which works great for RAG. Just be aware... Without further ado, here is a sample script. I have put placeholders in all caps for personal information as well as things that depend on your use case.
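A minimal single-turn sketch consistent with everything above (the model path, device, `max_length`, and sampling parameters are placeholders/assumptions to adapt to your setup):

```python
import ctranslate2
from transformers import AutoTokenizer

MODEL_DIR = "PATH_TO_YOUR_CONVERTED_LLAMA3_FOLDER"

system_message = "You are a helpful assistant who answers questions in a succinct fashion based on the contexts given to you"
user_message = "YOUR QUESTION OR RAG CONTEXTS HERE"

# Hardcoded llama3 single-turn prompt format.
prompt = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>\n{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"

generator = ctranslate2.Generator(MODEL_DIR, device="cuda")  # or device="cpu"
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# CTranslate2 expects token strings, not ids (see above).
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch(
    [tokens],
    max_length=512,
    beam_size=1,                  # raise this to experiment with beam search
    sampling_topk=1,              # greedy, standing in for do_sample=False
    end_token="<|eot_id|>",       # MUST be set, or the model talks to itself
    return_end_token=False,
    include_prompt_in_result=False,
)

print(tokenizer.decode(results[0].sequences_ids[0]))
```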
I'd love to hear from true experts on ctranslate2 teaching me the proper way to convert llama3 or otherwise use it... I'm sure my novice understanding of python leads to an excessive amount of time spent on running something basic. Thanks!
I pushed MR #1671 to fix Llama 3. The only problem was with the `unk_token`, because it does not exist in the config of Llama 3. I also added a `chat.py` script, like the one for Llama 2, as an example of using the template for Llama 3.
Deleting my initial message because it contained paths on my computer, and my follow-up post below addresses the error I was getting anyway!