AmgadHasan opened this issue 10 months ago
Hi,
Can you share the code you are using to run the model with Hugging Face Transformers?
Also, what parameters do you set when converting the model and then loading it with CTranslate2?
Sure. Here's the code.

HF Transformers:
import transformers
import torch
import time

# model_id, llama2_chat_prompt_template and transcript are defined elsewhere (not shown)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
chatbot = transformers.pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
chatbot.pad_token_id = tokenizer.eos_token_id

# Warm up the model
chatbot(
    "Who is the president of the US?",
    do_sample=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=10,
)

# Run speed test
input = llama2_chat_prompt_template.format(transcript=transcript)
start = time.time()
sequences = chatbot(
    [input, input, input],
    do_sample=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=1200,
    batch_size=5,
)
end = time.time()
t = end - start
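Side note: the Transformers run above and the CTranslate2 run below use different max_length values (1200 vs. 512), so it can help to normalize the measured latency to tokens per second. A minimal sketch, assuming the standard text-generation pipeline output format (one list per prompt, each holding dicts with a "generated_text" key) and the input, sequences, and t variables from the run above:

# Sketch: normalize the measured latency to tokens per second.
# Assumes the default pipeline output, where "generated_text" includes the prompt,
# so the prompt length is subtracted from each output.
prompt_len = len(tokenizer(input)["input_ids"])
generated_tokens = sum(
    len(tokenizer(out[0]["generated_text"])["input_ids"]) - prompt_len
    for out in sequences
)
print("Generated tokens:", generated_tokens)
print("Tokens/sec:", generated_tokens / t)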
For CTranslate2:
import ctranslate2
import transformers
import time
# Load and warmup the model
start = time.time()
generator = ctranslate2.Generator("/content/Llama-2-13B-Chat-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-fp16")
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Who is the president of the US?"))
results = generator.generate_batch([tokens], max_length=5, include_prompt_in_result=False)
output = tokenizer.decode(results[0].sequences_ids[0])
end = time.time()
t = end - start
print("Time:\t", t)
print(output)
# Run speed test
input = llama2_chat_prompt_template.format(transcript=transcript)
start = time.time()
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input))
results = generator.generate_batch([tokens], max_length=512, include_prompt_in_result=False)
output = tokenizer.decode(results[0].sequences_ids[0])
end = time.time()
t = end - start
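Note that the Transformers benchmark above generates three copies of the prompt in one batch, while this run generates a single sequence. A sketch of the equivalent batched call with CTranslate2 (reuses the generator, tokenizer, and tokens from above; max_batch_size is optional):

# Sketch: batched generation mirroring the 3-copy batch used in the
# Transformers benchmark above.
batch = [tokens, tokens, tokens]
results = generator.generate_batch(
    batch,
    max_batch_size=5,
    max_length=512,
    include_prompt_in_result=False,
)
outputs = [tokenizer.decode(r.sequences_ids[0]) for r in results]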
Conversion script:
# Convert to float16 (no 8-bit quantization)
import time
import os
start = time.time()
os.system("ct2-transformers-converter --model TheBloke/Llama-2-13B-Chat-fp16 --quantization float16 --output_dir Llama-2-13B-Chat-ct2")
end = time.time()
t = end - start
print("Time:\t", t)
Hi.
I have been doing some benchmarks on an NVIDIA V100 32GB GPU.
First, I benchmarked Llama-2-7B-chat using Hugging Face Transformers and CTranslate2, and I saw reduced latency when using CTranslate2 (12 seconds vs. 7.5 seconds, respectively).
However, when I tried the 13B version, I didn't see any improvement in latency at all (18 seconds vs. 18 seconds), although there is a small reduction in VRAM usage.
Why is this happening? Did I do something wrong?
This is the code that I am using: