OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

Weird behavior on V100 32GB #1431

Open · AmgadHasan opened this issue 10 months ago

AmgadHasan commented 10 months ago

Hi.

I have been doing some benchmarks on an NVIDIA V100 32GB GPU.

First, I benchmarked Llama-2-7B-chat using Hugging Face Transformers and CTranslate2, and saw reduced latency with CTranslate2 (12 seconds vs. 7.5 seconds, respectively).

However, when I tried the 13B version, I didn't see any improvement in latency at all (18 seconds vs. 18 seconds), although there was a small reduction in VRAM usage.

Why is this happening? Did I do something wrong?

This is the code that I am using:

import time

# `tokenizer`, `generator`, `llama2_chat_prompt_template` and `transcript`
# are defined elsewhere (see the full scripts below)
input = llama2_chat_prompt_template.format(transcript=transcript)

start = time.time()

# Tokenize the prompt, run generation, and decode the first returned sequence
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input))
results = generator.generate_batch([tokens], max_length=512, include_prompt_in_result=False)
output = tokenizer.decode(results[0].sequences_ids[0])

end = time.time()

t = end - start

print(f"GPU:\tV100\nTime(s):\t{t}\nResult: {output}")
guillaumekln commented 10 months ago

Hi,

Can you share the code you are using to run the model with Hugging Face Transformers?

Also, what parameters do you set when converting the model and then loading it with CTranslate2?

AmgadHasan commented 10 months ago

> Hi,
>
> Can you share the code you are using to run the model with Hugging Face Transformers?
>
> Also, what parameters do you set when converting the model and then loading it with CTranslate2?

Sure.

Here's the code.

HF Transformers:

import transformers
import torch
import time

model_id = "TheBloke/Llama-2-13B-Chat-fp16"  # assumed: same checkpoint as in the CTranslate2 run below

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
chatbot = transformers.pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

chatbot.pad_token_id = tokenizer.eos_token_id

# Warm up the model
chatbot(
    "Who is the president of the US?",
    do_sample=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=10,
)

# Timed run: three copies of the same prompt in a single batch
input = llama2_chat_prompt_template.format(transcript=transcript)
start = time.time()
sequences = chatbot(
    [input, input, input],
    do_sample=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=1200,
    batch_size=5,
)
end = time.time()
t = end - start
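
For completeness, a rough sketch of counting the tokens produced by the timed pipeline call, so the two backends can be compared per token as well; note this run batches three copies of the prompt with max_length=1200, while the CTranslate2 snippet below generates a single sequence with max_length=512:

# Sketch: token accounting for the timed pipeline call above
# (the pipeline returns one list of dicts per input; "generated_text" includes the prompt by default)
generated_texts = [out[0]["generated_text"] for out in sequences]
total_tokens = sum(len(tokenizer.encode(text)) for text in generated_texts)
print(f"Time(s):\t{t}")
print(f"Total tokens (prompt + completion):\t{total_tokens}")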

For CTranslate2:

import ctranslate2
import transformers
import time

# Load and warm up the model
start = time.time()
generator = ctranslate2.Generator("/content/Llama-2-13B-Chat-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-fp16")
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Who is the president of the US?"))
results = generator.generate_batch([tokens], max_length=5, include_prompt_in_result=False)
output = tokenizer.decode(results[0].sequences_ids[0])
end = time.time()
t = end - start
print("Time:\t", t)
print(output)

# Run speed test
input = llama2_chat_prompt_template.format(transcript=transcript)

start = time.time()

tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input))
results = generator.generate_batch([tokens], max_length=512, include_prompt_in_result=False)
output = tokenizer.decode(results[0].sequences_ids[0])

end = time.time()

t = end - start

Conversion script:

# Conversion with float16 weights (no int8 quantization)
import time
import os
start = time.time()
os.system("ct2-transformers-converter --model TheBloke/Llama-2-13B-Chat-fp16 --quantization float16 --output_dir Llama-2-13B-Chat-ct2")
end = time.time()
t = end - start
print("Time:\t", t)