OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

Extremely slow generation speed for llama 2 70B chat model #1388

Open k21993 opened 1 year ago

k21993 commented 1 year ago

I benchmarked llama 2 7B chat (int8) and got ~600 tokens in about 12s on an A100 GPU, whereas the HF pipeline takes about 25s for the same input and parameters.

However, when I try the llama v2 70B chat model (int8), it's extremely slow: ~90s for 500 tokens vs. the HF pipeline, which takes ~32s (although the pipeline uses multiple GPUs, so it's not a fair comparison?). Is this expected or am I doing something wrong?

Here's my code:

import time

import ctranslate2
import transformers

CT2_INT8_MODEL_CKPT_LLAMA_7B = "llama-2-7b-chat-ct2"
CT2_INT8_MODEL_CKPT_LLAMA_70B = "llama-2-70b-chat-ct2"

# LLAMA_PATH_7B is the path to the original HF Llama 2 checkpoint (defined elsewhere);
# it is only used to load the tokenizer.
generator = ctranslate2.Generator(CT2_INT8_MODEL_CKPT_LLAMA_70B, device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained(LLAMA_PATH_7B)

def predict(prompt: str):
    """Generate text given a prompt."""
    start = time.perf_counter()
    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
    results = generator.generate_batch([tokens],
                                       sampling_temperature=0.8,
                                       sampling_topk=0,
                                       sampling_topp=1,
                                       max_length=1000,
                                       include_prompt_in_result=False)
    tokens = results[0].sequences_ids[0]
    output = tokenizer.decode(tokens)
    request_time = time.perf_counter() - start
    return {'tok_count': len(tokens),
            'time': request_time,
            'question': prompt,
            'answer': output,
            'note': 'CTranslate2 int8 quantization'}

print('benchmarking ctranslate2...\n')
time_taken = []
results = []

for _ in range(10):
    start = time.perf_counter()
    out = predict("explain rotary positional embeddings")
    print(out)
    results.append(out)
    request_time = time.perf_counter() - start
    time_taken.append(request_time)
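
For reference, a short summary step like this could be appended after the loop; it only uses the results and time_taken lists built above (illustrative only):

# Aggregate the benchmark runs collected above.
total_tokens = sum(r['tok_count'] for r in results)
total_time = sum(time_taken)
print(f"runs: {len(results)}")
print(f"avg latency: {total_time / len(time_taken):.2f}s")
print(f"throughput: {total_tokens / total_time:.1f} tokens/s")
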
guillaumekln commented 1 year ago

although the pipeline uses multiple GPUs, so it's not a fair comparison?

Well yes, using multiple GPUs will be faster.

For CTranslate2 you might also want to use int8_float16 instead of int8.
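
For example, the compute type can be selected when loading the already-converted model (a minimal sketch; the directory name is taken from the snippet above, and compute_type is the standard ctranslate2.Generator argument):

import ctranslate2

# int8_float16 keeps the weights in int8 but runs the non-quantized layers
# and accumulations in FP16 instead of FP32 on the GPU.
generator = ctranslate2.Generator(
    "llama-2-70b-chat-ct2",
    device="cuda",
    compute_type="int8_float16",
)

The model can also be converted directly with that quantization type (ct2-transformers-converter ... --quantization int8_float16), but overriding compute_type at load time is enough to compare the two.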