Hello everyone,

I am trying to serve the TheBloke/Mistral-7B-Instruct-v0.1-GPTQ model. I'm currently serving it via jina, and internally the predictions are done as follows:
# Tokenize the prompt and move it to the GPU
input_ids = text_tokenizer.encode(prompt, return_tensors="pt").cuda()
num_input_tokens = input_ids.shape[1]

# Generate output with custom configuration
output_ids = text_model.generate(input_ids, **gen_config)

# Keep only the newly generated tokens and decode them
generated_ids = output_ids[:, num_input_tokens:]
output = text_tokenizer.decode(generated_ids[0], skip_special_tokens=True)
Here we generate the output tokens manually and then decode only the newly generated part.
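For context, the model and tokenizer are loaded along these lines (a simplified sketch; it assumes the standard transformers + auto-gptq loading path, and the exact arguments may differ from my real setup):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"

# Tokenizer and GPTQ-quantized weights; transformers dispatches to auto-gptq
text_tokenizer = AutoTokenizer.from_pretrained(model_id)
text_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")

# Generation settings used for the benchmark (illustrative values only)
gen_config = {"max_new_tokens": 128, "do_sample": False}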
For my test prompt, the performance is:
time taken: ~2.5 s
tokens/sec: ~6
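These numbers are measured around the generate() call itself, roughly like this (illustrative timing code, not the exact benchmark script):

import time

start = time.perf_counter()
output_ids = text_model.generate(input_ids, **gen_config)
elapsed = time.perf_counter() - start

# Count only the newly generated tokens
new_tokens = output_ids.shape[1] - input_ids.shape[1]
print(f"time taken: {elapsed:.2f}s, tokens/sec: {new_tokens / elapsed:.1f}")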
Now, when serving it via OpenLLM using the following command:
openllm start TheBloke/Mistral-7B-Instruct-v0.1-GPTQ --quantize gptq --backend pt
For the same test prompt, the performance is:
time taken: ~30 s
tokens/sec: ~0.5
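Here the timing is taken client-side around the HTTP request to the OpenLLM server, along these lines (the endpoint path and payload shape are assumptions based on the OpenLLM version I'm running and may differ for other versions):

import time
import requests

# Assumed request schema for the OpenLLM REST API
payload = {"prompt": prompt, "llm_config": {"max_new_tokens": 128}}

start = time.perf_counter()
resp = requests.post("http://localhost:3000/v1/generate", json=payload, timeout=120)
elapsed = time.perf_counter() - start

print(f"time taken: {elapsed:.2f}s")
print(resp.json())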
Environment:
OS: Ubuntu 20.04
GPU: NVIDIA T4 (16 GiB VRAM)
CUDA: 11.8
I want to understand why the difference is so big. Am I doing something wrong while serving with OpenLLM? Let me know your thoughts.
Thanks