Hello everyone,

I am trying to serve the TheBloke/Mistral-7B-Instruct-v0.1-GPTQ model. I'm currently serving it via jina, and internally the predictions are done as follows:
# Tokenize the prompt and move it to the GPU
input_ids = text_tokenizer.encode(prompt, return_tensors="pt").cuda()
num_input_tokens = input_ids.shape[1]

# Generate output with custom configuration
output_ids = text_model.generate(input_ids, **gen_config)

# Keep only the newly generated tokens and decode them
generated_ids = output_ids[:, num_input_tokens:]
output = text_tokenizer.decode(generated_ids[0], skip_special_tokens=True)
Here we generate the output tokens manually and then decode only the newly generated part.
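For context, the model and tokenizer are loaded along these lines (a simplified sketch; it assumes the standard transformers + auto-gptq loading path, and the exact arguments may differ from my real setup):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"

# Tokenizer and GPTQ-quantized weights; transformers dispatches to auto-gptq
text_tokenizer = AutoTokenizer.from_pretrained(model_id)
text_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")

# Generation settings used for the benchmark (illustrative values only)
gen_config = {"max_new_tokens": 128, "do_sample": False}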
For my test prompt, the performance is:
time taken: ~2.5 s
tokens/sec: ~6
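These numbers are measured around the generate() call itself, roughly like this (illustrative timing code, not the exact benchmark script):

import time

start = time.perf_counter()
output_ids = text_model.generate(input_ids, **gen_config)
elapsed = time.perf_counter() - start

# Count only the newly generated tokens
new_tokens = output_ids.shape[1] - input_ids.shape[1]
print(f"time taken: {elapsed:.2f}s, tokens/sec: {new_tokens / elapsed:.1f}")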
Now, when serving it via OpenLLM using the following command:
openllm start TheBloke/Mistral-7B-Instruct-v0.1-GPTQ --quantize gptq --backend pt
For the same test prompt, the performance is:
time taken: ~30 s
tokens/sec: ~0.5
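Here the timing is taken client-side around the HTTP request to the OpenLLM server, along these lines (the endpoint path and payload shape are assumptions based on the OpenLLM version I'm running and may differ for other versions):

import time
import requests

# Assumed request schema for the OpenLLM REST API
payload = {"prompt": prompt, "llm_config": {"max_new_tokens": 128}}

start = time.perf_counter()
resp = requests.post("http://localhost:3000/v1/generate", json=payload, timeout=120)
elapsed = time.perf_counter() - start

print(f"time taken: {elapsed:.2f}s")
print(resp.json())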
Environment:
OS: Ubuntu 20.04
GPU: NVIDIA T4 (16 GiB VRAM)
CUDA: 11.8
I want to understand why the difference is so big. Am I doing something wrong while serving with OpenLLM? Let me know your thoughts.
Thanks