dreamiter opened this issue 1 month ago
Hi @frankfliu - would you be able to help? Thanks.
We are planning a release that will use vllm 0.6.0 (or 0.6.1.post2) soon.
In the meantime, you can try providing a requirements.txt file with vllm==0.5.5 (or later version) to get around this.
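For example, a minimal sketch of that workaround, assuming the usual LMI layout where `requirements.txt` sits in the model directory next to `serving.properties`:

```
# requirements.txt
vllm==0.5.5
```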
Thank you @siddvenk for your suggestions.
I tried rebuilding a custom image by running `pip install vllm==0.5.5` in a Dockerfile, based on your latest stable image `763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124`.
We specified the following in our `serving.properties` file:
```
option.model_id=unsloth/mistral-7b-instruct-v0.3
option.engine=Python
option.rolling_batch=vllm
option.tensor_parallel_degree=1
option.enable_lora=true
option.gpu_memory_utilization=0.95
option.max_model_len=16000
option.max_lora_rank=128
```
We tried setting `max_tokens` to a really high number, but the response is still very short.
We also get this log, and it appears the vLLM backend does not support the `max_tokens` param:
```
The following parameters are not supported by vllm with rolling batch: {'logprobs', 'temperature', 'seed', 'max_tokens'}. The supported parameters are set()
```
Do you have any insights?
Yes, you should use `max_new_tokens`.
You can find the schema for our inference API here: https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/user_guides/lmi_input_output_schema.md.
We also support the OpenAI chat completions schema; details here: https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/user_guides/chat_input_output_schema.md.
Thanks again for your quick response @siddvenk -
Just want to make sure, should we:
- add `max_new_tokens` to the `serving.properties` file, e.g. `option.max_new_tokens=16000`, or
- pass `max_new_tokens` as a parameter when invoking the endpoint, such as:
```bash
curl -X POST https://my.sample.endpoint.com/invocations \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "What is Deep Learning?",
    "parameters": {
      "do_sample": true,
      "max_new_tokens": 16000,
      "details": true
    },
    "stream": true
  }'
```
BTW, forgot to mention, we are deploying this to SageMaker.
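For reference, the same request body works against a SageMaker endpoint; a minimal sketch using the AWS CLI, assuming a placeholder endpoint name `my-lmi-endpoint`:

```bash
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name my-lmi-endpoint \
  --content-type application/json \
  --cli-binary-format raw-in-base64-out \
  --body '{"inputs": "What is Deep Learning?", "parameters": {"do_sample": true, "max_new_tokens": 16000, "details": true}}' \
  output.json

cat output.json
```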
There are two different configurations.

On a per-request basis, you can specify `max_new_tokens` to limit the number of generated tokens. This is just a limit on the output, not on the total sequence length.

You can limit the maximum length of sequences globally by setting `option.max_model_len` in `serving.properties`. This enforces a limit that applies to all requests and includes both the input (prompt) tokens and the generated output tokens.
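To make the difference concrete (numbers are only illustrative): with `option.max_model_len=16000`, a 1,000-token prompt leaves room for at most roughly 15,000 generated tokens no matter how large `max_new_tokens` is, while the per-request cap lives in the payload itself:

```json
{
  "inputs": "What is Deep Learning?",
  "parameters": {
    "max_new_tokens": 512
  }
}
```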
Thanks, @siddvenk .
We did more tests and it turns out the "short response token" issue was only specific to the custom image I built (mentioned above).
So we suspect we missed some key steps when building the image - can you help us review our process?
Steps:

1. Create a folder containing the following files:

   ```
   |- Dockerfile
   |- requirements.txt
   ```

2. In `Dockerfile`:
   ```dockerfile
   FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124

   COPY ./requirements.txt /opt/requirements.txt

   RUN pip install --upgrade pip && \
       pip install awscli --trusted-host pypi.org --trusted-host files.pythonhosted.org && \
       pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org -r /opt/requirements.txt
   ```
3. In `requirements.txt`:

   ```
   vllm==0.5.5
   ```
4. Build the new docker image using `docker build`
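For completeness, a rough sketch of the build-and-push step, assuming a hypothetical ECR repository `djl-lmi-vllm` in your own account (account ID, region, and tag are placeholders):

```bash
# Log in to the DJL LMI registry so the base image in FROM can be pulled
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com

# Log in to your own ECR registry (replace <account-id> with your AWS account ID)
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com

# Build and push the custom image; SageMaker pulls it from your ECR repository
docker build -t <account-id>.dkr.ecr.us-east-1.amazonaws.com/djl-lmi-vllm:0.29.0-vllm0.5.5 .
docker push <account-id>.dkr.ecr.us-east-1.amazonaws.com/djl-lmi-vllm:0.29.0-vllm0.5.5
```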
Description
In the current version (using the LMI SageMaker image), we are running into the following error:

Looks like the above error was fixed in vLLM v0.5.5. See release notes here: https://github.com/vllm-project/vllm/releases/tag/v0.5.5 See PR here: https://github.com/vllm-project/vllm/pull/7146
References
N/A