deepjavalibrary / djl-serving

A universal scalable machine learning model deployment solution

Upgrade to support latest vLLM version (max_lora_rank) #2389

Open · dreamiter opened this issue 1 month ago

Description

In the current version (using the LMI SageMaker image), we are running into the following error:

File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 1288, in __post_init__
raise ValueError(
ValueError: max_lora_rank (128) must be one of (8, 16, 32, 64)

It looks like the above error was fixed in vLLM v0.5.5. See the release notes here: https://github.com/vllm-project/vllm/releases/tag/v0.5.5 and the PR here: https://github.com/vllm-project/vllm/pull/7146

References

N/A

dreamiter commented 1 month ago

Hi @frankfliu - would you be able to help? Thanks.

siddvenk commented 1 month ago

We are planning a release that will use vllm 0.6.0 (or 0.6.1.post2) soon.

In the meantime, you can try providing a requirements.txt file with vllm==0.5.5 (or a later version) to work around this.
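For example, assuming the container installs a requirements.txt shipped alongside serving.properties in your model artifacts, the file could contain just:

```
vllm==0.5.5
```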

dreamiter commented 1 month ago

Thank you @siddvenk for your suggestions.

I tried building a custom image by running `pip install vllm==0.5.5` in a Dockerfile based on your latest stable image 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124.

We specified the following in the serving.properties file:

option.model_id=unsloth/mistral-7b-instruct-v0.3
option.engine=Python
option.rolling_batch=vllm
option.tensor_parallel_degree=1
option.enable_lora=true
option.gpu_memory_utilization=0.95
option.max_model_len=16000
option.max_lora_rank=128

We tried setting max_tokens to a very high number, but the response is still very short. We also get the following log, and it appears the vLLM rolling batch backend does not support the max_tokens parameter.

The following parameters are not supported by vllm with rolling batch: {'logprobs', 'temperature', 'seed', 'max_tokens'}. The supported parameters are set()

Do you have any insights?

siddvenk commented 1 month ago

Yes, you should use max_new_tokens.

You can find the schema for our inference API here: https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/user_guides/lmi_input_output_schema.md.

We also support the OpenAI chat completions schema; details are here: https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/user_guides/chat_input_output_schema.md.
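For what it's worth, a minimal sketch of a SageMaker invocation using the default LMI schema with max_new_tokens might look like this (the endpoint name and values are placeholders I made up, so adjust them for your deployment):

```python
import json

import boto3

# Hypothetical endpoint name; replace with your SageMaker endpoint.
ENDPOINT_NAME = "my-lmi-endpoint"

smr = boto3.client("sagemaker-runtime")

# Default LMI schema: "inputs" carries the prompt, "parameters" carries
# per-request generation options such as max_new_tokens (output-only limit).
payload = {
    "inputs": "Explain LoRA adapters in one paragraph.",
    "parameters": {"max_new_tokens": 512},
}

response = smr.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))
```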

dreamiter commented 1 month ago

Thanks again for your quick response @siddvenk -

Just want to make sure, should we:

dreamiter commented 1 month ago

BTW, forgot to mention, we are deploying this to SageMaker.

siddvenk commented 1 month ago

There are two different configurations.

On a per request basis, you can specify max_new_tokens to limit the number of generated tokens. This is just a limit on the output, not on the total sequence length.

You can limit the maximum length of sequences globally by setting option.max_model_len in serving.properties. This enforces a limit that applies to all requests, which includes both the input (prompt) tokens and generated output tokens.
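To make the interaction concrete (illustrative numbers only, not taken from this deployment), the effective generation budget for a single request is roughly the smaller of the per-request cap and whatever room the prompt leaves under the global limit:

```python
# Illustrative arithmetic only: vLLM enforces the total sequence limit
# (prompt tokens + generated tokens), so a long prompt shrinks the room
# left for generation regardless of max_new_tokens.
max_model_len = 16000    # global limit from option.max_model_len
prompt_tokens = 15200    # hypothetical prompt length
max_new_tokens = 2000    # hypothetical per-request cap

generation_budget = min(max_new_tokens, max_model_len - prompt_tokens)
print(generation_budget)  # 800: the response stops well before max_new_tokens
```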

dreamiter commented 1 month ago

Thanks, @siddvenk .

We did more testing, and it turns out the short-response issue was specific to the custom image I built (mentioned above).

So we suspect we missed some key steps when building the image. Can you help us review our process?

Steps:

  1. Create the following files:

     |- Dockerfile
     |- requirements.txt

  2. In `Dockerfile`:

         FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124

         # Copy files
         COPY ./requirements.txt /opt/requirements.txt

         # Install third-party Python dependencies within the Docker environment
         RUN pip install --upgrade pip && \
             pip install awscli --trusted-host pypi.org --trusted-host files.pythonhosted.org && \
             pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org -r /opt/requirements.txt

  3. In `requirements.txt`:

         vllm==0.5.5

  4. Build the new Docker image using `docker build`.
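In case it is useful, a rough sketch of how we then deploy the rebuilt image to SageMaker with the Python SDK is below; the image URI, IAM role, S3 path, and instance type are all placeholders, not values from this thread:

```python
import sagemaker
from sagemaker.model import Model

# Hypothetical values; substitute your own account, role, image, and artifacts.
role = "arn:aws:iam::123456789012:role/MySageMakerExecutionRole"
image_uri = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-custom-lmi:latest"
model_data = "s3://my-bucket/mistral-lora/model.tar.gz"  # contains serving.properties

model = Model(
    image_uri=image_uri,
    model_data=model_data,
    role=role,
    sagemaker_session=sagemaker.Session(),
)

# Deploy to a GPU instance; pick one that fits the model and LoRA adapters.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)
```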