MeetKai / functionary

Chat language model that can use tools and interpret the results
MIT License

Feature Request: Support for Additional vLLM Configuration Settings #213

Closed. Luffyzm3D2Y closed this issue 1 month ago.

Luffyzm3D2Y commented 3 months ago

Description

I would like to inquire whether there are any plans to support more configuration settings for vLLM, specifically those related to RoPE scaling and theta adjustments.

Background

vLLM currently provides configuration options such as --rope-scaling and --rope-theta, as described in its documentation:

  1. --rope-scaling: RoPE scaling configuration in JSON format. For example, {"type":"dynamic","factor":2.0}.
  2. --rope-theta: RoPE theta. Use with rope_scaling. In some cases, changing the RoPE theta improves the performance of the scaled model.
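
For reference, an invocation of vLLM's own OpenAI-compatible server with these flags might look roughly like the following (the model name and the theta value here are placeholders for illustration, not recommendations):

# Illustrative only: pass the RoPE settings directly to vLLM's OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server --model <your-model> --rope-scaling '{"type":"dynamic","factor":2.0}' --rope-theta 1000000.0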

Request

Would it be possible for server_vllm.py to accept these options and pass them through to vLLM? Thank you for considering this feature request. I look forward to your response.

jeffreymeetkai commented 3 months ago

Hi, thank you for your questions.

  1. Yes, there are currently plans to expand context lengths up to 32K using YaRN and/or Dynamic Scaling across vLLM/TGI/direct Transformers usage. For vLLM, we will need to evaluate vLLM's YaRN/Dynamic NTK Scaling/LongRoPE/etc. via the extended model's performance before we can officially include the configuration settings.
  2. Unfortunately, I cannot give a specific timeline, but I can assure you that we are working on it right now and it is one of the highest-priority tasks at the moment.
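
For illustration only: with a YaRN configuration along the lines of the one shown later in this thread, {"type": "yarn", "factor": 4.0, "original_max_position_embeddings": 8192}, the extended window would be roughly 4.0 x 8192 = 32768 tokens, i.e. the 32K target mentioned above. The exact factor and base length are assumptions for the example, not settings we have validated yet.
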
Luffyzm3D2Y commented 3 months ago

Hi,

Thank you for your detailed response and for sharing the plans for expanding context lengths using YaRN and/or Dynamic Scaling. I appreciate the insight into your current priorities and the efforts being made to evaluate these features.

Given the importance of extended model capabilities for my current work, I would greatly appreciate it if you could provide instructions on how to support RoPE scaling in vLLM. Having this information will allow me to proceed with my work, and perhaps I could contribute to the evaluation and development process.

Thank you again for your assistance.

jeffreymeetkai commented 3 months ago

I believe engine_args.rope_scaling and engine_args.rope_theta were only brought back/implemented in vLLM recently, and our pinned vLLM dependency is older. You can therefore try migrating to the latest vLLM version and testing server_vllm.py without grammar sampling, passing the --rope-scaling and/or --rope-theta command-line arguments.

I haven't tested whether migrating the vLLM version causes any problems, but the chances should be low without grammar sampling, since our vLLM monkey patch will not be used. Do let me know if you encounter any issues when migrating.
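
For what it's worth, a rough sketch of the suggested test might look like the following; the exact vLLM version and the RoPE values are assumptions for illustration, and whether the arguments pass through depends on the migration described above:

# upgrade the pinned vLLM dependency (version shown is illustrative)
pip install vllm==0.5.0
# start the server without grammar sampling, passing the RoPE arguments through
python server_vllm.py --model "meetkai/functionary-small-v2.5" --rope-scaling '{"type":"dynamic","factor":2.0}' --rope-theta 1000000.0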

Luffyzm3D2Y commented 3 months ago

@jeffreymeetkai Thank you for your reply. I just tested migrating the vLLM version to v0.5.0 and ran the following command:

CUDA_VISIBLE_DEVICES=1,2 python3 server_vllm.py --model "meetkai/functionary-small-v2.5" --host 0.0.0.0 --max-model-len 8192

The error information is below:

Traceback (most recent call last):
  File "/data1/user/functionary/server_vllm.py", line 29, in <module>
    from vllm.entrypoints.openai.protocol import (
ImportError: cannot import name 'LogProbs' from 'vllm.entrypoints.openai.protocol' (/home/user/miniconda3/envs/functionary/lib/python3.10/site-packages/vllm/entrypoints/openai/protocol.py)

It seems LogProbs has been changed in the latest version of vLLM. I'm trying to fix it myself, and I think the migration might be much quicker if grammar sampling is disabled and tool_choice is just set to "auto". I would appreciate it if you could explain the import of LogProbs and the create_logprobs function.

jeffreymeetkai commented 3 months ago

Hi, thank you for pointing this out. It turns out that the vLLM community made many revamps, big and small, to the codebase in the 2-3 months since we pinned the dependency version to v0.4.1, so I just raised #215 for review. It migrates the vLLM dependency to v0.5.0 and should now work both with and without grammar sampling.

Additionally, you can pass in rope scaling now. Here's an example:

python server_vllm.py --model meetkai/functionary-small-v2.5 --rope-scaling '{"type": "yarn", "factor": 4.0, "original_max_position_embeddings": 8192}' --rope-theta 500000.0

Make sure not to pass in --max-model-len, as it will override the rope scaling and keep the context window at whatever --max-model-len is set to.

The PR will be merged soon. Alternatively, you can hop on to this branch if you want to start evaluating before the PR is merged.

Edit: The PR is merged now.
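
As a quick sanity check once the server is running, a request along these lines can be sent to the OpenAI-compatible endpoint that server_vllm.py exposes (the port, headers, and prompt are assumptions for illustration; adjust them to your deployment):

# hypothetical smoke test against the locally running server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meetkai/functionary-small-v2.5", "messages": [{"role": "user", "content": "Give a one-line summary of RoPE scaling."}]}'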

Luffyzm3D2Y commented 3 months ago

Thank you for your help in resolving the issue! It seems to be running normally now with the updated vLLM. I will keep you posted with any feedback.

jeffreymeetkai commented 3 months ago

You're welcome, looking forward to any interesting findings! Do pull from the latest main, as a few more small bug fixes for running longer context lengths with server_vllm.py have been implemented.