bentoml / OpenLLM

Run any open-source LLMs, such as Llama 3.1 and Gemma, as OpenAI-compatible API endpoints in the cloud.
https://bentoml.com
Apache License 2.0

feat: support enforce_eager option from cli #1003

Closed. ADcorpo closed this issue 6 days ago.

ADcorpo commented 1 month ago

Feature request

Support passing --enforce_eager through to vLLM from the command line, like so:

openllm start repo/model --port 3000 --enforce_eager

Motivation

Since vLLM 0.2.7, CUDA graph capture is enabled by default, which can take up to 3 GiB of VRAM in addition to the model. On my hardware, this triggers an out-of-memory error.

vLLM supports the argument `enforce_eager=True` to disable CUDA graph capture, but I was not able to pass this argument through the openllm command-line interface.
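
For reference, this is roughly how the option maps onto vLLM's own Python API (a minimal sketch, not OpenLLM code; the model id is just a placeholder):

```python
from vllm import LLM

# enforce_eager=True skips CUDA graph capture, avoiding the extra VRAM
# it would otherwise reserve on top of the model weights.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder model id
    enforce_eager=True,
)
```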

Other

Thank you for your time!

bojiang commented 6 days ago

Supported in 0.6, but in a different way: you can draft your own model / set of configurations in bentoml/openllm-models.
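
As a rough illustration of what such a custom recipe could do, the sketch below constructs the vLLM engine with `enforce_eager=True`; the actual file layout and service code in bentoml/openllm-models may differ, and the model id is a placeholder:

```python
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Hypothetical engine setup inside a custom model recipe:
# enforce_eager=True disables CUDA graph capture to reduce VRAM usage.
engine_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder model id
    enforce_eager=True,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```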