Feature request
Support passing --enforce_eager through to vllm from the openllm command line interface.
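A possible invocation is sketched below. The `--enforce-eager` flag is the proposed addition (it does not exist yet), and the model name and exact `openllm start` syntax are only placeholders for illustration:

```shell
# Proposed: forward the flag to vllm's engine arguments,
# i.e. set enforce_eager=True so CUDA graph capture is skipped.
openllm start facebook/opt-1.3b --backend vllm --enforce-eager
```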
Motivation
Since vllm 0.2.7, CUDA graph capture is on by default, which uses up to 3 GiB of VRAM in addition to the model weights. On my hardware, this triggers an out-of-memory error.

vllm supports the argument "enforce_eager=True" to disable CUDA graph capture, but I was not able to pass this argument from the openllm command line interface.

Other
Thank you for your time!