bentoml / OpenLLM

Run any open-source LLMs, such as Llama and Gemma, as OpenAI-compatible API endpoints in the cloud.
https://bentoml.com
Apache License 2.0

feat: support Volta architecture GPUs for the vLLM backend #930

Open K-Mistele opened 7 months ago

K-Mistele commented 7 months ago

Feature request

It would be great if OpenLLM supported pre-Ampere CUDA devices. In my case, I'm looking at the Volta architecture.

The README currently indicates that an Ampere-architecture or newer GPU is required to use the vLLM backend; otherwise you're stuck with the PyTorch backend.

As far as I can tell, this is not a vLLM-specific constraint: vLLM itself does not require an Ampere-architecture device.
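For reference, vLLM's documentation lists a minimum CUDA compute capability of 7.0 (Volta) rather than 8.0 (Ampere) for most models, so a check like the hypothetical sketch below (names and the `(7, 0)` threshold are my assumptions, not OpenLLM code) would pass on a V100 even though the README implies Ampere is required:

```python
# Hypothetical sketch: gate the vLLM backend on compute capability >= 7.0 (Volta)
# instead of requiring Ampere (8.0). The helper name and threshold are assumptions.
import torch


def gpu_supports_vllm(device_index: int = 0, minimum: tuple[int, int] = (7, 0)) -> bool:
    """Return True if the GPU's CUDA compute capability is at least `minimum`."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability(device_index)
    return (major, minor) >= minimum


if __name__ == "__main__":
    # On a Tesla V100 this prints (7, 0), so the check passes.
    print(torch.cuda.get_device_capability(0))
    print("vLLM-capable:", gpu_supports_vllm())
```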

Motivation

I am trying to run OpenLLM on my NVIDIA Tesla V100 (32 GB) devices, but I cannot use the vLLM backend because OpenLLM's vLLM backend does not support the Volta architecture.

Other

I would love to help however I can, but I can't find any documentation for where this constraint comes from other than the README. I've gone through vLLM's docs, and they do not indicate that this is a vLLM constraint.