EmbeddedLLM / vllm

vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
https://vllm.readthedocs.io
Apache License 2.0

Compatible GPU architectures #14

Closed: TheJKM closed this issue 11 months ago

TheJKM commented 11 months ago

Hi, awesome work! I have a question about supported GPU architectures, and I couldn't find anything about it in the repo. All your tests seem to have been done on the MI210, which is a CDNA2 card. Does your vLLM ROCm port also work on other architectures, like RDNA 3 and RDNA 2, which are now supported by ROCm 5.7?

tanpinsiang commented 11 months ago

Thank you for your interest in our work. Due to the dependency on Flash Attention 2, the current version does not support the RDNA3 and RDNA2 architectures.

However, we are developing a version that does not rely on Flash Attention 2. If this alternative version interests you, please reach out to me via email.

tanpinsiang commented 11 months ago

According to https://github.com/ROCmSoftwarePlatform/flash-attention/blob/flash_attention_for_rocm/setup.py#L215, the supported architectures are CDNA2 (MI200) and CDNA3 (MI300).
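For anyone who wants to check ahead of time whether their GPU falls into that set, here is a minimal sketch of such an architecture check. It assumes a standard ROCm install where `rocm_agent_enumerator` is on PATH, and that the commonly reported ISA names gfx90a (MI200/CDNA2) and gfx940/gfx941/gfx942 (MI300/CDNA3) match the gate in the linked setup.py; it is illustrative, not the actual build logic.

```python
# Sketch: report whether the local AMD GPU advertises a CDNA2/CDNA3 ISA,
# i.e. one of the architectures the ROCm flash-attention build accepts.
# Assumes rocm_agent_enumerator is available (standard ROCm install).
import subprocess

# gfx90a = MI200-series (CDNA2); gfx940/941/942 = MI300-series (CDNA3).
SUPPORTED_ARCHS = {"gfx90a", "gfx940", "gfx941", "gfx942"}

def detected_archs() -> set[str]:
    out = subprocess.run(
        ["rocm_agent_enumerator"], capture_output=True, text=True, check=True
    ).stdout
    # The tool prints one ISA name per line; gfx000 is the CPU agent and is ignored.
    return {line.strip() for line in out.splitlines()
            if line.strip() and line.strip() != "gfx000"}

if __name__ == "__main__":
    archs = detected_archs()
    print("Detected GPU ISAs:", sorted(archs))
    if archs & SUPPORTED_ARCHS:
        print("At least one GPU matches the CDNA2/CDNA3 requirement.")
    else:
        print("No supported GPU found: Flash Attention 2 (ROCm) needs CDNA2 or CDNA3.")
```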

TheJKM commented 11 months ago

Thank you for your quick answer! My primary interest is in benchmarking, so I'm happy to wait until the other version is available.