EmbeddedLLM / vllm-rocm

vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs

https://vllm.readthedocs.io

Apache License 2.0

ROCm Port #1

Closed by kliuae 8 months ago

kliuae commented 8 months ago

Ported to ROCm

Included the hipified CUDA kernels in vLLM
Updated requirements and setup.py to support installation with ROCm
Changed the memory-efficient attention forward method in xformers to use ROCm's flash-attention instead
Included a miniaturized xformers as a vLLM submodule to interface with ROCm flash-attention
Updated the parts of the vLLM code that interface with xformers
Passed the number of GPUs explicitly to Ray initialization
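
As an illustration of the last item, here is a minimal sketch of passing the GPU count to Ray explicitly rather than relying on Ray's own detection. It is not the code from this PR. The helper names `ray_init_kwargs` and `init_ray_with_visible_gpus` are hypothetical. The sketch also assumes that `torch.cuda.device_count()` reports the AMD devices on a ROCm build of PyTorch, and that `ray.init` is called with its `num_gpus` keyword argument.

```python
from typing import Any, Dict


def ray_init_kwargs(num_gpus: int) -> Dict[str, Any]:
    """Build keyword arguments for ray.init that state the GPU count explicitly.

    Hypothetical helper for illustration; not part of vLLM or Ray.
    """
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    return {"num_gpus": num_gpus}


def init_ray_with_visible_gpus() -> Dict[str, Any]:
    """Count the GPUs visible to PyTorch and hand that number to Ray."""
    # Imported lazily so ray_init_kwargs can be used without torch or ray installed.
    import ray
    import torch

    # Assumption: on a ROCm build of PyTorch, AMD GPUs are exposed through
    # the torch.cuda namespace, so device_count() returns their number.
    kwargs = ray_init_kwargs(torch.cuda.device_count())
    ray.init(**kwargs)
    return kwargs
```

Calling `init_ray_with_visible_gpus()` once at startup would register the counted devices with Ray. Keeping the keyword-argument construction in a separate function lets it be checked without a GPU present.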

Added quick start instructions

Updated README
Added a compatible Dockerfile

Models tested

Llama 7B/13B/70B
Vicuna 7B/13B/33B

Co-authored by @tjtanaa @iAmir97 @tanpinsiang @meiyihTan