ROCm Port - Githubissues

EmbeddedLLM / vllm

vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs

https://vllm.readthedocs.io

Apache License 2.0

89 stars 5 forks source link

ROCm Port #1

Closed kliuae closed 1 year ago

kliuae commented 1 year ago

Ported to ROCm

Included the hipify-ed cuda kernels in vLLM
Updated requirements setup.py to adapt to installation with ROCm
Changed the memory efficient attention forward method in xformers to using ROCm's flash-attention instead
Inclusion of a miniaturized xformers as a vLLM submodule for the interfacing of ROCm flash-attention
Updated parts of vLLM code interfacing with xformers
Added the explicit passing of the number of GPUs to ray initialization

Added quick start instructions

Updated readme
Added compatible dockerfile

Models tested

Llama 7b/13b/70b
Vicuna 7b/13b/33b

Co-authored by @tjtanaa @iAmir97 @tanpinsiang @meiyihTan