EmbeddedLLM / vllm

vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
https://vllm.readthedocs.io
Apache License 2.0

Roadmap #4

Open tjtanaa opened 1 year ago

tjtanaa commented 1 year ago
  1. Port vllm/main features to ROCm
  2. Benchmark

HAN-oQo commented 11 months ago

Hi @tjtanaa, I wonder how the roadmap is going. I'm quite excited to use the AWQ quantized format; when will it be supported?

tjtanaa commented 11 months ago

> Hi @tjtanaa, I wonder how the roadmap is going. I'm quite excited to use the AWQ quantized format; when will it be supported?

@HAN-oQo Hi, the vLLM authors have said they are working on a more efficient AWQ implementation in Triton, so we will address AWQ on ROCm after they release their new kernel.
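
For reference, here is a rough sketch of how an AWQ checkpoint is requested through vLLM's Python API once kernel support is available; the model id is only a placeholder, and on ROCm this path depends on the new kernel mentioned above.

```python
# Sketch only: load an AWQ-quantized checkpoint through vLLM's Python API.
# The model id is a placeholder; on ROCm this requires the upcoming AWQ kernels.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # placeholder AWQ-quantized checkpoint
    quantization="awq",               # select vLLM's AWQ kernels
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```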

HAN-oQo commented 11 months ago

Thank you for the answer, @tjtanaa! I also wonder why the safetensors format is not supported. Do you have a plan to support it?

Thank you for this nice project.

tjtanaa commented 11 months ago

> Thank you for the answer, @tjtanaa! I also wonder why the safetensors format is not supported. Do you have a plan to support it?
>
> Thank you for this nice project.

@HAN-oQo Loading safetensors is buggy on the ROCm platform; the memory management during safetensors loading appears to be the cause. The issue often shows up when tensor parallelism is larger than 1, whereas loading from .pt checkpoints works fine.
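
As a stopgap, one can force the .pt loading path instead of safetensors. Below is a minimal sketch assuming vLLM's standard `load_format` engine argument; the model id and parallelism degree are placeholders.

```python
# Sketch only: work around the ROCm safetensors issue by loading .pt weights.
# The model id and tensor_parallel_size are placeholders for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder checkpoint with .pt weights
    load_format="pt",                  # bypass the safetensors loading path
    tensor_parallel_size=2,            # the bug mostly appears when TP > 1
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```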