EmbeddedLLM / vllm

vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
https://vllm.readthedocs.io
Apache License 2.0

Roadmap #4

Open tjtanaa opened 1 year ago

tjtanaa commented 1 year ago
  1. Port vllm/main features to ROCm
  2. Benchmark

HAN-oQo commented 11 months ago

Hi @tjtanaa, I wonder how the roadmap is going. I'm quite excited to use the AWQ quantized format; when will it be supported?

tjtanaa commented 11 months ago

> Hi @tjtanaa, I wonder how the roadmap is going. I'm quite excited to use the AWQ quantized format; when will it be supported?

@HAN-oQo Hi, the vLLM authors have said they are working on a more efficient AWQ implementation in Triton, so we will address AWQ on ROCm after they release their new kernel.
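
For reference, here is a rough sketch of how an AWQ checkpoint is requested through vLLM's Python API once kernel support is available; the model id is only a placeholder, and on ROCm this path depends on the new kernel mentioned above.

```python
# Sketch only: load an AWQ-quantized checkpoint through vLLM's Python API.
# The model id is a placeholder; on ROCm this requires the upcoming AWQ kernels.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # placeholder AWQ-quantized checkpoint
    quantization="awq",               # select vLLM's AWQ kernels
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```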

HAN-oQo commented 11 months ago

Thank you for the answer, @tjtanaa! I also wonder why the safetensors format is not supported. Do you have a plan to support it?

Thank you for this nice project.

tjtanaa commented 11 months ago

> Thank you for the answer, @tjtanaa! I also wonder why the safetensors format is not supported. Do you have a plan to support it?
>
> Thank you for this nice project.

@HAN-oQo Loading safetensors is buggy on the ROCm platform; the memory management during safetensors loading appears to be the cause. The issue often shows up when tensor parallelism is larger than 1, whereas loading from .pt checkpoints works fine.
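
As a stopgap, one can force the .pt loading path instead of safetensors. Below is a minimal sketch assuming vLLM's standard `load_format` engine argument; the model id and parallelism degree are placeholders.

```python
# Sketch only: work around the ROCm safetensors issue by loading .pt weights.
# The model id and tensor_parallel_size are placeholders for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder checkpoint with .pt weights
    load_format="pt",                  # bypass the safetensors loading path
    tensor_parallel_size=2,            # the bug mostly appears when TP > 1
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```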