bd-iaas-us / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: End-to-end run for Triton on vllm #36

Closed: chizhang118 closed this issue 5 days ago

chizhang118 commented 1 month ago

🚀 The feature, motivation and pitch

The end-to-end scenario we want to support in the first milestone:

Alternatives

No response

Additional context

No response


chizhang118 commented 5 days ago

Verified the CUDA-free run. Code is ready in https://github.com/bd-iaas-us/vllm/tree/triton-npu-support.
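As a minimal sketch of what "verified CUDA-free" could mean here (assuming it refers to confirming the branch runs on a host without the CUDA driver; this check is illustrative and not part of the branch):

```python
import ctypes


def cuda_driver_present() -> bool:
    """Return True if the CUDA driver library can be dynamically loaded.

    A CUDA-free environment (e.g. an NPU host) should return False,
    so a Triton-on-vllm run there cannot be silently using CUDA.
    """
    for name in ("libcuda.so", "libcuda.so.1"):
        try:
            ctypes.CDLL(name)  # attempt to load the CUDA driver
            return True
        except OSError:
            continue  # library not found; try the next common soname
    return False


if __name__ == "__main__":
    print("CUDA driver present:", cuda_driver_present())
```

Running this on the target machine before launching the server gives a quick sanity check that no CUDA driver is available to fall back on.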

A detailed comparison is in this doc: https://bytedance.larkoffice.com/wiki/HHoiwbHbui1ZaKkYK3yclrXInvJ