llm-inference is a platform for publishing and managing LLM inference, providing a wide range of out-of-the-box features for model deployment, such as a UI, a RESTful API, auto-scaling, compute resource management, monitoring, and more.
This is a more complicated PR...
The previous implementation has an issue: it cannot run with `tensor_parallel_size > 1`, which means `vllm` cannot be launched with multiple GPUs.
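For context, tensor parallelism in vLLM is just a constructor argument; a minimal sketch (the model name is only an example) of the kind of call the previous implementation could not serve:

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size > 1 shards the model across GPUs; vllm
# coordinates those shards through its own internal ray workers.
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=2)  # example model

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```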
The reason is that `vllm` itself leverages `ray` internally to handle multiple GPUs, which means `vllm` sits above `ray`;
`llm-inference`, in contrast, puts `ray` above the inference implementations (`vllm`, `deepspeed`, `pytorch`, etc.).
That layering is what caused the issue (cannot run with more than one GPU, remember?).
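To make the conflict concrete, here is a minimal sketch of the problematic layering, an outer Ray Serve deployment constructing vLLM inside a Ray actor (the class and model names are illustrative, not the actual llm-inference code):

```python
from ray import serve

@serve.deployment(ray_actor_options={"num_gpus": 1})
class VLLMDeployment:  # hypothetical name, for illustration only
    def __init__(self):
        from vllm import LLM
        # We are already inside a ray actor here. With
        # tensor_parallel_size > 1, vllm tries to start its own ray
        # workers from within that actor, and the two layers fight
        # over the GPU resources instead of cooperating.
        self.llm = LLM(model="facebook/opt-13b", tensor_parallel_size=2)

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        return self.llm.generate([prompt])[0].outputs[0].text

serve.run(VLLMDeployment.bind())
```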