llm-inference is a platform for publishing and managing LLM inference, providing a wide range of out-of-the-box features for model deployment, such as a UI, a RESTful API, auto-scaling, compute resource management, monitoring, and more.
This is a more complicated PR...
The previous implementation has an issue: it cannot run with `tensor_parallel_size > 1`, which means `vllm` cannot be launched with multiple GPUs.
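For context, tensor parallelism in vLLM is just a constructor argument; a minimal sketch (the model name is only an example) of the kind of call the previous implementation could not serve:

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size > 1 shards the model across GPUs; vllm
# coordinates those shards through its own internal ray workers.
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=2)  # example model

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```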
The reason is that `vllm` itself leverages `ray` internally to handle multiple GPUs, which means `vllm` sits above `ray`;
`llm-inference`, in contrast, puts `ray` above the inference implementations (`vllm`, `deepspeed`, `pytorch`, etc.).
That layering is what caused the issue (cannot run with more than one GPU, remember?).
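To make the conflict concrete, here is a minimal sketch of the problematic layering, an outer Ray Serve deployment constructing vLLM inside a Ray actor (the class and model names are illustrative, not the actual llm-inference code):

```python
from ray import serve

@serve.deployment(ray_actor_options={"num_gpus": 1})
class VLLMDeployment:  # hypothetical name, for illustration only
    def __init__(self):
        from vllm import LLM
        # We are already inside a ray actor here. With
        # tensor_parallel_size > 1, vllm tries to start its own ray
        # workers from within that actor, and the two layers fight
        # over the GPU resources instead of cooperating.
        self.llm = LLM(model="facebook/opt-13b", tensor_parallel_size=2)

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        return self.llm.generate([prompt])[0].outputs[0].text

serve.run(VLLMDeployment.bind())
```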