WukLab / preble

Stateful LLM Serving
Apache License 2.0

add support for vllm server via ssh #42

Closed dongmingli-Ben closed 6 months ago

dongmingli-Ben commented 6 months ago

This PR adds vllm support via an SSH connection. A VLLMRuntime is added and can be loaded like the other runtimes via MultiNodeLoader. Several things are added for vllm, none of which breaks the current code; a rough conceptual sketch of the SSH flow follows.
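(For context only, a minimal sketch of what launching a vllm server over SSH looks like conceptually; the hostname, port, GPU index, and sleep-based readiness wait are placeholders, and the PR's actual VLLMRuntime handles this inside MultiNodeLoader.)

```python
# Conceptual sketch, not the PR's code: start vllm's API server on a
# remote node over SSH, then send it a request over HTTP.
import subprocess
import time

import requests

HOST = "node-1"   # placeholder remote hostname
PORT = 8000       # placeholder port
MODEL = "mistralai/Mistral-7B-v0.1"

# Launch the server on the remote machine; CUDA_VISIBLE_DEVICES selects the GPU.
server = subprocess.Popen([
    "ssh", HOST,
    f"CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.api_server "
    f"--model {MODEL} --port {PORT}",
])

time.sleep(60)  # crude wait; a real runtime should poll until the server is up

# vllm's demo api_server exposes POST /generate with sampling params in the body.
resp = requests.post(
    f"http://{HOST}:{PORT}/generate",
    json={"prompt": "Hello, my name is", "max_tokens": 16},
)
print(resp.json())

server.terminate()
```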

An example of using the vllm runtime is in multi_node/benchmarks/bench_data_parallel_routing.py. The newly added test file multi_node/test_runtime.py also contains examples of using the vllm runtime.

Regarding performance: with mistralai/Mistral-7B-v0.1, vllm is sometimes faster than sglang in terms of total time across all requests.

vikranth22446 commented 6 months ago

do you support running both sglang and vllm?

dongmingli-Ben commented 6 months ago

do you support running both sglang and vllm?

Right now it does not support running both, because some of the args for sglang do not work with vllm and vice versa. One way to support both systems is to let the sglang and vllm runtimes ignore arguments that are not relevant to them; a sketch of that idea follows.
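(As an illustration of that idea, here is a minimal, hypothetical sketch: each launcher keeps only the keyword arguments it actually accepts and drops the rest. The launcher functions are made up for the example; mem_fraction_static and gpu_memory_utilization stand in for an sglang-only and a vllm-only option.)

```python
# Sketch of the "ignore irrelevant args" idea (illustrative names only).
import inspect


def filter_kwargs(fn, kwargs):
    """Keep only the kwargs that fn actually accepts."""
    accepted = set(inspect.signature(fn).parameters)
    return {k: v for k, v in kwargs.items() if k in accepted}


def launch_sglang(model_path, mem_fraction_static=0.8):
    print("sglang:", model_path, mem_fraction_static)


def launch_vllm(model_path, gpu_memory_utilization=0.8):
    print("vllm:", model_path, gpu_memory_utilization)


shared_args = {
    "model_path": "mistralai/Mistral-7B-v0.1",
    "mem_fraction_static": 0.7,        # sglang-only; vllm silently drops it
    "gpu_memory_utilization": 0.9,     # vllm-only; sglang silently drops it
}
launch_sglang(**filter_kwargs(launch_sglang, shared_args))
launch_vllm(**filter_kwargs(launch_vllm, shared_args))
```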

dongmingli-Ben commented 6 months ago

@vikranth22446 Both the sglang and vllm runtimes now ignore arguments that are irrelevant to them, so I can run sglang on GPU 0 and vllm on GPU 1. An example of this is in multi_node/benchmarks/bench_data_parallel_routing.py; an illustrative per-GPU configuration is sketched below.
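(Illustrative only; the real example lives in multi_node/benchmarks/bench_data_parallel_routing.py. A mixed deployment boils down to one config per GPU, each naming the runtime to launch; the field names here are placeholders.)

```python
# Hypothetical per-GPU configs; field names are illustrative.
gpu_configs = [
    {"gpu_id": 0, "runtime": "sglang", "model_path": "mistralai/Mistral-7B-v0.1"},
    {"gpu_id": 1, "runtime": "vllm",   "model_path": "mistralai/Mistral-7B-v0.1"},
]

for cfg in gpu_configs:
    # A loader such as MultiNodeLoader would pick the runtime class from
    # cfg["runtime"] and forward only the arguments that runtime understands.
    print(f"GPU {cfg['gpu_id']}: launch {cfg['runtime']} with {cfg['model_path']}")
```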

vikranth22446 commented 6 months ago

LGTM for now. A next step for cleanup would be to put the config directly inside the GPU config wrapper (extended for each runtime), but I'll merge.
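(One possible reading of that suggestion, sketched with hypothetical class names: a base GPU config wrapper, extended once per runtime so runtime-specific settings live next to the GPU assignment.)

```python
from dataclasses import dataclass


@dataclass
class GPUConfig:
    gpu_id: int
    model_path: str


@dataclass
class SGLangGPUConfig(GPUConfig):
    # sglang-specific knobs live on the sglang subclass
    mem_fraction_static: float = 0.8


@dataclass
class VLLMGPUConfig(GPUConfig):
    # vllm-specific knobs live on the vllm subclass
    gpu_memory_utilization: float = 0.8


configs = [
    SGLangGPUConfig(gpu_id=0, model_path="mistralai/Mistral-7B-v0.1"),
    VLLMGPUConfig(gpu_id=1, model_path="mistralai/Mistral-7B-v0.1",
                  gpu_memory_utilization=0.9),
]
```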