I started researching vllm since it's one of the engines in InstructLab.
In their README they state (points relevant to this issue):
There is also a relevant performance benchmark.
We can easily test their OpenAI-compatible web server in a container: https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html. This will be one of my next steps.
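A minimal sketch of what that test could look like once the container is up, assuming the server listens on `localhost:8000` as in the linked docs; the model name and the `EMPTY` API key are placeholders, not confirmed values for our deployment:

```python
# Minimal sketch: query a locally running vLLM OpenAI-compatible server.
# Assumes the server is reachable at http://localhost:8000 and serves the
# example model below; adjust both to the actual deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example model name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```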
I love that they have metrics baked in, so it will be easy for us to measure how many people use our service: https://docs.vllm.ai/en/v0.6.0/serving/metrics.html
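A quick sketch of how we could poll those metrics, assuming the same local server exposes a Prometheus-style `/metrics` endpoint as described in the linked docs; exact metric names may vary between vLLM versions:

```python
# Minimal sketch: dump vLLM's Prometheus metrics from a local server.
# Assumes http://localhost:8000/metrics is exposed as in the vLLM docs.
import requests

metrics = requests.get("http://localhost:8000/metrics", timeout=10).text

# Print only vLLM's own gauges/counters (e.g. running/waiting requests),
# which are the ones useful for tracking how much the service is used.
for line in metrics.splitlines():
    if line.startswith("vllm:"):
        print(line)
```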
One thing to check is the streaming response; there is an open PR here: https://github.com/vllm-project/vllm/pull/7648
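I have not verified what that PR changes exactly, but checking streaming from the client side could look like this sketch, reusing the assumed local server and example model name from above:

```python
# Minimal sketch: consume a streamed chat completion from the
# OpenAI-compatible endpoint of a local vLLM server (assumed setup as above).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example model name
    messages=[{"role": "user", "content": "Stream a short greeting."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; print tokens as they arrive.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```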
My write-up from the initial comparison of vllm and llama-cpp: https://blog.tomecek.net/post/comparing-llamacpp-vllm/
This is a tracking issue for figuring out how the service can process multiple requests in parallel, "so users wouldn't notice" queueing and we don't need to invest heavily in multiple GPUs.
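As a first measurement, a small smoke test like the sketch below could show whether concurrent requests really overlap (thanks to vLLM's continuous batching) instead of running one after another; it again assumes the local server and example model name used above:

```python
# Minimal sketch: fire N requests concurrently against the assumed local
# vLLM server and compare total wall-clock time with the slowest single
# request. If batching works, the total should be close to the slowest
# request rather than N times a single request.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(i: int) -> float:
    start = time.perf_counter()
    await client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",  # example model name
        messages=[{"role": "user", "content": f"Request {i}: say hello."}],
    )
    return time.perf_counter() - start

async def main(n: int = 8) -> None:
    start = time.perf_counter()
    durations = await asyncio.gather(*(one_request(i) for i in range(n)))
    total = time.perf_counter() - start
    print(f"{n} parallel requests finished in {total:.1f}s "
          f"(slowest single request: {max(durations):.1f}s)")

if __name__ == "__main__":
    asyncio.run(main())
```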