fedora-copr / logdetective

Analyze logs using Language Model (LLM) and Drain template miner.
Apache License 2.0

Service can concurrently process multiple requests #82

Open · TomasTomecek opened this issue 1 month ago

TomasTomecek commented 1 month ago

This is a tracking issue for us to figure out how the service can process multiple requests in parallel, "so users wouldn't notice", without us having to invest heavily in multiple GPUs.

TomasTomecek commented 1 month ago

I started researching vLLM since it's one of the engines in InstructLab.

In their README they highlight points directly related to this issue, notably continuous batching of incoming requests and efficient management of attention KV-cache memory with PagedAttention.

There is also a relevant performance benchmark.

We can easily test their OpenAI-compatible web server in a container: https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html. This will be one of my next steps.
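Once the container is up, a first smoke test could look roughly like this (a minimal sketch: the model name, port, and prompt are assumptions for illustration, not our final configuration):

```python
import requests

# Query the OpenAI-compatible completions endpoint exposed by the
# vLLM container (it listens on port 8000 by default).
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.2",  # assumed model, for illustration
        "prompt": "Explain this build log error: ...",
        "max_tokens": 256,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```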

I love that they have metrics baked in, so it will be easy for us to measure how many people use our service: https://docs.vllm.ai/en/v0.6.0/serving/metrics.html
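vLLM exposes these as a Prometheus-style `/metrics` endpoint on the same server, so a quick look at the request counters could be as simple as the sketch below (the exact metric names should be verified against the linked docs):

```python
import requests

# Fetch the Prometheus-style metrics exposed by the vLLM server
# and print the request-related counters and gauges.
metrics = requests.get("http://localhost:8000/metrics", timeout=10).text
for line in metrics.splitlines():
    # Skip comment lines; keep metrics in the vllm namespace that
    # track requests (e.g. running/waiting queue sizes).
    if line.startswith("vllm:") and "request" in line:
        print(line)
```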

One thing to check is the streaming response; there is an open PR over here: https://github.com/vllm-project/vllm/pull/7648
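For reference, this is roughly how a streaming completion is consumed from an OpenAI-compatible server via server-sent events (a sketch with an assumed model name; whether vLLM's streaming behaves well for our use case is exactly what needs checking):

```python
import json
import requests

# Request a streaming completion; the server sends incremental
# chunks as server-sent events ("data: {...}" lines).
with requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.2",  # assumed model
        "prompt": "Explain this build log error: ...",
        "max_tokens": 256,
        "stream": True,
    },
    stream=True,
    timeout=120,
) as response:
    response.raise_for_status()
    for raw in response.iter_lines():
        if not raw:
            continue
        line = raw.decode("utf-8")
        if line.startswith("data: "):
            payload = line[len("data: "):]
            if payload == "[DONE]":  # end-of-stream sentinel
                break
            chunk = json.loads(payload)
            print(chunk["choices"][0]["text"], end="", flush=True)
```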

TomasTomecek commented 1 month ago

My write-up of the initial comparison of vLLM and llama-cpp: https://blog.tomecek.net/post/comparing-llamacpp-vllm/
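The core question for this issue is how each engine behaves under parallel load, so any such comparison boils down to firing concurrent requests and watching per-request latency. A minimal sketch of that kind of test (the endpoint, model, and request count are assumptions, not what the write-up used):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/completions"  # assumed vLLM endpoint
N_REQUESTS = 8  # arbitrary concurrency level for the test

def one_request(i: int) -> float:
    """Send a single completion request and return its wall-clock latency."""
    start = time.monotonic()
    resp = requests.post(
        URL,
        json={
            "model": "mistralai/Mistral-7B-Instruct-v0.2",  # assumed model
            "prompt": f"Explain error #{i} in this build log: ...",
            "max_tokens": 128,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return time.monotonic() - start

# Fire all requests at once; with continuous batching the per-request
# latency should degrade gracefully rather than queue up serially.
with ThreadPoolExecutor(max_workers=N_REQUESTS) as pool:
    latencies = list(pool.map(one_request, range(N_REQUESTS)))

print(f"min {min(latencies):.1f}s  max {max(latencies):.1f}s  "
      f"avg {sum(latencies) / len(latencies):.1f}s")
```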