Make some tests and choose an OpenAI API-compatible local LLM server

furlat commented 4 months ago

https://github.com/ollama/ollama https://github.com/abetlen/llama-cpp-python https://github.com/vllm-project/vllm

furlat commented 4 months ago

It could also be worth trying a Modal deployment of the LLM server: https://modal.com/docs/examples/vllm_inference

furlat commented 4 months ago

https://github.com/sgl-project/sglang

furlat commented 4 months ago

Interesting PRs for vLLM on speculative decoding (https://github.com/vllm-project/vllm/pull/2188) and fused MoE kernels (https://github.com/vllm-project/vllm/pull/2913, https://github.com/vllm-project/vllm/pull/2979).
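
For reference, one very simple form of speculative decoding is prompt n-gram lookup: the drafter searches the tokens seen so far for the most recent earlier occurrence of the trailing n-gram and proposes whatever followed it. The sketch below is not taken from the linked PRs. `propose_draft` and its parameters are hypothetical names, and it covers only the proposal step; the target model still has to verify the drafted tokens.

```python
from typing import List, Sequence


def propose_draft(tokens: Sequence[int], ngram_size: int = 3, num_draft: int = 5) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in `tokens`.

    `tokens` is the prompt plus everything generated so far, as token ids.
    Returns up to `num_draft` tokens that followed the most recent earlier
    occurrence of the last `ngram_size` tokens, or [] if there is no match.
    """
    if len(tokens) < ngram_size + 1:
        return []
    tail = list(tokens[-ngram_size:])
    # Scan backwards so the most recent match wins; skip the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            end = start + ngram_size
            return list(tokens[end:end + num_draft])
    return []


if __name__ == "__main__":
    history = [1, 2, 3, 4, 5, 6, 1, 2, 3]
    print(propose_draft(history, ngram_size=3, num_draft=3))
```

This tends to pay off on extraction tasks, where much of the output is copied from the input document.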

furlat commented 3 months ago

The language model will be Mixtral. The server must meet the following requirements:

- Structured extraction using Pydantic.
- Efficient batching, so the same task can be repeated over thousands of different documents.
- A common-prefix mechanism that reuses the KV cache for the system prompt and other shared prefixes.
- Speculative decoding, such as prompt n-gram caching, which lets the model draft tokens from text already present in the input.
- Support for quantized models: Mixtral has to fit within a budget of two 4090 GPUs per server, for a total of 48 GB of VRAM per server.
- FastAPI to expose the inference server to other machines within the servers' VPN/LAN.
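
As a rough check on why quantization is a hard requirement, the arithmetic below estimates weight memory at different precisions against the 48 GB budget. It assumes Mixtral 8x7B at roughly 46.7 billion total parameters (the thread only says "Mixtral"), counts 1 GB as 10^9 bytes, and ignores quantization metadata, activations, and the KV cache, so real usage will be higher. The function names are hypothetical.

```python
# Assumption: Mixtral 8x7B, about 46.7e9 total parameters.
MIXTRAL_8X7B_PARAMS = 46.7e9
# Budget from the requirements: two 24 GB cards per server.
VRAM_BUDGET_GB = 48.0


def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def headroom_gb(n_params: float, bits_per_weight: float,
                budget_gb: float = VRAM_BUDGET_GB) -> float:
    """VRAM left for KV cache and activations; negative means the weights do not fit."""
    return budget_gb - weight_size_gb(n_params, bits_per_weight)


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weight_size_gb(MIXTRAL_8X7B_PARAMS, bits)
        spare = headroom_gb(MIXTRAL_8X7B_PARAMS, bits)
        print(f"{bits:>2}-bit weights: {size:6.2f} GB, headroom {spare:7.2f} GB")
```

Under these assumptions the weights take about 93 GB at 16 bits, 47 GB at 8 bits, and 23 GB at 4 bits, so only a roughly 4-bit quantization leaves meaningful room for the KV cache that batching and prefix reuse depend on.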

The evaluation criterion is the inference server's speed at reading (prompt processing) and writing (generation). In particular, we want to know how long it takes to read and write one million tokens when running the same structured extraction task on about a thousand documents in parallel.
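
A minimal sketch of how this measurement could be set up, so that every candidate server is timed the same way. The harness takes any async callable that processes one document and returns `(prompt_tokens, completion_tokens)`. With a real OpenAI-compatible server, that callable would POST to the chat completions endpoint and read the counts from the response's `usage` field. `run_benchmark`, `BenchResult`, and `fake_extract` are hypothetical names, and `fake_extract` is only a stand-in so the sketch runs without a server.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable, Sequence, Tuple

# One call = one document; returns (prompt_tokens, completion_tokens).
ExtractFn = Callable[[str], Awaitable[Tuple[int, int]]]


@dataclass
class BenchResult:
    documents: int
    prompt_tokens: int
    completion_tokens: int
    seconds: float

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

    @property
    def tokens_per_second(self) -> float:
        return self.total_tokens / self.seconds if self.seconds > 0 else 0.0

    @property
    def seconds_per_million_tokens(self) -> float:
        tps = self.tokens_per_second
        return 1e6 / tps if tps > 0 else float("inf")


async def run_benchmark(docs: Sequence[str], extract: ExtractFn,
                        concurrency: int = 64) -> BenchResult:
    """Run `extract` over all docs with at most `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def one(doc: str) -> Tuple[int, int]:
        async with sem:
            return await extract(doc)

    start = time.perf_counter()
    usages = await asyncio.gather(*(one(d) for d in docs))
    elapsed = time.perf_counter() - start
    return BenchResult(
        documents=len(docs),
        prompt_tokens=sum(u[0] for u in usages),
        completion_tokens=sum(u[1] for u in usages),
        seconds=elapsed,
    )


async def fake_extract(doc: str) -> Tuple[int, int]:
    """Stand-in for a real request: whitespace 'tokens' in, 8 tokens out."""
    await asyncio.sleep(0.001)
    return len(doc.split()), 8


if __name__ == "__main__":
    result = asyncio.run(run_benchmark(["some document text"] * 1000, fake_extract))
    print(f"{result.total_tokens} tokens in {result.seconds:.2f}s, "
          f"{result.seconds_per_million_tokens:.1f}s per million tokens")
```

Reporting prompt and completion tokens separately keeps read speed and write speed distinguishable, since prefix caching mostly helps the former and speculative decoding the latter.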

The same evaluation should also be run on Modal.
