Open furlat opened 4 months ago
It could be worth trying a deployment of the LLM server on Modal as well: https://modal.com/docs/examples/vllm_inference
Interesting PRs for vLLM: speculative decoding (https://github.com/vllm-project/vllm/pull/2188) and fused MoE kernels (https://github.com/vllm-project/vllm/pull/2913, https://github.com/vllm-project/vllm/pull/2979).
The neural network architecture used as the language model will be Mixtral. The server must meet the following requirements:
- Structured extraction using Pydantic.
- Efficient batching, so that the same task can be repeated on thousands of different documents.
- A common-prefix system that reuses the KV cache for the system prompt and other shared prefixes.
- Speculative decoding, such as prompt n-gram caching, which lets the model draft suggestions from text already present in the input.
- Support for quantized models, since we should be able to run Mixtral on a budget of two 4090 GPUs per server, for a total of 48 GB of VRAM per server.
- FastAPI to serve the inference server to other machines within the servers' VPN/LAN.
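To make the prompt n-gram requirement concrete, here is a minimal sketch of the drafting step in plain Python. It is not vLLM's implementation: the function names `propose_draft` and `count_accepted`, the default `ngram_size`, and the use of integer token ids are all assumptions made for illustration. The idea is to look for the most recent earlier occurrence of the last few tokens and propose whatever followed it as draft tokens, which the real model then verifies in a single forward pass.

```python
from typing import List, Sequence


def propose_draft(tokens: Sequence[int], ngram_size: int = 3, max_draft: int = 5) -> List[int]:
    """Propose draft tokens by finding the trailing n-gram earlier in the sequence.

    Hypothetical helper: returns up to max_draft tokens that followed the most
    recent earlier occurrence of the last ngram_size tokens, or [] if none.
    """
    if len(tokens) < ngram_size + 1:
        return []
    tail = list(tokens[-ngram_size:])
    # Scan backwards so the most recent match wins; the range excludes the tail itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            return list(tokens[follow:follow + max_draft])
    return []


def count_accepted(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Count how many leading draft tokens agree with the model's own choices."""
    accepted = 0
    for drafted, expected in zip(draft, verified):
        if drafted != expected:
            break
        accepted += 1
    return accepted


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 6, 1, 2, 3]
    draft = propose_draft(context)
    print("draft:", draft)
    # Pretend the target model agreed with the first two draft tokens only.
    print("accepted:", count_accepted(draft, [4, 5, 9]))
```

This kind of drafting tends to pay off for extraction tasks, where much of the output is copied from the input document, so the draft is often accepted for several tokens at a time.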
The goal and evaluation criterion is the read and write speed of the inference server. In particular, we want to know how long it takes to read and write one million tokens when running the same structured extraction task on about a thousand documents in parallel.
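A rough sketch of how that measurement could be reported, assuming the server returns prompt and completion token counts per request and the whole batch is timed from outside. The `RunStats` class, the `seconds_per_million` helper, and the numbers in the example are hypothetical placeholders, not measured results.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Aggregated token counts and wall-clock time for one benchmark run."""
    prompt_tokens: int       # tokens read (prompt processing)
    completion_tokens: int   # tokens written (generation)
    wall_seconds: float      # end-to-end time for the whole batch

    @property
    def read_tps(self) -> float:
        return self.prompt_tokens / self.wall_seconds

    @property
    def write_tps(self) -> float:
        return self.completion_tokens / self.wall_seconds

    @property
    def total_tps(self) -> float:
        return (self.prompt_tokens + self.completion_tokens) / self.wall_seconds


def seconds_per_million(tokens_per_second: float) -> float:
    """Time needed to process one million tokens at the given throughput."""
    return 1_000_000 / tokens_per_second


if __name__ == "__main__":
    # Made-up example: 1000 documents, 900 prompt + 100 completion tokens each.
    stats = RunStats(prompt_tokens=900 * 1000, completion_tokens=100 * 1000, wall_seconds=250.0)
    print(f"read:  {stats.read_tps:.0f} tok/s")
    print(f"write: {stats.write_tps:.0f} tok/s")
    print(f"1M tokens in {seconds_per_million(stats.total_tps):.0f} s")
```

Reporting read and write throughput separately matters here because prefix caching mostly speeds up the read side, while speculative decoding mostly speeds up the write side.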
---> also on Modal
https://github.com/ollama/ollama https://github.com/abetlen/llama-cpp-python https://github.com/vllm-project/vllm