b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.

feat: naive cache. #75

Closed b4rtaz closed 1 month ago

b4rtaz commented 1 month ago

This PR introduces a simple cache in the API server. It speeds up inference in chat clients like AnythingLLM by reusing already processed input tokens.
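Roughly, the server keeps the token sequence it processed for the previous request and, when a new request arrives, evaluates only the tokens that follow the longest shared prefix. A minimal sketch of that idea, using hypothetical names (`NaiveCache`, `commonPrefixLength`, `update`) rather than the actual distributed-llama API:

```cpp
#include <cstddef>
#include <vector>

// Illustrative naive prompt cache: remembers the tokens processed for the
// previous request so a follow-up request can skip the shared prefix.
struct NaiveCache {
    std::vector<int> tokens; // tokens already evaluated in the previous request

    // How many leading tokens of `prompt` match the cache, i.e. how many
    // positions can be skipped instead of re-evaluated.
    size_t commonPrefixLength(const std::vector<int>& prompt) const {
        size_t n = 0;
        while (n < tokens.size() && n < prompt.size() && tokens[n] == prompt[n])
            n++;
        return n;
    }

    // Remember the full token sequence after the request has been processed.
    void update(const std::vector<int>& processedTokens) {
        tokens = processedTokens;
    }

    void clear() { tokens.clear(); }
};
```

In a chat flow each new request usually extends the previous conversation, so the shared prefix covers almost the entire prompt and only the newest message needs to be evaluated.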

This PR also fixes the problem of an unwanted <|eot_id|> token being appended to the response.
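The fix amounts to treating <|eot_id|> purely as a stop signal: generation ends when it is sampled and the token is never decoded into the returned text. A hedged sketch with hypothetical function names (`predictNext`, `decode`), not the project's real API:

```cpp
#include <functional>
#include <string>
#include <vector>

// Illustrative generation loop: stop when the end-of-turn token appears and
// never append it to the response text.
std::string generate(
    std::function<int(const std::vector<int>&)> predictNext, // next-token sampler
    std::function<std::string(int)> decode,                   // token id -> text
    std::vector<int> tokens,
    int eotTokenId,
    int maxNewTokens)
{
    std::string response;
    for (int i = 0; i < maxNewTokens; i++) {
        int next = predictNext(tokens);
        if (next == eotTokenId) break; // drop <|eot_id|> instead of emitting it
        response += decode(next);
        tokens.push_back(next);
    }
    return response;
}
```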