EricLBuehler / mistral.rs

Blazingly fast LLM inference.

Continuous batching with the Python API #503

Open nopepper opened 2 weeks ago

nopepper commented 2 weeks ago

Hello, sorry if this is not an issue per se, but I wasn't sure where else to put it. I couldn't find any info on whether all the performance features (including continuous batching, caching, etc.) are supported through the Python API. It seems like you can only call the Runner synchronously and with one request at a time. Am I missing something, or is batching only supported when using the server? Thanks.
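For reference, this is roughly how I am calling it today, following the README examples (the exact `Which.Plain` fields may differ by version):

```python
from mistralrs import ChatCompletionRequest, Runner, Which

# Build a runner for a plain HF model (arguments follow the README
# examples; exact fields may differ between versions).
runner = Runner(
    which=Which.Plain(
        model_id="mistralai/Mistral-7B-Instruct-v0.1",
    ),
)

# This call blocks until the full response is generated, so a second
# request cannot even start until the first one finishes.
res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="mistral",
        messages=[{"role": "user", "content": "Tell me a story about Rust."}],
        max_tokens=128,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
```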

EricLBuehler commented 2 weeks ago

Hi @nopepper! Prefix caching and all of the other performance features work through the Python API. The one thing that is harder to support is continuous batching: the synchronous API holds the GIL for the duration of a request, so only one request can be in flight at a time and there is nothing for the scheduler to batch together.
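In the meantime, continuous batching does work when you run the HTTP server, because it accepts many requests concurrently. A rough sketch of driving the OpenAI-compatible endpoint with concurrent requests (the port, model name, and API key below are placeholders for whatever you start the server with):

```python
import asyncio

from openai import AsyncOpenAI

# Assumes `mistralrs-server` is already running locally, e.g. on port 1234;
# the base URL and model name are placeholders.
client = AsyncOpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")


async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="mistral",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return resp.choices[0].message.content


async def main() -> None:
    # Because these requests are in flight at the same time, the server's
    # scheduler can interleave their decode steps (continuous batching).
    answers = await asyncio.gather(
        ask("What is continuous batching?"),
        ask("Explain the GIL in one sentence."),
    )
    for answer in answers:
        print(answer)


asyncio.run(main())
```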

We will be releasing an async Python API soon (probably during the coming week; perhaps we can leave this issue open until then), and the Python API itself will be reworked internally for version 0.2.0 for a smoother experience.
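As a purely hypothetical sketch of the kind of usage the async API should enable (the `send_chat_completion_request_async` name and everything else here are placeholders, not the final API):

```python
import asyncio

from mistralrs import ChatCompletionRequest, Runner, Which

# HYPOTHETICAL: an awaitable request method would release the GIL while
# the engine works, so several requests can be in flight and batched.
runner = Runner(which=Which.Plain(model_id="mistralai/Mistral-7B-Instruct-v0.1"))


async def main() -> None:
    requests = [
        ChatCompletionRequest(
            model="mistral",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=32,
        )
        for prompt in ("Hello!", "What is Rust?")
    ]
    # `send_chat_completion_request_async` is a placeholder method name.
    results = await asyncio.gather(
        *(runner.send_chat_completion_request_async(r) for r in requests)
    )
    for res in results:
        print(res.choices[0].message.content)


asyncio.run(main())
```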

EricLBuehler commented 4 days ago

@nopepper - beginning work on the async API!