EricLBuehler / mistral.rs

Blazingly fast LLM inference.

Continuous batching with the Python API #503

Open nopepper opened 2 weeks ago

nopepper commented 2 weeks ago

Hello, sorry if this is not an issue per se, but I wasn't sure where else to put it. I couldn't find any info on whether all the performance features (including continuous batching, caching, etc.) are supported through the Python API. It seems like you can only call the Runner synchronously and with one request at a time. Am I missing something, or is batching only supported when using the server? Thanks.
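For reference, this is roughly how I am calling it today, following the README examples (the exact `Which.Plain` fields may differ by version):

```python
from mistralrs import ChatCompletionRequest, Runner, Which

# Build a runner for a plain HF model (arguments follow the README
# examples; exact fields may differ between versions).
runner = Runner(
    which=Which.Plain(
        model_id="mistralai/Mistral-7B-Instruct-v0.1",
    ),
)

# This call blocks until the full response is generated, so a second
# request cannot even start until the first one finishes.
res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="mistral",
        messages=[{"role": "user", "content": "Tell me a story about Rust."}],
        max_tokens=128,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
```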

EricLBuehler commented 2 weeks ago

Hi @nopepper! Prefix caching and all of the other performance features work through the Python API. The one thing that is harder to support is continuous batching: the synchronous API holds the GIL for the duration of a request, so only one request can be in flight at a time and there is nothing for the scheduler to batch together.
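In the meantime, continuous batching does work when you run the HTTP server, because it accepts many requests concurrently. A rough sketch of driving the OpenAI-compatible endpoint with concurrent requests (the port, model name, and API key below are placeholders for whatever you start the server with):

```python
import asyncio

from openai import AsyncOpenAI

# Assumes `mistralrs-server` is already running locally, e.g. on port 1234;
# the base URL and model name are placeholders.
client = AsyncOpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")


async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="mistral",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return resp.choices[0].message.content


async def main() -> None:
    # Because these requests are in flight at the same time, the server's
    # scheduler can interleave their decode steps (continuous batching).
    answers = await asyncio.gather(
        ask("What is continuous batching?"),
        ask("Explain the GIL in one sentence."),
    )
    for answer in answers:
        print(answer)


asyncio.run(main())
```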

We will be releasing an async Python API soon (probably during the coming week; perhaps we can leave this issue open until then), and the Python API itself will be reworked internally for version 0.2.0 for a smoother experience.
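As a purely hypothetical sketch of the kind of usage the async API should enable (the `send_chat_completion_request_async` name and everything else here are placeholders, not the final API):

```python
import asyncio

from mistralrs import ChatCompletionRequest, Runner, Which

# HYPOTHETICAL: an awaitable request method would release the GIL while
# the engine works, so several requests can be in flight and batched.
runner = Runner(which=Which.Plain(model_id="mistralai/Mistral-7B-Instruct-v0.1"))


async def main() -> None:
    requests = [
        ChatCompletionRequest(
            model="mistral",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=32,
        )
        for prompt in ("Hello!", "What is Rust?")
    ]
    # `send_chat_completion_request_async` is a placeholder method name.
    results = await asyncio.gather(
        *(runner.send_chat_completion_request_async(r) for r in requests)
    )
    for res in results:
        print(res.choices[0].message.content)


asyncio.run(main())
```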

EricLBuehler commented 4 days ago

@nopepper - beginning work on the async API!