HeyPuter / puter

🌐 The Internet OS! Free, Open-Source, and Self-Hostable.
https://puter.com
GNU Affero General Public License v3.0

Implement Server-Side Caching for AI API Inference Calls #792

Open recursionbane opened 1 month ago

recursionbane commented 1 month ago

To optimize response times and reduce API costs for Puter (especially if context limits are increased), could we implement a server-side caching mechanism for AI API inference calls? The cache would have a short Time-To-Live (TTL) of approximately 7 days and would be shared globally (user-independent). Entries should be keyed on a hash of the prompt and the selected model to maximize reuse.
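For illustration, the key derivation described above could look something like the following minimal sketch, assuming Node's built-in `crypto` module; the function name, separator choice, and TTL constant are illustrative, not part of Puter's codebase.

```typescript
import { createHash } from 'node:crypto';

// Illustrative sketch: a global, user-independent cache key derived from the
// selected model and the prompt, as proposed above.
function inferenceCacheKey(model: string, prompt: string): string {
    return createHash('sha256')
        .update(model)
        .update('\0')      // separator so ("modelA", "bc") and ("modelAb", "c") hash differently
        .update(prompt)
        .digest('hex');
}

// The proposed ~7-day TTL, expressed in milliseconds.
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;
```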

Additionally, it would be helpful to provide an option to control caching behavior. By setting a from_cache boolean parameter to false, users could bypass the cache and repopulate it with fresh results for the current request.

In the response body, a from_cache field (as an epoch timestamp) would indicate when the inference result was last cached. This field can be omitted when the response is freshly generated rather than served from the cache.
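As a rough sketch of the proposal, the request and response shapes might look like the TypeScript interfaces below; the interface and field types are hypothetical and do not describe Puter's actual API.

```typescript
// Hypothetical request/response shapes for the proposal; field names follow the
// issue text but are not Puter's actual API.
interface InferenceRequest {
    model: string;
    prompt: string;
    from_cache?: boolean;  // proposed: false bypasses the cache and repopulates it
}

interface InferenceResponse {
    output: string;
    from_cache?: number;   // proposed: epoch timestamp of when the result was cached;
                           // omitted when the response was freshly generated
}
```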

Cached responses should not count towards inference rate limits.

Thank you for considering this enhancement!

KernelDeimos commented 3 weeks ago

Hi, thanks for your report!

This is on our radar - we actually discussed this during the initial implementation. A TTL cache is very easy, but memory management for a TTL cache of various different things (AI, filesystem, etc.) across a distributed system (like production Puter)... that's a little less easy.

We can probably implement a database cache for this, but that will immediately become the largest type of data we pull from the database and might have ramifications for that reason.

We could cache responses as files on Puter's filesystem. This is probably the approach we will take first. In the meantime, I'll keep this issue open.
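As a rough illustration of the file-backed approach mentioned above, a TTL cache over a filesystem could look something like this sketch; the cache directory, entry format, and 7-day TTL are assumptions for illustration, not a committed design.

```typescript
import { promises as fs } from 'node:fs';
import * as path from 'node:path';

// Illustrative sketch of a file-backed TTL cache; paths and field names are assumptions.
const CACHE_DIR = '/tmp/ai-inference-cache';
const TTL_MS = 7 * 24 * 60 * 60 * 1000;

interface CachedEntry {
    cached_at: number;   // epoch ms when the result was stored
    response: unknown;   // the serialized inference result
}

async function readCached(key: string): Promise<CachedEntry | null> {
    try {
        const raw = await fs.readFile(path.join(CACHE_DIR, key), 'utf8');
        const entry: CachedEntry = JSON.parse(raw);
        // Expired entries are treated as misses; a background sweeper could delete them.
        return Date.now() - entry.cached_at > TTL_MS ? null : entry;
    } catch {
        return null;     // missing file or parse error counts as a cache miss
    }
}

async function writeCached(key: string, response: unknown): Promise<void> {
    await fs.mkdir(CACHE_DIR, { recursive: true });
    const entry: CachedEntry = { cached_at: Date.now(), response };
    await fs.writeFile(path.join(CACHE_DIR, key), JSON.stringify(entry), 'utf8');
}
```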