HeyPuter / puter

🌐 The Internet OS! Free, Open-Source, and Self-Hostable.
https://puter.com
GNU Affero General Public License v3.0

Implement Server-Side Caching for AI API Inference Calls #792

Open recursionbane opened 1 month ago

recursionbane commented 1 month ago

To optimize response times and reduce API costs for Puter (especially if context limits are increased), could we implement a server-side caching mechanism for AI API inference calls? The cache would have a short Time-To-Live (TTL) of approximately 7 days and would be shared globally (user-independent). Entries should be keyed on a hash of the prompt and the selected model to maximize reuse.
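For illustration, the key derivation described above could look something like the following minimal sketch, assuming Node's built-in `crypto` module; the function name, separator choice, and TTL constant are illustrative, not part of Puter's codebase.

```typescript
import { createHash } from 'node:crypto';

// Illustrative sketch: a global, user-independent cache key derived from the
// selected model and the prompt, as proposed above.
function inferenceCacheKey(model: string, prompt: string): string {
    return createHash('sha256')
        .update(model)
        .update('\0')      // separator so ("modelA", "bc") and ("modelAb", "c") hash differently
        .update(prompt)
        .digest('hex');
}

// The proposed ~7-day TTL, expressed in milliseconds.
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;
```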

Additionally, it would be helpful to provide an option to control caching behavior. By setting a from_cache boolean parameter to false, users could bypass the cache and repopulate it with fresh results for the current request.

In the response body, a from_cache field (as an epoch timestamp) would indicate when the inference result was last cached. This field can be omitted when the response is freshly generated rather than served from the cache.
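As a rough sketch of the proposal, the request and response shapes might look like the TypeScript interfaces below; the interface and field types are hypothetical and do not describe Puter's actual API.

```typescript
// Hypothetical request/response shapes for the proposal; field names follow the
// issue text but are not Puter's actual API.
interface InferenceRequest {
    model: string;
    prompt: string;
    from_cache?: boolean;  // proposed: false bypasses the cache and repopulates it
}

interface InferenceResponse {
    output: string;
    from_cache?: number;   // proposed: epoch timestamp of when the result was cached;
                           // omitted when the response was freshly generated
}
```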

Cached responses should not count towards inference rate limits.

Thank you for considering this enhancement!

KernelDeimos commented 3 weeks ago

Hi, thanks for your report!

This is on our radar - we actually discussed this during the initial implementation. A TTL cache is very easy, but memory management for a TTL cache of various different things (AI, filesystem, etc.) across a distributed system (like production Puter)... that's a little less easy.

We can probably implement a database cache for this, but that will immediately become the largest type of data we pull from the database and might have ramifications for that reason.

We could cache responses as files on Puter's filesystem. This is probably the approach we will take first. In the meantime, I'll keep this issue open.
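As a rough illustration of the file-backed approach mentioned above, a TTL cache over a filesystem could look something like this sketch; the cache directory, entry format, and 7-day TTL are assumptions for illustration, not a committed design.

```typescript
import { promises as fs } from 'node:fs';
import * as path from 'node:path';

// Illustrative sketch of a file-backed TTL cache; paths and field names are assumptions.
const CACHE_DIR = '/tmp/ai-inference-cache';
const TTL_MS = 7 * 24 * 60 * 60 * 1000;

interface CachedEntry {
    cached_at: number;   // epoch ms when the result was stored
    response: unknown;   // the serialized inference result
}

async function readCached(key: string): Promise<CachedEntry | null> {
    try {
        const raw = await fs.readFile(path.join(CACHE_DIR, key), 'utf8');
        const entry: CachedEntry = JSON.parse(raw);
        // Expired entries are treated as misses; a background sweeper could delete them.
        return Date.now() - entry.cached_at > TTL_MS ? null : entry;
    } catch {
        return null;     // missing file or parse error counts as a cache miss
    }
}

async function writeCached(key: string, response: unknown): Promise<void> {
    await fs.mkdir(CACHE_DIR, { recursive: true });
    const entry: CachedEntry = { cached_at: Date.now(), response };
    await fs.writeFile(path.join(CACHE_DIR, key), JSON.stringify(entry), 'utf8');
}
```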