recursionbane opened this issue 1 month ago
Hi, thanks for your report!
This is on our radar - we actually discussed this during the initial implementation. A TTL cache is very easy on its own, but memory management for TTL caches of various different things (AI, filesystem, etc.) across a distributed system (like production Puter) is a little less easy.
We can probably implement a database cache for this, but cached responses would immediately become the largest category of data we pull from the database, which might have ramifications for that reason.
We could cache responses as files on Puter's filesystem. This is probably the approach we will take first. In the meantime, I'll keep this issue open.
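For reference, a minimal sketch of what a file-backed TTL cache for inference responses could look like, assuming plain Node.js `fs`/`crypto` APIs rather than Puter's actual internal filesystem interface (the cache directory, TTL constant, and helper names are all placeholders):

```ts
// Rough sketch of a file-backed TTL cache for AI responses, assuming plain
// Node.js fs/crypto APIs; Puter's internal filesystem interface will differ.
import { createHash } from 'node:crypto';
import { mkdir, readFile, writeFile } from 'node:fs/promises';
import { join } from 'node:path';

const CACHE_DIR = '/tmp/ai-response-cache';   // hypothetical location
const TTL_MS = 7 * 24 * 60 * 60 * 1000;       // ~7 days, per the proposal

interface CacheEntry {
  cachedAt: number;   // epoch ms, candidate value for a `from_cache` field
  response: unknown;  // the raw inference result
}

// Cache key: hash of model + prompt, so identical requests map to the same file.
function cacheKey(model: string, prompt: string): string {
  return createHash('sha256').update(`${model}\n${prompt}`).digest('hex');
}

export async function getCached(model: string, prompt: string): Promise<CacheEntry | null> {
  try {
    const raw = await readFile(join(CACHE_DIR, cacheKey(model, prompt)), 'utf8');
    const entry: CacheEntry = JSON.parse(raw);
    // Expired entries are treated as misses; a periodic sweep could delete the files.
    return Date.now() - entry.cachedAt < TTL_MS ? entry : null;
  } catch {
    return null; // missing file or parse error => cache miss
  }
}

export async function putCached(model: string, prompt: string, response: unknown): Promise<void> {
  await mkdir(CACHE_DIR, { recursive: true });
  const entry: CacheEntry = { cachedAt: Date.now(), response };
  await writeFile(join(CACHE_DIR, cacheKey(model, prompt)), JSON.stringify(entry), 'utf8');
}
```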
To optimize response times and reduce API costs for Puter (especially if we increase context limits), could we implement a server-side caching mechanism for AI API inference calls? The cache would have a short Time-To-Live (TTL) of approximately 7 days and would be shared globally (user-independent). Caching should be keyed off the hash of the prompt and the selected model to maximize reuse.
Additionally, it would be helpful to provide an option to control caching behavior. By setting a `from_cache` boolean parameter to `false`, users could bypass the cache and repopulate it with fresh results for the current request. In the response body, including a `from_cache` field (as an epoch timestamp) would indicate when the inference result was last cached. This field could be omitted if the response is freshly generated rather than served from the cache. Cached responses should not count towards inference rate limits.
Thank you for considering this enhancement!