datopian / metastore-lib

🗄️ Library for storing dataset metadata, with versioning support and pluggable backends including GitHub.
https://tech.datopian.com/versioning/
MIT License

[optimization] GitHub: Reduce number of API calls by caching #16

Open shevron opened 4 years ago

shevron commented 4 years ago

There are a few API calls that we tend to make repeatedly over the lifetime of a backend object, or even across multiple instantiations in the same process, and that could be cached using different approaches.

A good start is probably caching the results of GitHubStorage._get_owner() (there will typically be very few owners in a given installation) and GitHubStorage._get_repo(). The latter can return many different repos per installation, so caching across instantiations would need some kind of space-limited MRU cache; but even instance-level caching with a bounded size can help, since many operations call it multiple times for the same repo.
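A minimal sketch of what the instance-level, space-limited cache could look like, using functools.lru_cache to bound the cache size. The method names _get_owner / _get_repo come from the issue; the underlying client calls (get_user, get_repo) are stand-ins for whatever GitHub client the backend actually wraps, not the library's real internals:

```python
import functools


class GitHubStorage:
    """Sketch: per-instance, size-bounded caching of owner/repo lookups.

    Only _get_owner / _get_repo are names from the issue; the client
    API used below is a hypothetical stand-in.
    """

    def __init__(self, github_client):
        self._gh = github_client
        # Wrap the fetchers per instance; lru_cache provides the
        # "space limited" eviction discussed above.
        self._get_owner = functools.lru_cache(maxsize=8)(self._fetch_owner)
        self._get_repo = functools.lru_cache(maxsize=128)(self._fetch_repo)

    def _fetch_owner(self, owner_name):
        # One API call per distinct owner (until evicted)
        return self._gh.get_user(owner_name)

    def _fetch_repo(self, repo_id):
        # One API call per distinct repo (until evicted)
        return self._gh.get_repo(repo_id)
```

Creating the wrapped callables in __init__ keeps each cache scoped to a single instance, so cached objects are released with the instance rather than pinned for the life of the process.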

shevron commented 4 years ago

Note that we should probably look into #24 first before adding the complexity of a cache.

That said, there are cases where the owner / repo objects can simply be passed between methods of the same instance that call each other, rather than re-fetched when they are already in the calling scope. These changes add no real complexity and do not require "caching" in the classic sense, so they should be implemented. One example: update already has the repo object when it calls fetch, so it can pass the object along instead of calling the API again.
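The pattern described above can be sketched as an optional parameter on fetch: if the caller already holds the repo object, no second API call is made. The fetch / update names come from the issue; everything else (the client call, the metadata shape) is a hypothetical illustration, not the library's actual signatures:

```python
class GitHubStorage:
    """Sketch: reuse an already-fetched repo object down the call chain.

    fetch() accepts an optional pre-fetched repo so that update(),
    which already holds one, does not trigger a second API call.
    """

    def __init__(self, github_client):
        self._gh = github_client

    def _get_repo(self, package_id):
        # One API call per invocation
        return self._gh.get_repo(package_id)

    def fetch(self, package_id, repo=None):
        # Only hit the API if the caller did not supply the repo
        repo = repo if repo is not None else self._get_repo(package_id)
        return {"id": package_id, "repo": repo}

    def update(self, package_id, metadata):
        repo = self._get_repo(package_id)
        # Pass the repo we already have instead of re-fetching it
        current = self.fetch(package_id, repo=repo)
        current.update(metadata)
        return current
```

This is plain parameter passing, not caching: there is no shared state, no invalidation to reason about, and the saving is exactly one API call per update.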

shevron commented 4 years ago

The repo object is now passed around between internal calls (done in #25).

This issue is now about further caching of repo / owner / user objects, which does add complexity, so I consider it lower priority now.