lucemia opened 1 year ago
Thanks for raising this @lucemia; yes, we would love it if you could help implement this. Thank you so very much!!
I ran an experiment and found that caching can save about 13.8% of execution time. Here is my test case:
I'm currently considering two ways to do caching, and any advice or input is welcome.
**Option 1: cache only the `library?` result**

Consider caching only the `library?` result. In my test case, it seems to be the primary reason for repeated PyPI requests. However, since an `UpdateChecker` instance is created anew for each "checking for updates" session, I would probably need a class variable within `UpdateChecker` to store the results across sessions. A rough sketch follows.
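A minimal sketch of that idea, assuming `UpdateChecker` keeps its current `dependency` accessor; `pypi_library_details` below is a hypothetical stand-in for whatever code currently performs the PyPI request:

```ruby
module Dependabot
  module Python
    class UpdateChecker
      # Class-level cache so results survive across the UpdateChecker
      # instances created for each "checking for updates" session.
      @library_cache = {}
      @library_cache_mutex = Mutex.new

      class << self
        attr_reader :library_cache, :library_cache_mutex
      end

      private

      def library?
        self.class.library_cache_mutex.synchronize do
          cache = self.class.library_cache
          unless cache.key?(dependency.name)
            # pypi_library_details is hypothetical: it stands in for the
            # existing code that queries PyPI.
            cache[dependency.name] = pypi_library_details
          end
          cache[dependency.name]
        end
      end
    end
  end
end
```

The mutex only matters if checks can run concurrently; the important part is that the hash lives on the class rather than the instance, so it outlives any single session.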
**Option 2: full response caching in `RegistryClient`**

Another viable option is to implement the "full response caching" mentioned in `RegistryClient`. This approach is cleaner and could also benefit other ecosystems facing similar situations; see the sketch below.
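Roughly, that could be a response cache keyed by URL sitting in front of the existing request method. A sketch, assuming `RegistryClient` exposes a class-level `get(url:, headers:)`; the `cached_get` name and the cache layer are assumptions, not the current API:

```ruby
module Dependabot
  class RegistryClient
    @response_cache = {}
    @cache_mutex = Mutex.new

    class << self
      # Hypothetical cached variant of `get`: identical requests within
      # one run return the stored response instead of hitting the
      # registry again.
      def cached_get(url:, headers: {})
        key = [url, headers]
        @cache_mutex.synchronize do
          @response_cache[key] ||= get(url: url, headers: headers)
        end
      end
    end
  end
end
```

Keying on `[url, headers]` keeps the semantics obvious; whether error responses should be cached at all is a separate question (see the 404/timeout discussion below).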
Aren't these requests already cached within the Dependabot Proxy that GitHub runs internally? I recall @brrygrdn added a bunch of caching for exactly this scenario on various ecosystems' package registries, but my memory is hazy on where exactly it got added (probably due to the lack of sleep over the past few months with a new baby).
Adding caching at this layer would make it accessible to folks running Dependabot standalone--which would be awesome--but might increase complexity during debugging for the GitHub team.
I know there were some very preliminary internal discussions about possibly open sourcing that proxy, mostly along the lines of "we should take a look at whether we can safely open source the proxy"... but if that were possible, it'd probably simplify the story here a lot, because it'd unify the user experience between Dependabot standalone and Dependabot-as-run-by-GitHub.
@mctofu maybe you have some thoughts here?
Does the proxy server cache HTTP 404 responses as well? I also noticed that timeout errors are cached separately.
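For illustration, negative caching usually means giving error results a much shorter TTL than successes, so a transient 404 or timeout doesn't poison the cache for an entire run. A generic sketch of that idea; this is not the actual proxy's behaviour, which isn't public:

```ruby
# Generic negative-caching sketch: successful responses are kept much
# longer than error responses. TTL values are illustrative only.
class ResponseCache
  SUCCESS_TTL = 3600 # seconds
  ERROR_TTL   = 60

  def initialize
    @entries = {}
  end

  # Returns a cached response for `url` if still fresh, otherwise runs
  # the block, stores its result with a status-dependent TTL, and
  # returns it.
  def fetch(url)
    entry = @entries[url]
    return entry[:response] if entry && Time.now < entry[:expires_at]

    response = yield
    ttl = response.status == 200 ? SUCCESS_TTL : ERROR_TTL
    @entries[url] = { response: response, expires_at: Time.now + ttl }
    response
  end
end

# Usage: cache.fetch(url) { Excon.get(url) }
```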
Is there an existing issue for this?
Feature description
During the execution of Dependabot, it checks the PyPI server to retrieve metadata. Some of these requests appear to be largely redundant. For instance, the following method will initiate a request to the PyPI server each time it is called.
https://github.com/dependabot/dependabot-core/blob/7e90f1c86f9da0011397734f28c69637b8c470b8/python/lib/dependabot/python/update_checker.rb#L261-L278
The log pattern looks like:
The same request for the same package can be repeated tens or even hundreds of times; Dependabot could save significant time and bandwidth by caching the result.
I am willing to implement this feature if the maintainers think it is OK.