Cache the source's response

lucemia commented 1 year ago

Is there an existing issue for this?

[X] I have searched the existing issues

Feature description

During the execution of Dependabot, it checks the PyPi server to retrieve metadata information. Some of these requests appear to be largely redundant. For instance, the following line will initiate a request to the PyPi server each time it is called.

https://github.com/dependabot/dependabot-core/blob/7e90f1c86f9da0011397734f28c69637b8c470b8/python/lib/dependabot/python/update_checker.rb#L261-L278

The log patterns looks like:

🌍 --> GET https://pypi.org/pypi/[your-package-name]/json/
🌍 <-- 301 https://pypi.org/pypi/[your-pacakge-name]/json/
🌍 --> GET https://pypi.org/pypi/[your-package-name]/json
🌍 <-- 404 https://pypi.org/pypi/[your-package-name]/json

...

🌍 --> GET https://pypi.org/pypi/[your-package-name]/json/
🌍 <-- 301 https://pypi.org/pypi/[your-pacakge-name]/json/
🌍 --> GET https://pypi.org/pypi/[your-package-name]/json
🌍 <-- 404 https://pypi.org/pypi/[your-package-name]/json

The same request for the same package can be repeated tens or even hundreds of times. Dependabot could significantly save time and bandwidth by caching the result.

I am willing to implement this feature if the maintainers think it is ok

abdulapopoola commented 1 year ago

Thanks for raising this @lucemia ; yes we would love it if you could help with implement this. Thank you so very much!!

lucemia commented 12 months ago

Thanks for raising this @lucemia ; yes we would love it if you could help with implement this. Thank you so very much!!

I ran an experiment and found that caching can save about 13.8% of execution time. Here is my test case:

Checking for updates 19 times
Cache HIT 84 times
execution time 7m25.504s (before caching) -> 6m24.042s (after caching)

I'm currently considering two ways to do caching, and any advice or input is welcome.

cache the library?
add cache layer in RegistryClient

cache the `library?`

Consider caching only the 'library?' result. In my test case, it seems to be the primary reason for repeated PYPI requests.

https://github.com/dependabot/dependabot-core/blob/7e90f1c86f9da0011397734f28c69637b8c470b8/python/lib/dependabot/python/update_checker.rb#L261-L278

However, since an UpdateChecker instance is created anew for each "checking for updates" session, I probably need to include a class variable within UpdateChecker to store the results across sessions.

add cache layer in `RegistryClient`

Another viable option is to implement "full response caching" as mentioned in the RegistryClient. This approach is more clean and could also offer benefits to other ecosystem facing similar situations.

https://github.com/dependabot/dependabot-core/blob/9773166774876d2bc266b948e787b69e5023b386/common/lib/dependabot/registry_client.rb#L9-L11

jeffwidman commented 11 months ago

Aren't these requests already cached within the Dependabot Proxy that GitHub runs internally? I recall @brrygrdn added a bunch of caching for exactly this scenario on various ecosystems' package registries, but my memory is hazy on where exactly it got added (probably due to the lack of sleep over the past few months with a new baby).

Adding caching at this layer would make it accessible to folks running Dependabot standalone--which would be awesome--but might increase complexity during debugging for the GitHub team.

I know there were some very preliminary internal discussions about possibly open sourcing that proxy, mostly it was "we should take a look at whether we can safely open source the proxy"... but if that was possible, that'd probably simplify the story here a lot because it'd unify the user experience between Dependabot standalone and Dependabot-as-run-by-GitHub....

@mctofu maybe you have some thoughts here?

lucemia commented 11 months ago

Does the proxy server cache HTTP 404 responses as well? I also noticed that timeout errors are cached separately.

https://github.com/dependabot/dependabot-core/blob/9773166774876d2bc266b948e787b69e5023b386/common/lib/dependabot/registry_client.rb#L29-L44

dependabot / dependabot-core