babashka / neil

A CLI to add common aliases and features to deps.edn-based projects
MIT License

Proposal: Local curl request caching #128

Closed · russmatney closed this issue 2 years ago

russmatney commented 2 years ago

Background

Neil makes use of the GitHub API when checking for the latest version of a dependency, in both the neil dep add and neil dep upgrade flows (and likely others). GitHub has a default rate limit of 60 requests per hour for unauthenticated requests - after that, requests for latest commits return a 403. GitHub rate limit docs here.

This 403 ends up nil-punning in the current code, eventually surfacing as a misleading "no remote found for lib" error here - improving that error would be low-hanging fruit, and probably the option with the least work and complexity.
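For illustration, here's a minimal sketch (not neil's actual implementation) of a curl-get-json-style helper built on babashka.curl and cheshire, and how a rate-limited 403 body quietly turns into nil:

;; sketch only - not the code in neil/curl.clj
(require '[babashka.curl :as curl]
         '[cheshire.core :as json])

(defn get-json [url]
  ;; :throw false returns the response map for a 403 instead of throwing
  (-> (curl/get url {:throw false})
      :body
      (json/parse-string true)))

;; A rate-limited body is roughly {:message "API rate limit exceeded..." ...},
;; so the keys the happy path expects come back nil, which later surfaces
;; as "no remote found for lib".
(comment
  (-> (get-json "https://api.github.com/repos/clj-kondo/clj-kondo/tags")
      first
      :name))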

This has been discussed in a few other places:

the (potential) problem

This rate limit could be a problem for local dev and testing, especially as the tests grow. The upgrade tests that merged this week now make quite a few network calls (6 to GitHub, from my read of them) - this makes it much easier to hit the rate limit even when working on just those tests.

It's possible for neil's end users to hit this limit as well. A project with 31+ git dependencies (which is extreme) would only be able to update its deps once per hour (presumably around two GitHub requests per git dep, so 31 deps already exhausts the 60-request budget) - and that assumes neil is the only tool hitting the GitHub API; the rate limit is also IP-based, so the 60 requests per hour could be shared. This means if you undo your upgrade, then try to redo it, it will fail. Definitely an extreme case!

a potential solution: a local cache

One way to deal with this would be a local cache that short-circuits repeated web requests. Repeated calls to neil dep upgrade could then return without making any web requests (after the first call). Users should be able to clear the cache themselves, but otherwise library updates are rare enough that caching for some amount of time (an hour, say) seems fine.

This would speed up the tests and mostly sidestep the rate limit for end users, unless they went out of their way to hit it.

Caching introduces complexity that neil doesn't have to deal with at all right now - it could create unexpected behavior for users or devs if it isn't clear there's a caching layer. It also adds another artifact/bit of state to manage, because it's another whole file to read from and write to. So, definitely some tradeoffs/drawbacks.

I implemented something like this on a personal project and had good success with an in-memory cache that was just a map with request parameters as keys and the response object as the value, so that's the direction I'd recommend going:

;; something like ~/.config/neil/request-cache.edn, a map
{;; the keys are the request params we would pass to curl
 "<request-url>"
 ;; the values are a tuple of request-time and response-object
 [request-timestamp response-parsed-json-blob]}
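To make that concrete, here is a rough sketch of how a lookup against such a file could work - the cache path, the one-hour TTL, and the helper names are all placeholder assumptions, not anything neil has today:

;; rough sketch only; path, TTL, and names are placeholders
(require '[babashka.fs :as fs]
         '[clojure.edn :as edn])

(def cache-file (fs/expand-home "~/.config/neil/request-cache.edn"))
(def ttl-ms (* 60 60 1000)) ;; one hour

(defn- read-cache []
  (if (fs/exists? cache-file)
    (edn/read-string (slurp (str cache-file)))
    {}))

(defn cached-get-json
  "Return the cached response for url if it is younger than the TTL,
  otherwise call fetch-fn (e.g. the existing curl-get-json) and cache the result."
  [url fetch-fn]
  (let [cache     (read-cache)
        [ts resp] (get cache url)
        now       (System/currentTimeMillis)]
    (if (and ts (< (- now ts) ttl-ms))
      resp
      (let [resp (fetch-fn url)]
        (fs/create-dirs (fs/parent cache-file))
        (spit (str cache-file) (pr-str (assoc cache url [now resp])))
        resp))))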

Actual Impact?

Really, I don't think this is too critical to neil's end users at the moment. I'm motivated to address it because I introduced a number of GitHub requests in the PRs that merged this week, and developers on this project may now be forced to use the new env vars to work with the repo's tests.

Working on neil offline is not really possible - this solution would make it reasonable for the tests to mock requests entirely with fixtures... but that's probably not too relevant, as neil is definitely not an offline tool.

TLDR

Sorry for the short novel!

This is all a long way of saying we could cache/memoize the curl-get-json function in neil/curl.clj. That feels like a good spot to cut off requests we don't need to make repeatedly, since the result is often the same (unless you just released a new version, in which case we should provide a cache-clearing command).

It'd require reading from and writing to a cache file somewhere, so we'd need to decide where we're OK with that file living. This would also require some new commands and help text to make it clear to users that a cache is in play.

In the meantime, devs and users can take advantage of the new env vars, once they set up a personal access token.
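For reference, authenticated requests are what raise the limit; here is a hedged sketch of how that could look with babashka.curl (the env var names below are placeholders - the actual variables introduced in the recent PRs are documented in the repo):

;; sketch only; GITHUB_USER / GITHUB_TOKEN are placeholder names
(require '[babashka.curl :as curl])

(defn gh-get [url]
  (let [user  (System/getenv "GITHUB_USER")
        token (System/getenv "GITHUB_TOKEN")]
    (curl/get url
              (cond-> {:throw false}
                (and user token) (assoc :basic-auth [user token])))))

;; Authenticated requests get a much higher rate limit (5000/hour at the
;; time of writing) than the anonymous 60/hour.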

borkdude commented 2 years ago

Wasn't this problem already addressed by using a GITHUB_TOKEN?

russmatney commented 2 years ago

Yes - the GitHub token takes care of it, if we're OK with every dev creating/using one to run the tests; similarly, if end users run into this, they can create and set a token themselves (provided we document it).

To me, this is a route to not requiring that of neil devs and users.

The actionable thing (if we don't want to go for a local cache) is probably improving the error handling when GitHub returns a 403 because of the rate limit.
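A rough sketch of what that improved handling might look like, assuming the babashka.curl response map is available at that point (the function name is hypothetical, and header-name casing may vary by client):

;; sketch only: detect the rate-limit case instead of letting the parsed
;; 403 body nil-pun into "no remote found for lib"
(defn check-github-response! [{:keys [status headers] :as resp}]
  (if (and (= 403 status)
           ;; GitHub sends X-RateLimit-Remaining: 0 when the limit is hit
           (= "0" (get headers "x-ratelimit-remaining")))
    (throw (ex-info "GitHub API rate limit exceeded; wait an hour or set a personal access token."
                    {:status status :headers headers}))
    resp))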

russmatney commented 2 years ago

This could also speed up the tests somewhat - I'm not sure exactly why they're slow, but if it's because of the network requests, this would be a route to getting run times to something more reasonable.

But feel free to close if there's no interest - it's not a pressing need or requirement for neil by any means; it was just similar to a problem I'd seen recently, so I thought I'd share.

borkdude commented 2 years ago

I haven't noticed problems as a user. For a developer, I think we have the token solution. Let's not complicate things if we don't have to: caching is hard.

russmatney commented 2 years ago

Sure, seems fine - I appreciate the commitment to keeping the complexity down! Hopefully we never have to deal with this.

teodorlu commented 2 years ago

I also think it's useful if we discuss end user needs separately from developer needs.

Another way to frame the developer side is that the tests currently:

  1. Can't be run if my network is down
  2. Aren't idempotent / deterministic: API outages, remote API changes, or API usage limits can cause flaky tests.

I see caching as one potential solution. Other solutions to "better tests" are probably worth considering too.

Personally - being able to provide a local GitHub token solves 99% of my current needs. The remaining 1% is mostly aesthetics 😄

russmatney commented 2 years ago

It occurred to me that an in-memory cache (dropping the external file bit) would solve the repeated-requests-in-tests problem, because the memory is shared across tests.

I did a quick proof of concept here: https://github.com/russmatney/neil/commit/fb5d7180d43e4a9541e4c53108b15a4506d138d8

This allows the tests to grow without adding more request burden - if they all work with (for example) :lib clj-kondo/clj-kondo, the whole suite would only make a handful of requests per run (vs new requests per test).
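For anyone skimming, the shape of that in-memory cache can be tiny - a simplified sketch (not the linked commit verbatim), wrapping the existing curl-get-json helper:

;; simplified sketch, not a copy of the proof-of-concept commit
(defonce request-cache (atom {}))

(defn cached-curl-get-json [url]
  (if-let [entry (find @request-cache url)]
    (val entry)
    (let [resp (curl-get-json url)] ;; the existing helper in neil/curl.clj
      (swap! request-cache assoc url resp)
      resp)))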

No pressure to pick this up - it's just an idea/starting point to consider if we ever want something like this. No worries about dropping it for now; I just had to get the idea out.

russmatney commented 2 years ago

I'll close this, as it's not a real problem for anyone at the moment. I opened a small docs PR mentioning the env vars in the Contributing section - that might help head off confusion for new contributors: https://github.com/babashka/neil/pull/129