Allow `git_repository()` to clone from a local look-aside cache.

bcsgh commented 1 year ago

Description of the feature request:

What I'd like is some principled application of these concepts by git_repository():

- https://randyfay.com/content/reference-cache-repositories-speed-clones-git-clone-reference
- https://randyfay.com/content/git-clone-reference-considered-harmful
- https://www.alchemists.io/articles/git_metadata_cloning

tl;dr; allow using a slow git remote as an authoritative source of truth while diverting as much of the bandwidth to more local and faster git sources. (For all I know --disk_cache et al. do that, but observational evidence suggests that's not the case.) Both local disk and on-premise git servers would be desirable caching options.

Ideally this could show up as a string_list_flag() that can be set via ~/.bazelrc, --bazelrc or the like with valid git remote URLs. As much as practical, git_repository() would use them for data and only fall back on git_repository().remote when it needs to "page in" data to the caches that's missing (or non-authoritative). Importantly, changing or removing that URL list or making them inaccessible should not alter build results.

The longer version of the idea (i.e. my somewhat ignorant musings about how I'd go about trying to implement this) is some way to avoid re-downloading git repos across the internet every time a "clean" build is kicked off. Given git's implementation, this could in theory be done by using a git repo somewhere "local" that includes all the remotes bazel has been asked to clone from and then pulling the majority of data from that.

Someone who knows a lot more about git internals than I do could probably figure out how to use one remote to resolve the commit to clone and then another (priority ordered list of) remote(s) to actually fetch the content. A possibility (that would require the cache actually be local) would be to first resolve commits and fetch into a cache directory (adding new remotes as needed) before doing the normal fetch with that directory as a substituted remote. Another possibility would be some kind of shallow, blobless and/or treeless clone from the real remote, followed by filling in the trees/blobs/etc. from the local/cache remote (where possible) and only pulling blobs/trees from the real remote as a last resort (and pushing them to the local/cache for next time).
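To make the first possibility concrete, here is a rough sketch of the git commands involved, written as a Python helper that only builds the command list. This is not how `git_repository()` works today. The function names, the remote-naming scheme, and the cache layout are all made up for illustration. It assumes the pinned commit is reachable from a branch of the real remote and that the installed git supports `--reference-if-able` and `--dissociate`.

```python
import hashlib
import subprocess


def remote_name(url: str) -> str:
    """Derive a stable ref namespace from a URL (hypothetical naming scheme)."""
    return "r-" + hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]


def plan_cached_checkout(remote: str, commit: str, cache_dir: str, dest: str) -> list[list[str]]:
    """Build the git invocations for a cache-assisted checkout of `commit`."""
    name = remote_name(remote)
    return [
        # 1. Make sure the local bare cache exists (re-running init is harmless).
        ["git", "init", "--bare", cache_dir],
        # 2. Update the cache from the real remote; only objects the cache
        #    lacks are transferred. Refs are namespaced per remote so that
        #    one cache can hold many upstreams.
        ["git", "-C", cache_dir, "fetch", remote,
         "+refs/heads/*:refs/remotes/" + name + "/*"],
        # 3. Clone from the real remote (still the source of truth for refs),
        #    borrowing objects from the cache. --dissociate copies the borrowed
        #    objects so the clone stays valid if the cache is later deleted.
        ["git", "clone", "--no-checkout", "--reference-if-able", cache_dir,
         "--dissociate", remote, dest],
        # 4. Check out the pinned commit.
        ["git", "-C", dest, "checkout", "--detach", commit],
    ]


def run_plan(plan: list[list[str]]) -> None:
    """Execute each command in order, stopping at the first failure."""
    for argv in plan:
        subprocess.run(argv, check=True)
```

Because `--reference-if-able` degrades to a normal clone when the cache is missing, removing the cache should only change how long the fetch takes, not what ends up in the checkout.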

Side note: Much of the work needed to make this happen would overlap with allowing git_repository() to use a list of remotes for fault tolerance and load spreading. As long as .commit is used this wouldn't even have issues with consistency.
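A minimal sketch of the priority-ordered fallback, assuming a hypothetical `try_fetch(url, commit)` callable that returns `True` once the pinned commit is available locally (in practice it might wrap something like `git fetch <url> <commit>`, which only works if the server allows fetching by SHA). None of these names exist in Bazel.

```python
from typing import Callable, Iterable, Optional


def ordered_sources(cache_urls: Iterable[str], authoritative: str) -> list[str]:
    """Caches in priority order, then the authoritative remote, without duplicates."""
    seen: set[str] = set()
    ordered: list[str] = []
    for url in [*cache_urls, authoritative]:
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered


def fetch_pinned(
    commit: str,
    cache_urls: Iterable[str],
    authoritative: str,
    try_fetch: Callable[[str, str], bool],
) -> Optional[str]:
    """Return the first source that supplied `commit`, or None if all failed."""
    for url in ordered_sources(cache_urls, authoritative):
        try:
            if try_fetch(url, commit):
                return url
        except OSError:
            # An unreachable cache is not an error; move on to the next source.
            continue
    return None
```

Since every source is asked for the same commit hash, the content obtained is the same whichever source answers, so an empty or unreachable cache list only costs time.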

What underlying problem are you trying to solve with this feature?

CI builds (and any other processes that start clean/clean'ish builds) can spend a significant amount of time just fetching sources.

Which operating system are you running Bazel on?

any

What is the output of `bazel info release`?

any

If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.

n/a

What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD` ?

n/a

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

Wyverald commented 1 year ago

Most of the Bazel ecosystem does not use git_repository, in favor of http_archive. It's easier to cache, doesn't have a dependency on a git on PATH, and offers stronger guarantees on checksums (SHA256 vs SHA1). This doesn't mean this FR is invalid, just makes it much lower priority.

github-actions[bot] commented 4 months ago

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 90 days unless any other activity occurs. If you think this issue is still relevant and should stay open, please post any comment here and the issue will no longer be marked as stale.

bcsgh commented 4 months ago

This is still relevant.

Even if git isn't the most common remote, I suspect it could well be the most generally useful one (as in, I suspect there are more code revisions that can't be accessed via http than that can't be accessed via git).
