erlang / rebar3

Erlang build tool that makes it easy to compile and test Erlang applications and releases.
http://www.rebar3.org
Apache License 2.0

RFE: cache git deps #1724

Closed by fenollp 5 years ago

fenollp commented 6 years ago

Current behaviour

Git dependencies are not cached (somewhat cc #1281). Here jesse gets cloned to _build/.../jesse but ~/.cache/rebar3/ has no Git directory or any place where it would cache it.

       ,{jesse,
         {git, "https://github.com/for-GET/jesse.git",
          {tag, "1.5.0-rc2"}}}

Note that git plugins will be cloned to ~/.cache/rebar3/plugins/{plugin} but their .git/ will not be kept (more like they get git archived). This sounds like an issue for #1301.

Expected behaviour

src/rebar_git_resource.erl should be caching git dependencies.

Here's what I think a cache of git repositories for rebar3 should look like:

$ cd ~/.cache/rebar3/
$ ls
git  hex  plugins
$ tree git/
git/
└── [4.0K]  github.com/
    └── [4.0K]  manopapad/
        └── [4.0K]  proper/
            ├── [4.0K]  branches/
            ├── [ 131]  config
            ├── [  73]  description
            ├── [  23]  HEAD
            ├── [4.0K]  hooks/
            │   ├── [ 478]  applypatch-msg.sample*
            │   ├── [ 896]  commit-msg.sample*
            │   ├── [ 189]  post-update.sample*
            │   ├── [ 424]  pre-applypatch.sample*
            │   ├── [1.6K]  pre-commit.sample*
            │   ├── [1.2K]  prepare-commit-msg.sample*
            │   ├── [1.3K]  pre-push.sample*
            │   ├── [4.8K]  pre-rebase.sample*
            │   ├── [ 544]  pre-receive.sample*
            │   └── [3.5K]  update.sample*
            ├── [4.0K]  info/
            │   └── [ 240]  exclude
            ├── [4.0K]  objects/
            │   ├── [4.0K]  info/
            │   └── [4.0K]  pack/
            │       ├── [ 99K]  pack-2b2c1a3dc2d0046d6971f9a5227c7c713d5d3a99.idx
            │       └── [2.2M]  pack-2b2c1a3dc2d0046d6971f9a5227c7c713d5d3a99.pack
            ├── [ 679]  packed-refs
            └── [4.0K]  refs/
                ├── [4.0K]  heads/
                └── [4.0K]  tags/

12 directories, 17 files

That is: put git deps under ~/.cache/rebar3/git/{host}/{user}/{repo} as bare repos.
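The layout above can be sketched as a small shell helper that maps a dependency URL to its cache directory. This is only an illustration of the proposed {host}/{user}/{repo} scheme: the function name git_cache_path and its optional second argument are hypothetical, not part of rebar3, and URLs carrying a port number are not handled.

```shell
#!/bin/sh
# Hypothetical helper (not part of rebar3): map a git URL to
# <cache root>/git/{host}/{user}/{repo}, following the layout proposed above.
# Assumes URLs of the form scheme://host/user/repo(.git) or
# user@host:user/repo(.git); ports are not handled.
git_cache_path() {
    url=$1
    root=${2:-"$HOME/.cache/rebar3"}   # cache root; the default mirrors the listing above
    rest=${url#*://}                   # drop "https://", "git://", "ssh://" if present
    rest=${rest#*@}                    # drop a leading "git@" style user prefix
    host=${rest%%[:/]*}                # everything before the first ':' or '/'
    path=${rest#*[:/]}                 # everything after it: user/repo(.git)
    path=${path%.git}                  # drop a trailing ".git"
    printf '%s/git/%s/%s\n' "$root" "$host" "$path"
}
```

For example, git_cache_path "https://github.com/for-GET/jesse.git" prints a path ending in git/github.com/for-GET/jesse under the cache root.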

Now, if rebar3 tries to install a dep identified by its commit hash and we have the bare repo cached, we can tell for certain whether the cache holds that commit. With anything other than a hash (a tag or a branch) we have to make sure our cached repo is up to date first. In most cases, however (provided a lockfile exists), that shouldn't happen.
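A minimal sketch of that check, using only stock git commands. The function names has_commit and ensure_commit are hypothetical, and the cache is created with git clone --mirror (an assumption on my part, chosen so that a later fetch updates every ref in the bare repo); nothing here reflects how rebar3 actually behaves.

```shell
#!/bin/sh
# Hypothetical helpers (not rebar3 code).

# Succeeds if the cached bare repo $1 already contains commit $2.
has_commit() {
    git -C "$1" cat-file -e "$2^{commit}" 2>/dev/null
}

# Make sure commit $3 is present in the cached repo $1 cloned from URL $2.
# The network is only touched when the cache is missing or lacks the commit.
ensure_commit() {
    repo=$1 url=$2 hash=$3
    # Assumption: --mirror is used so that "fetch origin" refreshes all refs.
    [ -d "$repo" ] || git clone --quiet --mirror "$url" "$repo" || return 1
    if ! has_commit "$repo" "$hash"; then
        git -C "$repo" fetch --quiet --prune origin || return 1
    fi
    has_commit "$repo" "$hash"
}
```

With a tag or a branch instead of a hash there is no such shortcut: the fetch has to run every time before the ref can be resolved.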

Notes:

I have not looked into _checkouts yet so cannot comment on that. With this, I hope to receive your criticisms & ideas! Thank you

fenollp commented 6 years ago

WRT _checkouts:

"Simply make a symlink or copy your dependency to _checkouts at the top level of your project."

No difference from my proposal then.

ferd commented 6 years ago

Yeah we only wanted to cache hex packages in the first place, since they're immutable. I don't believe there's any plan to cache git deps. The management and handling of these sounds trickier, since not all repos will have all branches and refs for all projects, and two projects using the same branch on the same repo can point at two distinct refs. The cache handling and invalidation there really sounds tricky and not fun. By comparison, hex packages are static, come with their own hashes, and can be re-validated against a single well-known index quite simply.

Plus git is all uncompressed by default. Git is just not on the roadmap for caching because we did not think it was a good candidate for that.

ferd commented 6 years ago

I should note that ref vs. tag/branch is only a problem on first fetches (or upgrades) since otherwise the lock file contains the ref itself and we can rely on that. So I guess you'd save some network access, but at the cost of possibly more storage space (we currently check out single branches when we can for example).

The other interesting question with using a cache is that we implicitly make git non-concurrent (can't run many builds at once in the same user account) since two parallel builds may try to alter the same cached repo to get the branch they need (unless git supports that?)

fenollp commented 6 years ago

I was thinking that caching bare repos would solve some of the trickiness: you can check out multiple refs, branches, or tags from the same single bare repo into different folders at the same time (as pointed out above, that uses git init). The only somewhat related issue I foresee here is when updating the cache: no more than one process per cached repo should be running the fetch command at a time. Haven't tried it yet, maybe git handles this gracefully.
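One way to get "no more than one fetch per cached repo", without relying on whatever git does internally, is a per-repo lock directory, since mkdir either creates the directory or fails atomically. The sketch below is hypothetical: with_repo_lock, the LOCK_TRIES variable, and the exit code 75 are names and values chosen for illustration only.

```shell
#!/bin/sh
# Hypothetical helper (not rebar3 code): run a command while holding a lock
# for one cached repo, so that parallel builds do not fetch into it at once.
# Usage: with_repo_lock <repo-dir> <command> [args...]
with_repo_lock() {
    lock=$1.lock                 # lock directory next to the cached repo
    shift
    max=${LOCK_TRIES:-30}        # assumed knob: attempts before giving up
    tries=0
    until mkdir "$lock" 2>/dev/null; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$max" ]; then
            return 75            # arbitrary "could not get the lock" status
        fi
        sleep 1
    done
    status=0
    "$@" || status=$?            # run the command, remember its exit status
    rmdir "$lock"
    return "$status"
}
```

A cache update would then look like with_repo_lock "$repo" git -C "$repo" fetch --quiet origin. A build that is killed while holding the lock leaves the directory behind, so a real implementation would also need some way to detect stale locks.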

Yes, basically once a lockfile exists, refs can be checked out trivially. Hex packages are similar here. A tarball can even be fetched from a git hash with git-archive. The immutability properties are equal to those of packages (both become unusable only when the remote or CDN disappears). I think storing the whole bare repo is more interesting though: _checkouts are easier, and storage is probably better optimized when multiple versions of the same repo are depended on.
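The git-archive route mentioned above can be sketched as follows: given a cached bare repo and a locked hash, export that tree into a build directory without creating a .git there. The function name export_ref and its argument order are hypothetical; git archive and tar are used with their standard options.

```shell
#!/bin/sh
# Hypothetical helper (not rebar3 code): unpack the tree at ref/hash $2 of the
# cached bare repo $1 into directory $3. Each destination gets its own copy,
# so several refs of the same repo can be exported side by side.
export_ref() {
    repo=$1 ref=$2 dest=$3
    mkdir -p "$dest" || return 1
    # Stream a tar archive of the tree and unpack it in the destination.
    git -C "$repo" archive --format=tar "$ref" | tar -xf - -C "$dest"
}
```

Because the export only reads from the bare repo, this step needs no network access once the commit is in the cache.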

Anyway I think my first point will help you see a solution. I will look into how to have git optimize storage further than what bare repos do.

fenollp commented 6 years ago

So while bare repos take less space than non-bare repos (because non-bare repos are by definition the .git + the worktree) I found only one way to ensure keeping the .git size down to a minimum in a portable manner: git-gc. I am surprised there doesn't seem to be anything more effective than that!

ferd commented 5 years ago

Covered through https://github.com/erlang/rebar3/pull/1844 -- I think it's not a bad idea to do it through git-specific flags like that rather than maintaining a stateful cache of git repos across versions.