Open Ericson2314 opened 10 months ago
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/2023-11-20-nix-team-meeting-minutes-105/35902/1
I'll start working on the benchmarks and post here as soon as I have some results.
Collect data to decide if fetching the git history can be optimized so that it becomes viable to have this as the only strategy for fetching git repos, or if it is better to make fetching history conditional via a flag.
The two variants that are raced against each other in this benchmark:
Variant 1
but also fetch the commit history (treeless), in order to be able to return revCount.Using the git cli tool:
Variant 1
requires 1 invocation:
Variant 2
requires 3 invocations:
Repo: nixpkgs
Remotes:
Tested operations (measured in seconds):
Test Machines:
Variant 2
relative to Variant 1
https://docs.google.com/spreadsheets/d/1yLo4XMCKj-4xZVHBz2z4BwxBWI76xjVe7gtNgaNG7co/edit?usp=sharing
nix build github:DavHau/nix-git-research#benchmark-fod -L
Fetching the commit history in an optimized way requires using a --filter
for git fetch
. Bitbucket, which was one of the tested providers does not support filtering, therefore git falls back to full cloning, leading to very bad performance just for getting the rev-count.
Even in the best cases, fetching the commit history slows down the fetching by at least a factor of 3.7.
For example, from a cloud VM fetching a nixpkgs revision from github takes 8 seconds, while fetching the commit history and computing the revCount takes an additional 24 seconds, which is a slowdown by factor of 4.
In the case of nixpkgs, the number of objects that need to be fetched for the history is far larger than the number of tree objects for a single revision. It's ~500k objects for the history vs. ~60k for the tree.
While fetching the tree gives relatively consistent results, fetching the commit history is fast on some runs, but much slower on others, even for the same provider. This was only experienced with github on the 5G connection, and not from within the cloud. It might be an outlier, but it might be possible that github doesn't serve all kinds of requests with the same priority.
My opinion after these observations is that we should go with approach 2/ii from above, meaning the history fetching becomes optional and with it the revCount attribute as well.
Always fetching the history eagerly as described in 2/i would impose:
shallowRev = true/false
argument to fetchTree which is false
by default in fetchTree and true
by default in fetchGit. (I think it is useful for this flag to have shallow
in its name, because that is what people will most likely search for in manuals).shallowRev = true
.After the nix team adopts and/or confirms the plan, I can start working on it.
EDIT: an alternative plan could be to first optimize the current case, which means adding filtering support for the commit history right away. If libgit2 supports this, then this might be easy to do and it already provides quite a significant performance gain in many cases.
Thanks for the detailed analysis @DavHau !
Pursuing 2.ii seems to be the right call indeed given your benchmarks (I didn't know about --unshallow
, that's pretty neat btw).
The roadmap makes sense too, although I wouldn't be too strict about the ordering of the steps (different people will probably have different priorities here).
I'm not sure about the shallowRev
name (at least it doesn't mean much to me), but that can be a separate dicsussion.
I was on holiday and will pick this up now.
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/2023-12-15-nix-team-meeting-minutes-112/38155/1
Thanks to @DavHau for bringing up these issues in https://github.com/NixOS/nix/pull/9240 & https://github.com/NixOS/nix/pull/9376
TODO write problem
Tentative solution (two alternatives):
treeless
route if reasonably fast and supportedrevCount
supported because have commit historytreeless
clones locally rather than traditionalshallow
to not hit this restriction.revCount
)fetchTree
is norevCount
, but forfetchGit
for backwards-compat we have to keeprevCount
by default.--depth 1
fetchFromGithub
and friends if the performance is good enoughWith either version of (2) we never need to fetch any trees of other commits. The only question is whether we fetch all the commits and support
revCount
.