cesfahani opened 1 year ago
Thanks a lot for putting this feature on the radar permanently, and for starting the discussion about it.
> I'm not seeing anything like this [verification of blobs/fsck] today in `gitoxide`, but I may be missing something...
There are various `verify()` methods that mostly validate indices and their packs directly. I believe this is implemented for entire object databases as well. Since these aren't based on graph traversal, while ignoring `.promisor` packs, they probably wouldn't even fail. In order to get there, one might first want to implement a connectivity check, which also isn't present right now (so `gitoxide` implicitly trusts packs produced by a server).
`gix` (the crate) serves as the hub for all `gix-*` plumbing crates. These themselves can refer to other plumbing crates but avoid doing so if possible. Thus, many crate boundaries are 'crossed' by passing closures which provide certain kinds of data, like git-attributes or object data. The latter typically results in `impl FnMut(&gix_hash::oid, &mut Vec<u8>) -> Result<gix_object::ObjectType, E>`.
Even though this represents the typical case of 'fetch one object at a time' which can be retrofitted quite easily, I'd recommend making this a shared trait that represents an object database. That way, multi-fetching facilities can be provided later. A great candidate for using these would be `gix-worktree-state::checkout()`, which happens to know all of the required blobs in advance as it has access to the index. Please note that such traits exist both in `gix-pack` and `gix-odb`, with different levels of capabilities catered to their respective use-cases. A shared trait would live in neither of these crates but would need a new one, so it can be depended on by plumbing crates which previously avoided depending on `gix-odb` in favor of closures.
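To make the idea concrete, a shared object-database trait of the kind described might look roughly like the sketch below. All names here (`FindObject`, `Kind`, `MemoryOdb`) are hypothetical stand-ins, not gitoxide's actual API; the real traits live in `gix-pack` and `gix-odb` with richer signatures.

```rust
use std::collections::HashMap;

/// Hypothetical object kind, standing in for the idea of `gix_object::Kind`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind { Blob, Tree, Commit, Tag }

/// Sketch of the shared trait: single-object lookup now, with room to add
/// multi-fetch methods later for callers like checkout that know their
/// working set in advance.
trait FindObject {
    /// Write the decompressed object data into `buf` and return its kind,
    /// or `None` if the object is absent (e.g. a promised object).
    fn find(&self, id: &str, buf: &mut Vec<u8>) -> Option<Kind>;
}

/// Toy in-memory backend to show how a caller would use the trait.
struct MemoryOdb(HashMap<String, (Kind, Vec<u8>)>);

impl FindObject for MemoryOdb {
    fn find(&self, id: &str, buf: &mut Vec<u8>) -> Option<Kind> {
        let (kind, data) = self.0.get(id)?;
        buf.clear();
        buf.extend_from_slice(data);
        Some(*kind)
    }
}

fn main() {
    let mut objects = HashMap::new();
    objects.insert("abc123".to_string(), (Kind::Blob, b"hello".to_vec()));
    let odb = MemoryOdb(objects);

    let mut buf = Vec::new();
    assert_eq!(odb.find("abc123", &mut buf), Some(Kind::Blob));
    assert_eq!(buf, b"hello".to_vec());
    // A promised object behaves like any absent object: simply not found.
    assert_eq!(odb.find("def456", &mut buf), None);
    println!("ok");
}
```

A trait object or generic bound over something like this could replace the `FnMut` closures at crate boundaries without forcing a dependency on a concrete ODB crate.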
Of course, to get the index one has to build it from a tree, which now may have promisor trees on the way to the blobs/leaves, so building an index from a tree would have to be adjusted first.
All of the above implicitly assumes that fetches with filters work in the first place.
What matters most to me is that it's easy to determine if promisor objects are present at all to be able to bypass any kind of 'project-gathering-dry-run' unless it's known to be required. In other words, everything should stay as fast as it is without promisor objects. I think this shouldn't be a problem though as it's easy to determine if promisor objects are present in the ODB or not.
Thanks @Byron. Things are slowly starting to get pieced together in my head.
> There are various `verify()` methods that mostly validate indices and their packs directly. I believe this is implemented for entire object databases as well. Since these aren't based on graph traversal, while ignoring `.promisor` packs, they probably wouldn't even fail. In order to get there, one might first want to implement a connectivity check which also isn't present right now (so `gitoxide` implicitly trusts packs produced by a server).
That sounds like a good starting point to me. After playing around with the various `verify()` methods, nothing seems to be doing a full walk of every commit and checking that all trees and blobs are reachable. I was able to confirm that with a partial clone, I can diff the enumerated entries from the `HEAD` and `HEAD~1` trees and see the expected missing blob sizes (conveniently, this is available in the `gix` CLI):
$ diff <(gix tree entries -r -e HEAD) <(gix tree entries -r -e HEAD~1)
331c331
< BLOB c8055d104958c490dd496f49dcd74d1f41dedbf0 3565 gix-attributes/src/state.rs
---
> BLOB 585e05adcdd9f4b57aa4f93b44dbeeb6ba760544 gix-attributes/src/state.rs
583c583
< BLOB 5e5a1e6e03f919f476ea4ae4ed7363da1e6eed42 8330 gix-url/src/parse.rs
---
> BLOB 8e1a57db1a04e81218c6a7f8ca5cf8cf5bdd5b9a gix-url/src/parse.rs
688c688
< BLOB f327cab204b22c7ead3e618a73d92704ef163dfa 16137 gix-attributes/src/search/outcome.rs
---
> BLOB 9ad8a81e1d771fdda65d9260e2d4bf120cd489c9 gix-attributes/src/search/outcome.rs
1104c1104
< BLOB c694334dddf4b8e32d8510fa05d5a5dd566bd807 4087 gix-url/tests/parse/mod.rs
---
> BLOB 1309fd133f29aa01b98cb17b65f034f481410f7c gix-url/tests/parse/mod.rs
1627d1626
< BLOB dd3b2c1d43d2f1af52444804bde68d990fd2c0fb 149713 gix-url/tests/fixtures/fuzzed/very-long2.url
I see three possible levels of connectivity checks, ranging from the simplest/fastest to the most complex/slowest (I'm ignoring some less-common ones for now):

1. Take an `oid` or `rev-spec` as an input, then perform the equivalent of `git fsck --connectivity-only --no-dangling <rev-spec>`. This would be relatively fast, as we're only walking a single ancestry tree, and don't have to worry about unreachable or dangling objects.
2. No `rev-spec` explicitly provided. This would be the equivalent of `git fsck --connectivity-only --no-dangling`, and would use the index, all `refs`, and all reflogs as heads.
3. The equivalent of `git fsck --connectivity-only`, which (in addition to the above) also prints out all objects that are not directly reachable by any other object. This will be more computationally intensive and will require significantly more runtime memory, as we'd need to (I think?) maintain a table of unreachable objects, inserting into it as we traverse new objects, and pruning from it as we find object references. Looks like `gix-hashtable` already implements the required data structure, but because this operation needs to start with a set of all objects and then walk the ODB to determine the reachability of each, optimizing it seems like it'll take some thought.

How about I start with just 1 & 2 above? Those seem relatively straightforward, and I believe they'd satisfy the immediate goal of making it easier to implement & test promisor objects. The third one can come later.
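The first two levels boil down to a graph walk from one or more tips that records every referenced-but-absent object. A minimal sketch over a toy object graph (plain strings stand in for oids; no gix types involved, and the adjacency map is a stand-in for real object parsing):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Toy ODB: each present object maps to the ids it references.
/// Absent ids model promised (filtered-out) objects.
type Odb = HashMap<&'static str, Vec<&'static str>>;

/// Walk from `tips`, returning every id that is referenced but not present,
/// i.e. roughly what `git rev-list --objects --missing=print` reports.
fn missing_objects(odb: &Odb, tips: &[&'static str]) -> HashSet<&'static str> {
    let mut seen = HashSet::new();
    let mut missing = HashSet::new();
    let mut queue: VecDeque<_> = tips.iter().copied().collect();
    while let Some(id) = queue.pop_front() {
        if !seen.insert(id) {
            continue; // already visited
        }
        match odb.get(id) {
            Some(children) => queue.extend(children.iter().copied()),
            None => {
                // Referenced but absent: in a partial clone, a promised object.
                missing.insert(id);
            }
        }
    }
    missing
}

fn main() {
    // commit -> tree -> { present blob, filtered-out blob }
    let mut odb = Odb::new();
    odb.insert("commit1", vec!["tree1"]);
    odb.insert("tree1", vec!["blob-present", "blob-filtered"]);
    odb.insert("blob-present", vec![]);

    let missing = missing_objects(&odb, &["commit1"]);
    assert_eq!(missing, HashSet::from(["blob-filtered"]));
    println!("missing: {missing:?}");
}
```

Level 1 passes a single resolved tip; level 2 would seed `tips` with the index, all refs, and all reflog entries instead.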
If that sounds reasonable, then my next question would be: where should this implementation go? Wherever it ends up, it will need to utilize functionality from several existing crates (via the `gix` crate, as you described), but I'm feeling like this might warrant a new `gix-fsck` crate.
> (conveniently, this is available in the `gix` CLI):

That's clever - I forgot `gix tree entries` is even a thing 😅, especially one that can display the sizes of objects.
> How about I start with just 1 & 2 above?

Do I understand correctly that the reason for the connectivity check to be required is that one has to create the `.promisor` pack (or its index, maybe) oneself, based on the objects that have been omitted in the received pack?
> (via the `gix` crate, as you described)

Maybe it's just me misunderstanding what you meant to say, but in general, `gix-*` crates can't ever reach 'up' to the `gix` crate. They may use each other though, with careful consideration. `gix-fsck` sounds like a crate one wants to have, especially if `git` internally also implements connectivity checks using `git fsck` invocations.
When I thought about connectivity checks (independently of promisor packs or indices) I always thought that this should be integrated tightly with the current pack resolution machinery. It sees all objects in the packs and learns about their hashes (these have to go into the index after all), and it's perfectly multi-threaded. Ideally, connectivity checks could be implemented there with minimal performance loss. Since the `index` file will already contain the set of all objects, maybe it's possible to reduce the memory requirements by referring to a memory-mapped version of it.
I know, I know, all that sounds like trouble and a separate walk-based implementation would be easier to do, but it would also be so much slower. Maybe with some thinking, connectivity checks can be built into the pack resolution, but if not, doing it in a separate step seems alright as well.
> Do I understand correctly that the reason for the connectivity check to be required is that one has to create the `.promisor` pack (or its index, maybe) oneself, based on the objects that have been omitted in the received pack?

A connectivity check isn't required by any means, at least not for implementing promisor packs. I had thought you were recommending doing that as a starting point (which I do agree is a good idea) based on this comment:
> In order to get there, one might first want to implement a connectivity check which also isn't present right now
I think this would definitely be helpful to have, as it would let me verify that the expected promisor objects are fetched/resolved during different operations on a partially cloned repository. Even before I have fetch with filters implemented, a connectivity check would let me verify that `gitoxide` reports the same missing/promisor objects as `git` when run against a test fixture repo. I'd be looking to compare the connectivity check's result to something like:
git rev-list --objects --missing=print HEAD | grep -E '^\?\w+'
> Maybe it's just me misunderstanding what you meant to say, but in general, `gix-*` crates can't ever reach 'up' to the `gix` crate.

Nope - this was totally my own misunderstanding of what you had said in your last comment about `gix` being "the hub" for all plumbing crates. With the help of cargo-depgraph, I was able to get a much better understanding of the relationships between the various `gitoxide` crates. In case anyone's interested... `cargo depgraph --workspace-only | dot -Gsplines=ortho -Tsvg > graph.svg` - super cool stuff!
> When I thought about connectivity checks (independently of promisor packs or indices) I always thought that this should be integrated tightly with the current pack resolution machinery. It sees all objects in the packs and learns about their hashes (these have to go into the index after all), and it's perfectly multi-threaded.
That sounds perfectly reasonable. I'll start digging into how this would best integrate into the current pack indexing & lookup architecture.
> Since the `index` file will already contain the set of all objects, maybe it's possible to reduce the memory requirements by referring to a memory mapped version of it.

I thought that was already implemented here? But yes - `mmap`'ing the index files is definitely the right way to go.
Even so, the third connectivity-check mode I mentioned in my last comment still has the added requirement of marking each object that is reachable/referenced from any other object, and then, once the traversal is complete, reporting all objects that were not marked accordingly (i.e. "dangling objects"). While the traversal algorithm can (and should) certainly be highly parallelized, there is still a non-trivial memory requirement to maintain the data structure of dangling objects. `git` does this with `struct repository`'s `parsed_objects`. Each object in there has a `flags` field, and the connectivity check sets the `REACHABLE` bit as it traverses the repository. `gitoxide` can certainly do better, but I'm not sure there's any way to get around the memory requirement of needing to maintain a mapping from every object to an indicator of whether or not it is reachable...
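That per-object reachability marker amounts to one bit (or one set membership) per object. A sketch of the marking pass, using std collections in place of `gix-hashtable` and toy string ids in place of real oids:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Toy ODB: each present object maps to the ids it references.
type Odb = HashMap<&'static str, Vec<&'static str>>;

/// Mark everything reachable from `heads`, then report the objects that were
/// never marked - the "dangling" set that `git fsck --connectivity-only`
/// prints. Set membership here stands in for git's per-object REACHABLE flag.
fn dangling(odb: &Odb, heads: &[&'static str]) -> HashSet<&'static str> {
    let mut reachable = HashSet::new();
    let mut queue: VecDeque<_> = heads.iter().copied().collect();
    while let Some(id) = queue.pop_front() {
        if reachable.insert(id) {
            if let Some(children) = odb.get(id) {
                queue.extend(children.iter().copied());
            }
        }
    }
    // Start from the set of all objects and subtract the reachable ones.
    odb.keys().filter(|id| !reachable.contains(*id)).copied().collect()
}

fn main() {
    let mut odb = Odb::new();
    odb.insert("commit1", vec!["tree1"]);
    odb.insert("tree1", vec!["blob1"]);
    odb.insert("blob1", vec![]);
    odb.insert("orphan-blob", vec![]); // written but never referenced

    assert_eq!(dangling(&odb, &["commit1"]), HashSet::from(["orphan-blob"]));
    println!("ok");
}
```

The memory cost is exactly what the paragraph above describes: the full object set plus one reachability indicator each, which is why starting from a memory-mapped index is attractive.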
I'm not planning on addressing that use-case of scanning for dangling objects right now, as it's entirely unrelated to partial clones. But it will be something to consider once features like garbage collection come to a head.
Thanks again @Byron! Please let me know if I've misunderstood anything, but otherwise I think I've got plenty of info to get me started.
I think there are a couple of things worth untangling, from my perspective:
With all that said, I am not convinced that connectivity checks integrated into pack resolution are a good idea, and if truly needed, they should be done as a standalone capability. For testing and verification it might also be possible to start out by just using `git` directly for that. It's your choice though.
It might seem I am changing my mind a lot, which might be due to my utter lack of understanding of how `git` actually implements this capability. The current way of working is that you lead the feature and I support with `gix`-specific knowledge, but I welcome you to share all you know and update the tracking ticket with links to resources and/or `git` source code as we progress, which should help me to catch up.
Apologies for the delay! I had some other stuff come up, but have also been spending time trying to better understand the intricate details of `git` partial clones, which I feel I have a much better handle on now.
Before I dive into what I've learned about partial clones in `git`, here's a quick taste of where I've gotten to in identifying missing/promisor objects in `gitoxide`, and how its performance compares to `git`.
Note: Please ignore the `fsck connectivity` arguments to `gix` below. I had started out with a connectivity check in mind, but since then I've just been using that subcommand as my playground for testing things out.
# Perform a blobless clone of gitoxide
cesfahani@diesel-tr:~$ git clone --filter=blob:none https://github.com/byron/gitoxide.git gitoxide-blobless
cesfahani@diesel-tr:~$ cd gitoxide-blobless
# Check number of missing objects with git
cesfahani@diesel-tr:~/gitoxide-blobless$ git rev-list --objects --quiet --missing=print HEAD | wc -l
45439
# Check number of missing objects with gix
cesfahani@diesel-tr:~/gitoxide-blobless$ gix fsck connectivity HEAD | wc -l
45439
# Yay! Now let's diff the (sorted) contents to double-check...
cesfahani@diesel-tr:~/gitoxide-blobless$ diff <(gix fsck connectivity HEAD | sort) <(git rev-list --objects --quiet --missing=print HEAD | cut -c2- | sort)
# Perfect match! Now let's check performance...
cesfahani@diesel-tr:~/gitoxide-blobless$ time git rev-list --objects --quiet --missing=print HEAD > /dev/null
real 0m1.210s
user 0m0.454s
sys 0m0.730s
cesfahani@diesel-tr:~/gitoxide-blobless$ time gix fsck connectivity HEAD > /dev/null
real 0m0.551s
user 0m0.458s
sys 0m0.093s
This is great, but even better is the fact that I did zero of that fearless parallelism you mentioned in #1041 - my naive implementation is single-threaded and written by someone entirely new to `gitoxide` (and also mostly new to Rust). Parallelizing this should be straightforward and I expect it will significantly improve performance, but it's awesome to know that my dumb first pass is already over 2X faster than `git`.
Now - about what I've learned about partial clones...
When we fetch a pack from a promisor remote, we end up with a `.promisor` file in addition to the usual `.pack`, `.idx`, and `.rev` files. I was initially a bit confused as to what the contents of this `.promisor` file were for... Interestingly, when doing a repack (e.g. via `git gc --aggressive`), I found the resulting single pack's `.promisor` file was empty, despite the original pack `.promisor` files having a bunch of object IDs in them.
If you read the first bullet of Handling Missing Objects, it says the promisor file has "arbitrary contents". Upon digging into the `git` source, I can confirm that the `.promisor` file is in fact never actually read - only its presence is important. You can see here that when doing a repack, `git` will just always write out an empty `.promisor` file. I expect that some old version of `git` used the contents of these files for something, but I'm not sure what. The Footnotes describe a possible explanation.
In the same Handling Missing Objects section, the second bullet describes how `git` decides that an object is a promisor object. This aligns with the `is_promisor_object` function definition. That function lazily creates a static set of missing promisor objects for each pack. You can see in `add_promisor_object` that it will walk each object in the pack and find anything it refers to that is not in the pack. It designates each such object as a promisor object.
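Reading `is_promisor_object`/`add_promisor_object` that way, the lazily built set is essentially "everything referenced from a promisor pack, minus everything actually present". A sketch under that reading (toy ids and a toy pack type, not git's actual data structures):

```rust
use std::collections::{HashMap, HashSet};

/// Toy promisor pack: the objects it contains, and for each object the ids
/// that object references.
struct PromisorPack {
    objects: HashMap<&'static str, Vec<&'static str>>,
}

/// Build the set git calls "promisor objects": everything referenced from a
/// promisor pack that is not itself present in any of the given packs.
fn promisor_objects(packs: &[PromisorPack]) -> HashSet<&'static str> {
    let present: HashSet<_> = packs
        .iter()
        .flat_map(|p| p.objects.keys().copied())
        .collect();
    packs
        .iter()
        .flat_map(|p| p.objects.values().flatten().copied())
        .filter(|id| !present.contains(id))
        .collect()
}

fn main() {
    let pack = PromisorPack {
        objects: HashMap::from([
            ("commit1", vec!["tree1"]),
            ("tree1", vec!["blob-kept", "blob-filtered"]),
            ("blob-kept", vec![]),
        ]),
    };
    let promised = promisor_objects(&[pack]);
    assert_eq!(promised, HashSet::from(["blob-filtered"]));
    println!("promised: {promised:?}");
}
```

In git the set is built on demand and cached; the sketch builds it eagerly for clarity.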
All this looks good; however, it doesn't align with all the behavior I see in `git`. I've noticed this before, but only now did I dig into it. Brace yourself - you're not going to like it...
In `git`, if I attempt to do something like `cat-file` on a totally invalid/non-existent object ID, you'll notice it takes a bit longer than you'd expect. Add some tracing onto that, and you'll see the gross behavior:
cesfahani@diesel-tr:~/gitoxide-blobless$ GIT_TRACE=2 git cat-file -p 0000000000111111111122222222223333333333
20:12:02.891575 git.c:463 trace: built-in: git cat-file -p 0000000000111111111122222222223333333333
20:12:02.892446 run-command.c:659 trace: run_command: git -c fetch.negotiationAlgorithm=noop fetch origin --no-tags --no-write-fetch-head --recurse-submodules=no --filter=blob:none --stdin
20:12:02.895664 git.c:463 trace: built-in: git fetch origin --no-tags --no-write-fetch-head --recurse-submodules=no --filter=blob:none --stdin
20:12:02.896036 run-command.c:659 trace: run_command: GIT_DIR=.git git remote-https origin https://github.com/byron/gitoxide.git
20:12:02.898562 git.c:749 trace: exec: git-remote-https origin https://github.com/byron/gitoxide.git
20:12:02.898612 run-command.c:659 trace: run_command: git-remote-https origin https://github.com/byron/gitoxide.git
fatal: remote error: upload-pack: not our ref 0000000000111111111122222222223333333333
fatal: Not a valid object name 0000000000111111111122222222223333333333
This unfortunate behavior is explained by the third bullet in Handling Missing Objects... At the end of that bullet, you'll find:
> For efficiency reasons, no check as to whether the missing object is actually a promisor object is performed.

I understand having to resort to this behavior if I had used a partial clone filter that filtered trees in addition to blobs, but `git` knows that I specified a blobless clone and not a treeless clone. Either way, the pack index will not contain any reference to the missing promisor objects, so the only way to know whether or not they are "promised" (short of asking the remote for them) is to see if anything in the promisor packs references them. A big question remains as to how efficient it could be in `gitoxide` to go check if an object ID is referenced by any promisor pack, and whether doing so would be more or less efficient than just asking the remote for the object.
@Byron I'd love to get an opinion on this. Right now my thinking for a minimal implementation is:

- Implement the `filter` fetch option
- Allow a set of missing objects to be built up (on-demand) for each promisor pack (identified by the `.promisor` file in each)
- In `gix-odb`, we'll first search the indexes of each pack (as done today), but if & only if the object isn't found, we'll (in parallel) go search for missing objects in each promisor pack (if any exist)
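That last lookup flow - regular pack indexes first, promisor membership only on a miss - can be sketched like this. The `Odb`/`Lookup` types are toy stand-ins for the real `gix-odb` machinery, and the promisor set is assumed to have been built already:

```rust
use std::collections::{HashMap, HashSet};

/// Result of an object lookup in a partial clone.
#[derive(Debug, PartialEq)]
enum Lookup {
    Found(Vec<u8>),
    Promised, // absent locally, but promised by a promisor pack
    Unknown,  // absent and not promised: a genuine error
}

/// Toy stand-in for the ODB: present objects plus the (lazily built)
/// set of objects promised by promisor packs.
struct Odb {
    present: HashMap<String, Vec<u8>>,
    promised: HashSet<String>,
}

impl Odb {
    /// Search pack indexes first; only on a miss consult the promisor set.
    fn lookup(&self, id: &str) -> Lookup {
        if let Some(data) = self.present.get(id) {
            return Lookup::Found(data.clone());
        }
        if self.promised.contains(id) {
            Lookup::Promised
        } else {
            Lookup::Unknown
        }
    }
}

fn main() {
    let odb = Odb {
        present: [("blob1".to_string(), b"data".to_vec())].into(),
        promised: HashSet::from(["blob2".to_string()]),
    };
    assert_eq!(odb.lookup("blob1"), Lookup::Found(b"data".to_vec()));
    assert_eq!(odb.lookup("blob2"), Lookup::Promised);
    assert_eq!(odb.lookup("blob3"), Lookup::Unknown);
    println!("ok");
}
```

The interesting part is the `Promised`/`Unknown` distinction, which is exactly the check git skips "for efficiency reasons".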
Thanks so much for the update - exciting!
I love that you already cooked up a version of the connectivity check and that it apparently runs very fast, too! Too fast, if you ask me, as often-run algorithms in `git` are usually very well optimized, so that even being as fast as `git` can be hard at times. Unless, of course, one finds a better algorithm or, in the absence of that, it's one of the few inefficient implementations.
When looking at the times, I find both very slow anyway, as this is the `gitoxide` repository, which is a speck of dust compared to what I'd look at when comparing performance. My recommendation is to also try consecutively larger repos and see if the performance benefit remains.
> For efficiency reasons, no check as to whether the missing object is actually a promisor object is performed.
I think it's easy to do better than that, as this could simply be (more) configurable. Gitoxide already has its own `gitoxide` configuration section for that. I am thinking of having multiple modes with different performance characteristics. It's also possible to change settings per handle to the object database - this is used, for instance, to determine whether or not to re-read the object store to see new packs if an object wasn't found.
However, in this case I think what they do is reasonable, as they are assuming that if an object doesn't exist, it's a promised object, which during normal operation should be the case. And if one doesn't expect that for whichever reason, it would be trivial to change the configuration of the object handle before potentially hitting a lot of nonexistent objects.
Integrating a set of promisor objects for lookup would need to be done in the belly of the beast, which is the thread-safe part of the object database. That way, these objects can be shared among all handles, and one typically has a handle per thread at least.
Updating promisor objects lazily is the way to go, and it should be possible to integrate it well with everything else. But it's going to be tricky, as the object database is a lawless zone where borrowck has no power 😅 (and introducing races is easy unless great care is taken). But it's far from impossible as well.
An alternative implementation would be to keep the promisor dataset in the handle itself, so each thread would have to repeat the work necessary to build and maintain it - that's easy to implement as it's borrow-checked, but I guess one should try to do it properly first.
I seem to be missing a plan to integrate `gix fsck connectivity` - it's OK if it's a bit hacky even, as `gix` is a developer tool. Sometimes code starts out as a subcommand and then 'graduates' into `gix` (the crate), which is a path I can imagine for the connectivity check as well.
> Allow a set of missing objects to be built-up (on-demand) for a promisor pack

That seems reasonable; I just wanted to add that once an object is missing, one has searched all packs. Any number of these could be promisor packs. For perfect laziness, one would be able to tell the underlying shared ODB to find the given object in one of the promisor packs, whose 'promised' objects will be loaded on demand until the object is found. These promised objects would be kept in a shared read-only cache to accelerate future access. Of course, that also complicates the implementation, and I can imagine starting out with "load all promised objects, then check for membership" as well.
> ...if & only if the object isn't found, we'll (in parallel)...

(emphasis mine)
Probably not in parallel, as the ODB isn't allowed to spawn threads yet. It works by using the 'power' of the thread that uses the handle, and the current implementation can leverage this to speed up operations if a trampling horde comes in and needs to load packs - these will be loaded in parallel (if I remember correctly). But at first, I don't think there is a need to be super efficient; it's enough to have a single thread do the work and do so correctly. Trust me, the ODB core is a nightmare as borrowck isn't present and it's highly concurrent - things should be kept as simple as possible.
With that said, 'as simple as possible' might also mean to do it like git - assume a missing object is promised as long as there is a single promisor pack in the known set of packs. That seems like a fair assumption and would be my starting point - it saves the whole connectivity-check business and would allow keeping the ODB core mostly unchanged - it just needs a new flag to check if a promisor pack is there or not, and that's easy to implement correctly.
However, implementing the dynamic fetching of objects in such a way that it is thread-safe and correct, and ideally allows multiple threads to fetch objects at the same time - that's going to be an interesting problem to solve efficiently. Getting a first version done should be relatively straightforward though, as each thread (in the code of the handle) could just initiate its own connection and operate without shared state. It's just horribly inefficient to do so… fetch a pack with one object, then refresh the ODB to pick up the object, then search that pack specifically if you can, or re-search all packs. Horrible… for a single object.
You know, I think that even if it was implemented, having this in place would make any ordinary and unaware algorithm so painfully slow that people wouldn't want to wait around for the result anyway. And they'd end up with hundreds of packs of one object each as well (unless, maybe, one explodes these into loose objects; that's fair and probably has to be done to be viable).
Somehow I feel that on-demand loading of single objects shouldn't even be done, but instead one should make sure that each algorithm can deal with promisor objects. A graph-walk, of course, would be in trouble as it has to discover everything, but maybe that's a special case, too?
With all that said, please feel free to digest this into a potentially new (or changed) list of things you'd want to implement.
> I love that you already cooked up a version of the connectivity check and that it apparently runs very fast, too! Too fast, if you ask me, as usually often-run algorithms in `git` are very well optimized so that even being as fast as `git` can be hard at times.
I should clarify one thing here. My first attempt at doing this yielded results like so:
cesfahani@diesel-tr:~/gitoxide-blobless$ time gix fsck connectivity HEAD > /dev/null
real 0m2.355s
user 0m0.892s
sys 0m1.445s
This is twice as slow as `git`. I then ran this with `flamegraph`, and noticed the large amount of time spent in `consolidate_with_disk_state`, which led me to `Handle::refresh`. I then added a `repo.objects.refresh_never()` at the top of my connectivity check, and that got me to the performance I showed earlier (twice as fast as `git`).
I'm not sure if this is cheating or not... I had expected that `git` wouldn't be doing any sort of auto-refresh from disk during operations that walk a commit graph like this, but maybe I'm wrong? When dealing with promisor packs, refreshing from disk when an object is missing is a performance-killer (as seen here), as that is fully expected to happen regularly.
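The refresh-on-miss behavior that showed up in the profile is essentially a retry loop around the lookup. A sketch of it, with `refresh_never` modeled as a simple flag on a toy handle (not the actual `gix-odb` API, where the refresh policy is configured on the handle):

```rust
use std::collections::HashSet;

/// Toy ODB handle. In the real `gix-odb`, a miss can trigger a re-read of
/// on-disk state (to pick up packs written since the last refresh) unless
/// the handle was configured to never refresh.
struct Handle {
    loaded: HashSet<&'static str>,
    on_disk: HashSet<&'static str>,
    refresh_never: bool,
    refresh_count: u32,
}

impl Handle {
    fn contains(&mut self, id: &'static str) -> bool {
        if self.loaded.contains(id) {
            return true;
        }
        if !self.refresh_never {
            // Expensive in a partial clone: every promised object is a
            // miss, and every miss re-reads disk state.
            self.refresh_count += 1;
            self.loaded.extend(self.on_disk.iter().copied());
            return self.loaded.contains(id);
        }
        false
    }
}

fn main() {
    let mut h = Handle {
        loaded: HashSet::from(["blob1"]),
        on_disk: HashSet::from(["blob1", "blob2"]),
        refresh_never: false,
        refresh_count: 0,
    };
    assert!(h.contains("blob2")); // found, but only after a refresh
    assert_eq!(h.refresh_count, 1);

    h.refresh_never = true;
    assert!(!h.contains("promised-blob")); // miss, no refresh attempted
    assert_eq!(h.refresh_count, 1);
}
```

With many expected misses, the refresh branch dominates, which is why disabling it helped so much here.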
> When looking at the times, I find both very slow anyway, as this is the `gitoxide` repository which is a speck of dust compared to what I'd look at when comparing the performance. My recommendation is to also try consecutively larger repos and see if the performance benefit remains.
I also did some tests using one of the larger repositories my company has. It's proprietary, but I can assure you it is quite big. The bare repository consumes 11.5GB and includes well over 100K commits. Here's what things look like with that:
cesfahani@diesel-tr:~/proprietary-repo$ time gix fsck connectivity HEAD | wc -l
241162
real 0m6.702s
user 0m6.151s
sys 0m1.377s
cesfahani@diesel-tr:~/proprietary-repo$ time git rev-list --objects --quiet --missing=print HEAD | wc -l
241162
real 0m7.187s
user 0m2.940s
sys 0m4.133s
In that same repository, commenting out my call to `repo.objects.refresh_never()` makes things quite a bit slower:
cesfahani@diesel-tr:~/proprietary-repo$ time gix-r fsck connectivity HEAD | wc -l
241162
real 0m16.402s
user 0m8.420s
sys 0m9.136s
I'd love any insight you may have on the conditions in which `git` refreshes packs & indexes from disk during a single operation, but otherwise I'm happy to dig into it.
> However, in this case I think what they do is reasonable, as they are assuming that if an object doesn't exist, it's a promised object which during normal operation should be the case. And if one doesn't expect that for whichever reason, it would be trivial to change the configuration of the object handle before potentially hitting a lot of nonexistent objects.
The more I experiment with this, the more I agree. As you said, controlling this via configuration would be ideal. I can imagine a different git implementation (`gitoxide`?) where users can disable automatically contacting the remote for certain subcommands, and instead require them to opt in to such behavior. This would also be helpful for incrementally implementing optimized algorithms that are promisor-aware.
Take, for example, `git blame`. Right now, if you run this on a blobless repository, it will go and (painfully) fetch each blob in the tree's history, one at a time. Each will end up with its own pack file, which will typically cause gc thresholds to be exceeded, so you'll get background gc happening in the midst of all this chaos.
Instead of `git blame foo.c`, if you were to run the following, then you'd end up with the behavior you really intended (a single fetch of all missing objects as a single pack, and then the `blame`):
git rev-list --objects --quiet --missing=print HEAD -- foo.c | cut -c2- | git fetch --no-write-fetch-head --refetch --stdin origin
git blame foo.c
I can understand some users dealing with smaller repositories being okay with the default behavior, but being able to configure different behavior would be fantastic IMO. Hopefully all algorithms can be made to be promisor aware, in which case this is a non-issue.
> Updating promisor objects lazily is the way to go, and it should be possible to integrate it well with everything else. But it's going to be tricky as the object database is a lawless zone where borrowck has no power

My core background is mostly C++, so the borrow checker is a new-found luxury to me. I'm more than happy to have a go at doing it properly first. If that proves to be too complicated, then I can fall back to a per-thread (`Handle`) set of promisor objects.
> I seem to be missing a plan to integrate `gix fsck connectivity` - it's OK if it's a bit hacky even as `gix` is a developer tool.

Happy to hear that! Let me go ahead and polish up what I have (e.g. make it not panic upon any sort of error condition) and then I'll get an MR up.
> For perfect laziness one would be able to tell the underlying shared ODB to find the given object in one of the promisor packs, whose 'promised' objects will be loaded on demand until the object was found.
Yesssss - I like this very much!
> Somehow I feel that on-demand loading of single objects shouldn't even be done, but instead one should make sure that each algorithm can deal with promisor objects.
I 100% agree here, and my immediate use-cases have an anti-goal for that sort of lazy & inefficient fetching of one object at a time. I'm more than happy to remove that from the task list.
As described in my `git blame` example above, if a user attempts to perform an operation that is going to end up slowly and inefficiently fetching single-object packs (because the algorithm isn't promisor-aware), I'd prefer them to just be told that at least one object is missing (assuming they have a promisor remote configured), and possibly be given a hint as to how to remediate the situation. This would, of course, be a stop-gap until various algorithms can be adjusted to be aware of the promisor scenario, but I think that's perfectly fine.
> With all that said, please feel free to digest this into a potentially new (or changed) list of things you'd want to implement.
I'll update the task list in the original issue description now, but I'm also looking forward to continuing to iterate on ideas!
Thanks again @Byron. I fully understand how much effort it takes to ramp up a new (prospective) contributor to a software project, and I'll do my best to make sure you haven't wasted your time!
> I'm not sure if this is cheating or not... I had expected that git wouldn't be doing any sort of auto-refresh from disk during operations that walk a commit graph like this, but maybe I'm wrong?

There is no cheating, and turning off automatic refresh on miss seems like the right thing to do. I'd expect `git` to do the same.
Looking at the distribution of `user` and `sys` times, I notice that `gix` spends too much time in userland compared to `git`, and `git` spends suspiciously large amounts in `sys` land. There might be another opportunity for optimization there, who knows. It's hard to say without seeing the code, and I am looking forward to that PR.
> I'd love any insight you may have on the conditions in which git refreshes packs & indexes from disk during a single operation, but otherwise I'm happy to dig into it.
I don't think there is any need if it seems to work and is faster, it's a win I am happy to (tentatively) take :D.
> Hopefully all algorithms can be made to be promisor aware, in which case this is a non-issue.
I think `gitoxide` is in a great position to do it right™️ from the beginning, and for many commands it shouldn't be too hard if they have a way to get their working set of objects beforehand. However, and while blobless repos seem easiest, treeless repos appear to pose their very own problem that I'd probably simply ignore. If trees are missing and you have to deal with `path/to/foo.rs`, there is no other way but to painfully retrieve one tree at a time (`root`, `path`, `to`, finally the blob `foo.rs`). Maybe this works by getting the entire tree recursively though, so a blame that can deal with treeless and blobless repos would 'simply' have to make two passes: the first gets all the trees, the second gets the blobs it actually needs. Somewhat involved, but doable, even though I always see it as something implemented on top. Maybe that 'on top' will end up being a set of utilities with common algorithms that can be mixed and matched, let's see.
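The tree-fetching pass above can be sketched with toy types (nothing here mirrors real gitoxide APIs; `Store`, `Object`, and string oids are all stand-ins, and the `remote` map plays the role of a batched promisor-remote fetch). The point is that batching per depth level means one round-trip per path component rather than one per tree; a second pass, not shown, would then fetch only the blobs actually needed:

```rust
use std::collections::HashMap;

#[derive(Clone)]
enum Object {
    /// children as (oid, is_tree) — real tree entries carry a mode,
    /// so the walker knows a child's kind without fetching it
    Tree(Vec<(String, bool)>),
    Blob,
}

struct Store {
    local: HashMap<String, Object>,
    /// stand-in for what the promisor remote would return on a batched fetch
    remote: HashMap<String, Object>,
}

impl Store {
    /// Pass 1 of the two-pass idea: fetch all missing trees reachable from
    /// `root`, one batched request per tree depth. Returns the number of
    /// fetch rounds needed; panics if the remote lacks a referenced tree.
    fn fetch_missing_trees(&mut self, root: &str) -> usize {
        let mut rounds = 0;
        let mut frontier = vec![root.to_string()];
        while !frontier.is_empty() {
            let missing: Vec<String> = frontier
                .iter()
                .filter(|oid| !self.local.contains_key(*oid))
                .cloned()
                .collect();
            if !missing.is_empty() {
                rounds += 1; // one batched round-trip for this depth level
                for oid in missing {
                    let obj = self.remote[&oid].clone();
                    self.local.insert(oid, obj);
                }
            }
            let mut next = Vec::new();
            for oid in &frontier {
                if let Object::Tree(children) = &self.local[oid] {
                    for (child, is_tree) in children {
                        if *is_tree {
                            next.push(child.clone()); // blobs wait for pass 2
                        }
                    }
                }
            }
            frontier = next;
        }
        rounds
    }
}
```

For the `path/to/foo.rs` example this costs three rounds (`root`, `path`, `to`) even with batching, which is why treeless repos remain the painful case.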
If that proves to be too complicated, then I can fall back to a per-thread (`Handle`) set of promisor objects.
I just repeat this here to say that I think we share the understanding that the ODB won't ever auto-fetch promised objects, but that there may be an API (not necessarily on the ODB though) to fetch a set of objects and deal with promisor-related things as needed.
That way, it will only work with promisor-aware algorithms, but that's fine, while I'd expect 'the machinery' to upgrade all `object not found` errors to also include a hint in case there are promisor packs present. Again, this is based on the assumption that fetching objects one at a time is a non-starter anyway, while seriously complicating the already complex lock-free ODB implementation we have today.
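What such an upgraded error might look like can be sketched as follows (hypothetical types, not gitoxide API; the real error machinery lives elsewhere, this only illustrates the hint):

```rust
use std::collections::HashMap;

/// Hypothetical lookup result: the ODB never auto-fetches on a miss, but a
/// miss carries enough context to render a promisor-related hint.
#[derive(Debug, PartialEq)]
enum Lookup {
    Found(Vec<u8>),
    NotFound { oid: String, maybe_promised: bool },
}

fn lookup(odb: &HashMap<String, Vec<u8>>, oid: &str, has_promisor_packs: bool) -> Lookup {
    match odb.get(oid) {
        Some(data) => Lookup::Found(data.clone()),
        None => Lookup::NotFound {
            oid: oid.to_string(),
            // the hint only claims the object *may* be promised; we assume
            // every referenced-but-missing object is, as discussed above
            maybe_promised: has_promisor_packs,
        },
    }
}

fn describe(result: &Lookup) -> String {
    match result {
        Lookup::Found(_) => "ok".to_string(),
        Lookup::NotFound { oid, maybe_promised: true } => {
            format!("object {oid} not found; it may be available from a promisor remote")
        }
        Lookup::NotFound { oid, maybe_promised: false } => {
            format!("object {oid} not found")
        }
    }
}
```

The hint costs nothing on the lookup hot path since the promisor-pack check can be done once at repository-open time.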
I'll update the task list in the original issue description now, but I'm also looking forward to continuing to iterate on ideas!
I took a look (and updated the formatting a little) and would like to point out that I don't think any time should be spent on obtaining a set of promised objects, as it seems sufficient to assume every missing (but referenced) object is promised. I further edited the task list by striking through what I think should change, with annotations, and kindly ask you to acknowledge it by removing these edited portions or substituting them with something that fits better. It's really just an attempt to collaborate on the task list without me repeating/quoting everything here. If it doesn't work, please let me know; otherwise, great :).
Thanks again @Byron. I fully understand how much effort it takes to ramp up a new (prospective) contributor to a software project, and I'll do my best to make sure you haven't wasted your time!
That is great to hear. You are dedicated to the feature, and I hope that you (and the company backing you) will be able to support partial clones from here on out :). On the bright side, even if that doesn't happen (nobody can predict the future), I hope we get to the point where we can:
With the building blocks above it will be possible for each algorithm to eventually pre-fetch objects as needed, and until then, to fail relatively gracefully with 'object '
From there, in piecemeal fashion, one can add promisor support wherever it is needed.
Edit: I can imagine that we'd want to add a function to `gix` to collect statistics about promised objects, essentially doing a connectivity check while listing all missing objects by type and count (i.e. 42 trees, 500 blobs), just for development and informational purposes. I don't think that this would be used for anything except, maybe, also adding a way to fetch all missing trees and blobs on the command line. That would then allow users to make any algorithm work after it failed due to missing objects, at the cost of possibly fetching too much. Such a tool could even provide hints at how big that pack would be when fetched.
@Byron Thanks for the (extensive) help on getting #1081 merged!
I've started work on getting a minimal blob-filtered clone going - it was surprisingly not too difficult to get a (rough) first-pass functioning as expected. `gix-protocol` already supported this at the low level, so the only changes I've needed so far are to the `gitoxide-core`, `gix`, and `gitoxide` crates.
My WIP branch is available here, as an FYI. There is still much to do on my end, and it isn't yet ready for review.
what is missing @cesfahani ? this would most likely be the feature with the most value to speed up git clones in cargo, see https://github.com/rust-lang/cargo/issues/11165.
@soloturn I have this all working in a fork, with some caveats:
The fork I've been most recently working out of is hosted in my company's private git service, but there's no reason I can't push it up to my Github fork. I've been meaning to do that - just haven't gotten around to it yet. I'm happy to get back into this now.
I have a couple PR's to make for simple bug fixes, then I'll need some input on how best to expose the API to allow an explicit set of objects to be fetched (without attempting any negotiation rounds, etc). The way I did this in my fork was a total hack (duplicated a bunch of code), but I really just wanted to get something working.
Expect to see some progress from me this week!
not sure if i understand correctly. the most practical flag, at least for rust's cargo, is `--filter=tree:0`, as it would make it easier to check out different versions. is this supported?
Summary 💡
Tracking a minimal implementation of partial clones and their corresponding promisor objects/remotes, as discussed in #1041.
What is a partial clone?
Partial clone is a recent(ish) feature of `git` that allows the client to fetch a subset of objects from a remote repository based on criteria (i.e. a "filter") of its choosing. The missing objects are referred to as "promisor objects", and are expected to be provided by "promisor remotes" on demand after the clone, as needed.
The most common use-cases of partial clone are where the client requests a clone with either no historical blobs (e.g. `--filter=blob:none`), or only historical blobs under some size threshold (e.g. `--filter=blob:limit=512k`). Tree objects can also be filtered by a partial clone, however that use-case is far less common.
Lessons learned from `git`
Because partial clone was retrofitted into `git`, there are several performance gaps that have not yet been resolved. Operations like `fetch` and `checkout` behave exactly as one would expect: the missing objects are fetched in a single transaction with the remote. Other operations, such as `blame` and `rebase`, do not do this, and instead end up lazily fetching missing objects one at a time (each with a separate transaction to the remote), which significantly slows things down.
To implement partial clones efficiently, operations that traverse history and require inspecting the contents of blobs and trees need to determine their working set of missing objects up front and fetch them from the promisor remote in a single batched transaction.
That said, this feature does not aim to implement the optimized approach to partial clones across the board. However, we would like to see APIs designed to facilitate the optimized approach, and possibly one implementation of it to serve as a reference and proof that things can be made to work as expected.
Tasks
- Implement a connectivity check (e.g. `gix fsck connectivity`)
- Set `remote.<name>.partialclonefilter`, and set `remote.<name>.promisor` to `true`, if a partial clone filter was provided
- Expose partial-clone filters in the `gix` CLI the way `git` does, verifying that `partialclonefilter` and `promisor` are set appropriately
- Honor the `promisor` and `partialclonefilter` config settings on the remotes
- Mark packs fetched from a promisor remote as `promisor` packs