cesfahani opened 1 year ago
Thanks a lot for putting this feature on the radar permanently, and for starting the discussion about it.
> I'm not seeing anything like this [verification of blobs/fsck] today in `gitoxide`, but I may be missing something...
There are various `verify()` methods that mostly validate indices and their packs directly. I believe this is implemented for entire object databases as well. Since these aren't based on graph traversal, while ignoring `.promisor` packs, they probably wouldn't even fail. In order to get there, one might first want to implement a connectivity check, which also isn't present right now (so `gitoxide` implicitly trusts packs produced by a server).
`gix` (the crate) serves as the hub for all `gix-*` plumbing crates. These themselves can refer to other plumbing crates but avoid doing so if possible. Thus, many crate boundaries are 'crossed' by passing closures which provide certain kinds of data, like git-attributes or object data. The latter typically results in `impl FnMut(&gix_hash::oid, &mut Vec<u8>) -> Result<gix_object::ObjectType, E>`.
Even though this represents the typical case of 'fetch one object at a time' which can be retrofitted quite easily, I'd recommend making this a shared trait that represents an object database. That way, multi-fetching facilities can be provided later. A great candidate for using these would be `gix-worktree-state::checkout()`, which happens to know all of the required blobs in advance as it has access to the index. Please note that such traits exist both in `gix-pack` and `gix-odb`, with different levels of capabilities catered to their respective use-cases. A shared trait would live in neither of these crates but would need a new one, so it can be depended on by plumbing crates which previously avoided depending on `gix-odb` in favor of closures.
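To make the idea concrete, a shared object-database trait of the kind described might look roughly like the sketch below. All names here (`FindObject`, `Kind`, `MemoryOdb`) are hypothetical stand-ins, not gitoxide's actual API; the real traits live in `gix-pack` and `gix-odb` with richer signatures.

```rust
use std::collections::HashMap;

/// Hypothetical object kind, standing in for the idea of `gix_object::Kind`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind { Blob, Tree, Commit, Tag }

/// Sketch of the shared trait: single-object lookup now, with room to add
/// multi-fetch methods later for callers like checkout that know their
/// working set in advance.
trait FindObject {
    /// Write the decompressed object data into `buf` and return its kind,
    /// or `None` if the object is absent (e.g. a promised object).
    fn find(&self, id: &str, buf: &mut Vec<u8>) -> Option<Kind>;
}

/// Toy in-memory backend to show how a caller would use the trait.
struct MemoryOdb(HashMap<String, (Kind, Vec<u8>)>);

impl FindObject for MemoryOdb {
    fn find(&self, id: &str, buf: &mut Vec<u8>) -> Option<Kind> {
        let (kind, data) = self.0.get(id)?;
        buf.clear();
        buf.extend_from_slice(data);
        Some(*kind)
    }
}

fn main() {
    let mut objects = HashMap::new();
    objects.insert("abc123".to_string(), (Kind::Blob, b"hello".to_vec()));
    let odb = MemoryOdb(objects);

    let mut buf = Vec::new();
    assert_eq!(odb.find("abc123", &mut buf), Some(Kind::Blob));
    assert_eq!(buf, b"hello".to_vec());
    // A promised object behaves like any absent object: simply not found.
    assert_eq!(odb.find("def456", &mut buf), None);
    println!("ok");
}
```

A trait object or generic bound over something like this could replace the `FnMut` closures at crate boundaries without forcing a dependency on a concrete ODB crate.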
Of course, to get the index one has to build it from a tree, which now may have promisor trees on the way to the blobs/leaves, so building an index from a tree would have to be adjusted first.
All of the above implicitly assumes that fetches with filters work in the first place.
What matters most to me is that it's easy to determine if promisor objects are present at all to be able to bypass any kind of 'project-gathering-dry-run' unless it's known to be required. In other words, everything should stay as fast as it is without promisor objects. I think this shouldn't be a problem though as it's easy to determine if promisor objects are present in the ODB or not.
Thanks @Byron. Things are slowly starting to get pieced together in my head.
> There are various `verify()` methods that mostly validate indices and their packs directly. I believe this is implemented for entire object databases as well. Since these aren't based on graph traversal, while ignoring `.promisor` packs, they probably wouldn't even fail. In order to get there, one might first want to implement a connectivity check which also isn't present right now (so `gitoxide` implicitly trusts packs produced by a server).
That sounds like a good starting point to me. After playing around with the various `verify()` methods, nothing seems to be doing a full walk of every commit and checking that all trees and blobs are reachable. I was able to confirm that with a partial clone, I can diff the enumerated entries from the `HEAD` and `HEAD~1` trees and see the expected missing blob sizes (conveniently, this is available in the `gix` CLI):
$ diff <(gix tree entries -r -e HEAD) <(gix tree entries -r -e HEAD~1)
331c331
< BLOB c8055d104958c490dd496f49dcd74d1f41dedbf0 3565 gix-attributes/src/state.rs
---
> BLOB 585e05adcdd9f4b57aa4f93b44dbeeb6ba760544 gix-attributes/src/state.rs
583c583
< BLOB 5e5a1e6e03f919f476ea4ae4ed7363da1e6eed42 8330 gix-url/src/parse.rs
---
> BLOB 8e1a57db1a04e81218c6a7f8ca5cf8cf5bdd5b9a gix-url/src/parse.rs
688c688
< BLOB f327cab204b22c7ead3e618a73d92704ef163dfa 16137 gix-attributes/src/search/outcome.rs
---
> BLOB 9ad8a81e1d771fdda65d9260e2d4bf120cd489c9 gix-attributes/src/search/outcome.rs
1104c1104
< BLOB c694334dddf4b8e32d8510fa05d5a5dd566bd807 4087 gix-url/tests/parse/mod.rs
---
> BLOB 1309fd133f29aa01b98cb17b65f034f481410f7c gix-url/tests/parse/mod.rs
1627d1626
< BLOB dd3b2c1d43d2f1af52444804bde68d990fd2c0fb 149713 gix-url/tests/fixtures/fuzzed/very-long2.url
I see three possible levels of connectivity checks, ranging from the simplest/fastest to the most complex/slowest (I'm ignoring some less-common ones for now):

1. Take an `oid` or `rev-spec` as an input, then perform the equivalent of `git fsck --connectivity-only --no-dangling <rev-spec>`. This would be relatively fast, as we're only walking a single ancestry tree, and don't have to worry about unreachable or dangling objects.
2. No `rev-spec` explicitly provided. This would be the equivalent of `git fsck --connectivity-only --no-dangling`, and would use the index, all `refs`, and all reflogs as heads.
3. The equivalent of `git fsck --connectivity-only`, which (in addition to the above) also prints out all objects that are not directly reachable by any other object. This will be more computationally intensive and will require significantly more runtime memory, as we'd need to (I think?) maintain a table of unreachable objects, inserting into it as we traverse new objects, and pruning from it as we find object references. Looks like `gix-hashtable` already implements the required data structure, but because this operation needs to start with a set of all objects and then walk the ODB to determine the reachability of each, optimizing it seems like it'll take some thought.

How about I start with just 1 & 2 above? Those seem relatively straightforward, and I believe they'd satisfy the immediate goal of making it easier to implement & test promisor objects. The third one can come later.
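The first two levels boil down to a graph walk from one or more tips that records every referenced-but-absent object. A minimal sketch over a toy object graph (plain strings stand in for oids; no gix types involved, and the adjacency map is a stand-in for real object parsing):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Toy ODB: each present object maps to the ids it references.
/// Absent ids model promised (filtered-out) objects.
type Odb = HashMap<&'static str, Vec<&'static str>>;

/// Walk from `tips`, returning every id that is referenced but not present,
/// i.e. roughly what `git rev-list --objects --missing=print` reports.
fn missing_objects(odb: &Odb, tips: &[&'static str]) -> HashSet<&'static str> {
    let mut seen = HashSet::new();
    let mut missing = HashSet::new();
    let mut queue: VecDeque<_> = tips.iter().copied().collect();
    while let Some(id) = queue.pop_front() {
        if !seen.insert(id) {
            continue; // already visited
        }
        match odb.get(id) {
            Some(children) => queue.extend(children.iter().copied()),
            None => {
                // Referenced but absent: in a partial clone, a promised object.
                missing.insert(id);
            }
        }
    }
    missing
}

fn main() {
    // commit -> tree -> { present blob, filtered-out blob }
    let mut odb = Odb::new();
    odb.insert("commit1", vec!["tree1"]);
    odb.insert("tree1", vec!["blob-present", "blob-filtered"]);
    odb.insert("blob-present", vec![]);

    let missing = missing_objects(&odb, &["commit1"]);
    assert_eq!(missing, HashSet::from(["blob-filtered"]));
    println!("missing: {missing:?}");
}
```

Level 1 passes a single resolved tip; level 2 would seed `tips` with the index, all refs, and all reflog entries instead.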
If that sounds reasonable, then my next question would be: where should this implementation go? Wherever it ends up, it will need to utilize functionality from several existing crates (via the `gix` crate, as you described), but I'm feeling like this might warrant a new `gix-fsck` crate.
> (conveniently, this is available in the `gix` CLI):

That's clever - I forgot `gix tree entries` is even a thing 😅, especially one that can display the sizes of objects.
> How about I start with just 1 & 2 above?

Do I understand correctly that the reason for the connectivity check to be required is that one has to create the `.promisor` pack (or its index, maybe) oneself, based on the objects that have been omitted in the received pack?
> (via the `gix` crate, as you described)

Maybe it's just me misunderstanding what you meant to say, but in general, `gix-*` crates can't ever reach 'up' to the `gix` crate. They may use each other though, with careful consideration. `gix-fsck` sounds like a crate one wants to have, especially if `git` internally also implements connectivity checks using `git fsck` invocations.
When I thought about connectivity checks (independently of promisor packs or indices) I always thought that this should be integrated tightly with the current pack resolution machinery. It sees all objects in the packs and learns about their hashes (these have to go into the index after all), and it's perfectly multi-threaded. Ideally, connectivity checks could be implemented there with minimal performance loss. Since the `index` file will already contain the set of all objects, maybe it's possible to reduce the memory requirements by referring to a memory-mapped version of it.
I know, I know, all that sounds like trouble and a separate walk-based implementation would be easier to do, but it would also be so much slower. Maybe with some thinking, connectivity checks can be built into the pack resolution, but if not, doing it in a separate step seems alright as well.
> Do I understand correctly that the reason for the connectivity check to be required is that one has to create the `.promisor` pack (or its index, maybe) oneself, based on the objects that have been omitted in the received pack?

A connectivity check isn't required by any means, at least not for implementing promisor packs. I had thought you were recommending doing that as a starting point (which I do agree is a good idea) based on this comment:
> In order to get there, one might first want to implement a connectivity check which also isn't present right now
I think this would definitely be helpful to have, as it would let me verify that the expected promisor objects are fetched/resolved during different operations on a partially cloned repository. Even before I have fetch with filters implemented, a connectivity check would let me verify that `gitoxide` reports the same missing/promisor objects as `git` when run against a test fixture repo. I'd be looking to compare the connectivity check's result to something like:
git rev-list --objects --missing=print HEAD | grep -E '^\?\w+'
> Maybe it's just me misunderstanding what you meant to say, but in general, `gix-*` crates can't ever reach 'up' to the `gix` crate.

Nope - this was totally my own misunderstanding of what you had said in your last comment about `gix` being "the hub" for all plumbing crates. With the help of cargo-depgraph, I was able to get a much better understanding of the relationships between the various `gitoxide` crates. In case anyone's interested... `cargo depgraph --workspace-only | dot -Gsplines=ortho -Tsvg > graph.svg` - super cool stuff!
> When I thought about connectivity checks (independently of promisor packs or indices) I always thought that this should be integrated tightly with the current pack resolution machinery. It sees all objects in the packs and learns about their hashes (these have to go into the index after all), and it's perfectly multi-threaded.
That sounds perfectly reasonable. I'll start digging into how this would best integrate into the current pack indexing & lookup architecture.
> Since the `index` file will already contain the set of all objects, maybe it's possible to reduce the memory requirements by referring to a memory mapped version of it.

I thought that was already implemented here? But yes - `mmap`'ing the index files is definitely the right way to go.
Even so, the third connectivity-check mode I mentioned in my last comment still has the added requirement of marking each object that is reachable/referenced from any other object, and then, once the traversal is complete, reporting all objects that were not marked accordingly (i.e. "dangling objects"). While the traversal algorithm can (and should) certainly be highly parallelized, there is still a non-trivial memory requirement to maintain the data structure of dangling objects. `git` does this with `struct repository`'s `parsed_objects`. Each object in there has a `flags` field, and the connectivity check sets the `REACHABLE` bit as it traverses the repository. `gitoxide` can certainly do better, but I'm not sure there's any way to get around the memory requirement of needing to maintain a mapping from every object to an indicator of whether or not it is reachable...
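That per-object reachability marker amounts to one bit (or one set membership) per object. A sketch of the marking pass, using std collections in place of `gix-hashtable` and toy string ids in place of real oids:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Toy ODB: each present object maps to the ids it references.
type Odb = HashMap<&'static str, Vec<&'static str>>;

/// Mark everything reachable from `heads`, then report the objects that were
/// never marked - the "dangling" set that `git fsck --connectivity-only`
/// prints. Set membership here stands in for git's per-object REACHABLE flag.
fn dangling(odb: &Odb, heads: &[&'static str]) -> HashSet<&'static str> {
    let mut reachable = HashSet::new();
    let mut queue: VecDeque<_> = heads.iter().copied().collect();
    while let Some(id) = queue.pop_front() {
        if reachable.insert(id) {
            if let Some(children) = odb.get(id) {
                queue.extend(children.iter().copied());
            }
        }
    }
    // Start from the set of all objects and subtract the reachable ones.
    odb.keys().filter(|id| !reachable.contains(*id)).copied().collect()
}

fn main() {
    let mut odb = Odb::new();
    odb.insert("commit1", vec!["tree1"]);
    odb.insert("tree1", vec!["blob1"]);
    odb.insert("blob1", vec![]);
    odb.insert("orphan-blob", vec![]); // written but never referenced

    assert_eq!(dangling(&odb, &["commit1"]), HashSet::from(["orphan-blob"]));
    println!("ok");
}
```

The memory cost is exactly what the paragraph above describes: the full object set plus one reachability indicator each, which is why starting from a memory-mapped index is attractive.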
I'm not planning on addressing that use-case of scanning for dangling objects right now, as it's entirely unrelated to partial clones. But it will be something to consider once features like garbage collection come to a head.
Thanks again @Byron! Please let me know if I've misunderstood anything, but otherwise I think I've got plenty of info to get me started.
I think there are a couple of things worth untangling, from my perspective:
With all that said, I am not convinced that connectivity checks integrated into pack resolution are a good idea, and if truly needed, they should be done as a standalone capability. For testing and verification it might also be possible to start out by just using `git` directly for that. It's your choice though.
It might seem I am changing my mind a lot, which might be due to my utter lack of understanding of how `git` actually implements this capability. The current way of working is that you lead the feature and I support with `gix`-specific knowledge, but I welcome you to share all you know and update the tracking ticket with links to resources and/or `git` source code as we progress, which should help me to catch up.
Apologies for the delay! I had some other stuff come up, but have also been spending time trying to better understand the intricate details of `git` partial clones, which I feel I have a much better handle on now.
Before I dive into what I've learned about partial clones in `git`, here's a quick taste of where I've gotten to in identifying missing/promisor objects in `gitoxide`, and how its performance compares to `git`.
Note: Please ignore the `fsck connectivity` arguments to `gix` below. I had started out with a connectivity check in mind, but since then I've just been using that subcommand as my playground for testing things out.
# Perform a blobless clone of gitoxide
cesfahani@diesel-tr:~$ git clone --filter=blob:none https://github.com/byron/gitoxide.git gitoxide-blobless
cesfahani@diesel-tr:~$ cd gitoxide-blobless
# Check number of missing objects with git
cesfahani@diesel-tr:~/gitoxide-blobless$ git rev-list --objects --quiet --missing=print HEAD | wc -l
45439
# Check number of missing objects with gix
cesfahani@diesel-tr:~/gitoxide-blobless$ gix fsck connectivity HEAD | wc -l
45439
# Yay! Now let's diff the (sorted) contents to double-check...
cesfahani@diesel-tr:~/gitoxide-blobless$ diff <(gix fsck connectivity HEAD | sort) <(git rev-list --objects --quiet --missing=print HEAD | cut -c2- | sort)
# Perfect match! Now let's check performance...
cesfahani@diesel-tr:~/gitoxide-blobless$ time git rev-list --objects --quiet --missing=print HEAD > /dev/null
real 0m1.210s
user 0m0.454s
sys 0m0.730s
cesfahani@diesel-tr:~/gitoxide-blobless$ time gix fsck connectivity HEAD > /dev/null
real 0m0.551s
user 0m0.458s
sys 0m0.093s
This is great, but even better is the fact that I did zero of that fearless parallelism you mentioned in #1041 - my naive implementation is single-threaded and written by someone entirely new to `gitoxide` (and also mostly new to Rust). Parallelizing this should be straightforward and I expect it will significantly improve performance, but it's awesome to know that my dumb first pass is already over 2X faster than `git`.
Now - about what I've learned about partial clones...
When we fetch a pack from a promisor remote, we end up with a `.promisor` file in addition to the usual `.pack`, `.idx`, and `.rev` files. I was initially a bit confused as to what the contents of this `.promisor` file were for... Interestingly, when doing a repack (e.g. via `git gc --aggressive`), I found the resulting single pack's `.promisor` file was empty, despite the original pack `.promisor` files having a bunch of object IDs in them.
If you read the first bullet of Handling Missing Objects, it says the promisor file has "arbitrary contents". Upon digging into the `git` source, I can confirm that the `.promisor` file is in fact never actually read - only its presence is important. You can see here that when doing a repack, `git` will just always write out an empty `.promisor` file. I expect that some old version of `git` used the contents of these files for something, but I'm not sure what. The Footnotes describe a possible explanation.
In the same Handling Missing Objects section, the second bullet describes how `git` decides that an object is a promisor object. This aligns with the `is_promisor_object` function definition. That function lazily creates a static set of missing promisor objects for each pack. You can see in `add_promisor_object` that it will walk each object in the pack and find anything it refers to that is not in the pack. It designates each such object as a promisor object.
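Reading `is_promisor_object`/`add_promisor_object` that way, the lazily built set is essentially "everything referenced from a promisor pack, minus everything actually present". A sketch under that reading (toy ids and a toy pack type, not git's actual data structures):

```rust
use std::collections::{HashMap, HashSet};

/// Toy promisor pack: the objects it contains, and for each object the ids
/// that object references.
struct PromisorPack {
    objects: HashMap<&'static str, Vec<&'static str>>,
}

/// Build the set git calls "promisor objects": everything referenced from a
/// promisor pack that is not itself present in any of the given packs.
fn promisor_objects(packs: &[PromisorPack]) -> HashSet<&'static str> {
    let present: HashSet<_> = packs
        .iter()
        .flat_map(|p| p.objects.keys().copied())
        .collect();
    packs
        .iter()
        .flat_map(|p| p.objects.values().flatten().copied())
        .filter(|id| !present.contains(id))
        .collect()
}

fn main() {
    let pack = PromisorPack {
        objects: HashMap::from([
            ("commit1", vec!["tree1"]),
            ("tree1", vec!["blob-kept", "blob-filtered"]),
            ("blob-kept", vec![]),
        ]),
    };
    let promised = promisor_objects(&[pack]);
    assert_eq!(promised, HashSet::from(["blob-filtered"]));
    println!("promised: {promised:?}");
}
```

In git the set is built on demand and cached; the sketch builds it eagerly for clarity.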
All this looks good; however, it doesn't align with all the behavior I see in `git`. I've noticed this before, but only now did I dig into it. Brace yourself - you're not going to like it...
In `git`, if I attempt to do something like `cat-file` on a totally invalid/non-existent object ID, you'll notice it takes a bit longer than you'd expect. Add some tracing onto that, and you'll see the gross behavior:
cesfahani@diesel-tr:~/gitoxide-blobless$ GIT_TRACE=2 git cat-file -p 0000000000111111111122222222223333333333
20:12:02.891575 git.c:463 trace: built-in: git cat-file -p 0000000000111111111122222222223333333333
20:12:02.892446 run-command.c:659 trace: run_command: git -c fetch.negotiationAlgorithm=noop fetch origin --no-tags --no-write-fetch-head --recurse-submodules=no --filter=blob:none --stdin
20:12:02.895664 git.c:463 trace: built-in: git fetch origin --no-tags --no-write-fetch-head --recurse-submodules=no --filter=blob:none --stdin
20:12:02.896036 run-command.c:659 trace: run_command: GIT_DIR=.git git remote-https origin https://github.com/byron/gitoxide.git
20:12:02.898562 git.c:749 trace: exec: git-remote-https origin https://github.com/byron/gitoxide.git
20:12:02.898612 run-command.c:659 trace: run_command: git-remote-https origin https://github.com/byron/gitoxide.git
fatal: remote error: upload-pack: not our ref 0000000000111111111122222222223333333333
fatal: Not a valid object name 0000000000111111111122222222223333333333
This unfortunate behavior is explained by the third bullet in Handling Missing Objects... At the end of that bullet, you'll find:
> For efficiency reasons, no check as to whether the missing object is actually a promisor object is performed.

I understand having to resort to this behavior if I had used a partial clone filter that filtered trees in addition to blobs, but `git` knows that I specified a blobless clone and not a treeless clone. Either way, the pack index will not contain any reference to the missing promisor objects, so the only way to know whether or not they are "promised" (short of asking the remote for them) is to see if anything in the promisor packs references them. A big question remains as to how efficient it could be in `gitoxide` to go check if an object ID is referenced by any promisor pack, and whether doing so would be more or less efficient than just asking the remote for the object.
@Byron I'd love to get an opinion on this. Right now my thinking for a minimal implementation is:

- Implement the `filter` fetch option
- Allow a set of missing objects to be built up (on-demand) for each promisor pack (identified by the `.promisor` file in each)
- In `gix-odb`, we'll first search the indexes of each pack (as done today), but if & only if the object isn't found, we'll (in parallel) go search for missing objects in each promisor pack (if any exist)
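That last lookup flow - regular pack indexes first, promisor membership only on a miss - can be sketched like this. The `Odb`/`Lookup` types are toy stand-ins for the real `gix-odb` machinery, and the promisor set is assumed to have been built already:

```rust
use std::collections::{HashMap, HashSet};

/// Result of an object lookup in a partial clone.
#[derive(Debug, PartialEq)]
enum Lookup {
    Found(Vec<u8>),
    Promised, // absent locally, but promised by a promisor pack
    Unknown,  // absent and not promised: a genuine error
}

/// Toy stand-in for the ODB: present objects plus the (lazily built)
/// set of objects promised by promisor packs.
struct Odb {
    present: HashMap<String, Vec<u8>>,
    promised: HashSet<String>,
}

impl Odb {
    /// Search pack indexes first; only on a miss consult the promisor set.
    fn lookup(&self, id: &str) -> Lookup {
        if let Some(data) = self.present.get(id) {
            return Lookup::Found(data.clone());
        }
        if self.promised.contains(id) {
            Lookup::Promised
        } else {
            Lookup::Unknown
        }
    }
}

fn main() {
    let odb = Odb {
        present: [("blob1".to_string(), b"data".to_vec())].into(),
        promised: HashSet::from(["blob2".to_string()]),
    };
    assert_eq!(odb.lookup("blob1"), Lookup::Found(b"data".to_vec()));
    assert_eq!(odb.lookup("blob2"), Lookup::Promised);
    assert_eq!(odb.lookup("blob3"), Lookup::Unknown);
    println!("ok");
}
```

The interesting part is the `Promised`/`Unknown` distinction, which is exactly the check git skips "for efficiency reasons".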
Thanks so much for the update - exciting!
I love that you already cooked up a version of the connectivity check and that it apparently runs very fast, too! Too fast, if you ask me, as often-run algorithms in `git` are usually very well optimized, so that even being as fast as `git` can be hard at times. Unless, of course, one finds a better algorithm or, in the absence of that, it's one of the few inefficient implementations.
When looking at the times, I find both very slow anyway, as this is the `gitoxide` repository, which is a speck of dust compared to what I'd look at when comparing performance. My recommendation is to also try consecutively larger repos and see if the performance benefit remains.
> For efficiency reasons, no check as to whether the missing object is actually a promisor object is performed.
I think it's easy to do better than that, as this could simply be (more) configurable. Gitoxide already has its own `gitoxide` configuration section for that. I am thinking of having multiple modes with different performance characteristics. It's also possible to change settings per handle to the object database - this is used, for instance, to determine whether or not to re-read the object store to see new packs if an object wasn't found.
However, in this case I think what they do is reasonable, as they are assuming that if an object doesn't exist, it's a promised object, which during normal operation should be the case. And if one doesn't expect that for whichever reason, it would be trivial to change the configuration of the object handle before potentially hitting a lot of nonexistent objects.
Integrating a set of promisor objects for lookup would need to be done in the belly of the beast, which is the thread-safe part of the object database. That way, these objects can be shared among all handles, and one typically has a handle per thread at least.
Updating promisor objects lazily is the way to go, and it should be possible to integrate it well with everything else. But it's going to be tricky, as the object database is a lawless zone where borrowck has no power 😅 (and introducing races is easy unless great care is taken). But it's far from impossible as well.
An alternative implementation would be to keep the promisor dataset in the handle itself, so each thread would have to repeat the work necessary to build and maintain it - that's easy to implement as it's borrow-checked, but I guess one should try to do it properly first.
I seem to be missing a plan to integrate `gix fsck connectivity` - it's OK if it's a bit hacky even, as `gix` is a developer tool. Sometimes code starts out as a subcommand and then 'graduates' into `gix` (the crate), which is a path I can imagine for the connectivity check as well.
> Allow a set of missing objects to be built-up (on-demand) for a promisor pack

That seems reasonable; I just wanted to add that once an object is missing, one has searched all packs. Any number of these could be promisor packs. For perfect laziness, one would be able to tell the underlying shared ODB to find the given object in one of the promisor packs, whose 'promised' objects will be loaded on demand until the object is found. These promised objects would be kept in a shared read-only cache to accelerate future access. Of course, that also complicates the implementation, and I can imagine starting out with "load all promised objects, then check for membership" as well.
> ...if & only if the object isn't found, we'll (in parallel)...

(emphasis mine)
Probably not in parallel, as the ODB isn't allowed to spawn threads yet. It works by using the 'power' of the thread that uses the handle, and the current implementation can leverage this to speed up operations if a trampling horde comes in and needs to load packs - these will be loaded in parallel (if I remember correctly). But at first, I don't think there is a need to be super efficient; it's enough to have a single thread do the work and do so correctly. Trust me, the ODB core is a nightmare as borrowck isn't present and it's highly concurrent - things should be kept as simple as possible.
With that said, 'as simple as possible' might also mean to do it like git - assume a missing object is promised as long as there is a single promisor pack in the known set of packs. That seems like a fair assumption and would be my starting point - it saves the whole connectivity-check business and would allow keeping the ODB core mostly unchanged - it just needs a new flag to check if a promisor pack is there or not, and that's easy to implement correctly.
However, implementing the dynamic fetching of objects in such a way that it is thread-safe and correct, and ideally allows multiple threads to fetch objects at the same time - that's going to be an interesting problem to solve efficiently. Getting a first version done should be relatively straightforward though, as each thread (in the code of the handle) could just initiate its own connection and operate without shared state. It's just horribly inefficient to do so… fetch a pack with one object, then refresh the ODB to pick up the object, then search that pack specifically if you can, or re-search all packs. Horrible… for a single object.
You know, I think that even if it was implemented, having this in place would make any ordinary and unaware algorithm so painfully slow that people wouldn't want to wait around for the result anyway. And they'd end up with hundreds of packs of one object each as well (unless, maybe, one explodes these into loose objects; that's fair and probably has to be done to be viable).
Somehow I feel that on-demand loading of single objects shouldn't even be done, but instead one should make sure that each algorithm can deal with promisor objects. A graph-walk, of course, would be in trouble as it has to discover everything, but maybe that's a special case, too?
With all that said, please feel free to digest this into a potentially new (or changed) list of things you'd want to implement.
> I love that you already cooked up a version of the connectivity check and that it apparently runs very fast, too! Too fast, if you ask me, as usually often-run algorithms in `git` are very well optimized so that even being as fast as `git` can be hard at times.
I should clarify one thing here. My first attempt at doing this yielded results like so:
cesfahani@diesel-tr:~/gitoxide-blobless$ time gix fsck connectivity HEAD > /dev/null
real 0m2.355s
user 0m0.892s
sys 0m1.445s
This is twice as slow as `git`. I then ran this with `flamegraph`, and noticed the large amount of time spent in `consolidate_with_disk_state`, which led me to `Handle::refresh`. I then added a `repo.objects.refresh_never()` at the top of my connectivity check, and that got me to the performance I showed earlier (twice as fast as `git`).
I'm not sure if this is cheating or not... I had expected that `git` wouldn't be doing any sort of auto-refresh from disk during operations that walk a commit graph like this, but maybe I'm wrong? When dealing with promisor packs, refreshing from disk when an object is missing is a performance-killer (as seen here), as that is fully expected to happen regularly.
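The refresh-on-miss behavior that showed up in the profile is essentially a retry loop around the lookup. A sketch of it, with `refresh_never` modeled as a simple flag on a toy handle (not the actual `gix-odb` API, where the refresh policy is configured on the handle):

```rust
use std::collections::HashSet;

/// Toy ODB handle. In the real `gix-odb`, a miss can trigger a re-read of
/// on-disk state (to pick up packs written since the last refresh) unless
/// the handle was configured to never refresh.
struct Handle {
    loaded: HashSet<&'static str>,
    on_disk: HashSet<&'static str>,
    refresh_never: bool,
    refresh_count: u32,
}

impl Handle {
    fn contains(&mut self, id: &'static str) -> bool {
        if self.loaded.contains(id) {
            return true;
        }
        if !self.refresh_never {
            // Expensive in a partial clone: every promised object is a
            // miss, and every miss re-reads disk state.
            self.refresh_count += 1;
            self.loaded.extend(self.on_disk.iter().copied());
            return self.loaded.contains(id);
        }
        false
    }
}

fn main() {
    let mut h = Handle {
        loaded: HashSet::from(["blob1"]),
        on_disk: HashSet::from(["blob1", "blob2"]),
        refresh_never: false,
        refresh_count: 0,
    };
    assert!(h.contains("blob2")); // found, but only after a refresh
    assert_eq!(h.refresh_count, 1);

    h.refresh_never = true;
    assert!(!h.contains("promised-blob")); // miss, no refresh attempted
    assert_eq!(h.refresh_count, 1);
}
```

With many expected misses, the refresh branch dominates, which is why disabling it helped so much here.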
> When looking at the times, I find both very slow anyway, as this is the `gitoxide` repository which is a speck of dust compared to what I'd look at when comparing the performance. My recommendation is to also try consecutively larger repos and see if the performance benefit remains.
I also did some tests using one of the larger repositories my company has. It's proprietary, but I can assure you it is quite big. The bare repository consumes 11.5GB and includes well over 100K commits. Here's what things look like with that:
cesfahani@diesel-tr:~/proprietary-repo$ time gix fsck connectivity HEAD | wc -l
241162
real 0m6.702s
user 0m6.151s
sys 0m1.377s
cesfahani@diesel-tr:~/proprietary-repo$ time git rev-list --objects --quiet --missing=print HEAD | wc -l
241162
real 0m7.187s
user 0m2.940s
sys 0m4.133s
In that same repository, commenting out my call to `repo.objects.refresh_never()` makes things quite a bit slower:
cesfahani@diesel-tr:~/proprietary-repo$ time gix-r fsck connectivity HEAD | wc -l
241162
real 0m16.402s
user 0m8.420s
sys 0m9.136s
I'd love any insight you may have on the conditions in which `git` refreshes packs & indexes from disk during a single operation, but otherwise I'm happy to dig into it.
> However, in this case I think what they do is reasonable, as they are assuming that if an object doesn't exist, it's a promised object which during normal operation should be the case. And if one doesn't expect that for whichever reason, it would be trivial to change the configuration of the object handle before potentially hitting a lot of nonexistent objects.
The more I experiment with this, the more I agree. As you said, controlling this via configuration would be ideal. I can imagine a different git implementation (`gitoxide`?) where users can disable automatically contacting the remote for certain subcommands, and instead require them to opt in to such behavior. This would also be helpful for incrementally implementing optimized algorithms that are promisor-aware.
Take, for example, `git blame`. Right now, if you run this on a blobless repository, it will go and (painfully) fetch each blob in the tree's history, one at a time. Each will end up with its own pack file, which will typically cause gc thresholds to be exceeded, so you'll get background gc happening in the midst of all this chaos.
Instead of `git blame foo.c`, if you were to run the following, then you'd end up with the behavior you really intended (a single fetch of all missing objects as a single pack, and then the `blame`):
git rev-list --objects --quiet --missing=print HEAD -- foo.c | cut -c2- | git fetch --no-write-fetch-head --refetch --stdin origin
git blame foo.c
I can understand some users dealing with smaller repositories being okay with the default behavior, but being able to configure different behavior would be fantastic IMO. Hopefully all algorithms can be made to be promisor aware, in which case this is a non-issue.
> Updating promisor objects lazily is the way to go, and it should be possible to integrate it well with everything else. But it's going to be tricky as the object database is a lawless zone where borrowck has no power

My core background is mostly C++, so the borrow checker is a new-found luxury to me. I'm more than happy to have a go at doing it properly first. If that proves to be too complicated, then I can fall back to a per-thread (`Handle`) set of promisor objects.
> I seem to be missing a plan to integrate `gix fsck connectivity` - it's OK if it's a bit hacky even as `gix` is a developer tool.

Happy to hear that! Let me go ahead and polish up what I have (e.g. make it not panic upon any sort of error condition) and then I'll get an MR up.
> For perfect laziness one would be able to tell the underlying shared ODB to find the given object in one of the promisor packs, whose 'promised' objects will be loaded on demand until the object was found.
Yesssss - I like this very much!
> Somehow I feel that on-demand loading of single objects shouldn't even be done, but instead one should make sure that each algorithm can deal with promisor objects.
I 100% agree here, and my immediate use-cases have an anti-goal for that sort of lazy & inefficient fetching of one object at a time. I'm more than happy to remove that from the task list.
As described in my `git blame` example above, if a user attempts to perform an operation that is going to end up slowly and inefficiently fetching single-object packs (because the algorithm isn't promisor-aware), I'd prefer them to just be told that at least one object is missing (assuming they have a promisor remote configured), and possibly be given a hint as to how to remediate the situation. This would, of course, be a stop-gap until various algorithms can be adjusted to be aware of the promisor scenario, but I think that's perfectly fine.
> With all that said, please feel free to digest this into a potentially new (or changed) list of things you'd want to implement.
I'll update the task list in the original issue description now, but I'm also looking forward to continuing to iterate on ideas!
Thanks again @Byron. I fully understand how much effort it takes to ramp up a new (prospective) contributor to a software project, and I'll do my best to make sure you haven't wasted your time!
> I'm not sure if this is cheating or not... I had expected that git wouldn't be doing any sort of auto-refresh from disk during operations that walk a commit graph like this, but maybe I'm wrong?

There is no cheating, and turning off automatic refresh on miss seems like the right thing to do. I'd expect `git` to do the same.
Looking at the distribution of `user` and `sys` times, I notice that `gix` spends too much time in userland compared to `git`, and `git` spends suspiciously large amounts in `sys` land. There might be another opportunity for optimization there, who knows. It's hard to say without seeing the code, and I am looking forward to that PR.
> I'd love any insight you may have on the conditions in which git refreshes packs & indexes from disk during a single operation, but otherwise I'm happy to dig into it.
I don't think there is any need if it seems to work and is faster, it's a win I am happy to (tentatively) take :D.
> Hopefully all algorithms can be made to be promisor aware, in which case this is a non-issue.
I think `gitoxide` is in a great position to do it right™️ from the beginning, and for many commands it shouldn't be too hard if they have a way to get their working set of objects beforehand. However, and while blobless repos seem easiest, treeless repos appear to pose their very own problem that I'd probably simply ignore. If trees are missing and you have to deal with `path/to/foo.rs`, there is no other way but to painfully retrieve one tree at a time (`root`, `path`, `to`, finally the blob `foo.rs`). Maybe this works by getting the entire tree recursively though, so a blame that can deal with treeless and blobless repos would 'simply' have to make two passes: the first gets all the trees, the second gets the blobs it actually needs. Somewhat involved, but doable, even though I always see it as something implemented on top. Maybe that 'on top' will end up being a set of utilities with common algorithms that can be mixed and matched, let's see.
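The tree-fetching pass above can be sketched with toy types (nothing here mirrors real gitoxide APIs; `Store`, `Object`, and string oids are all stand-ins, and the `remote` map plays the role of a batched promisor-remote fetch). The point is that batching per depth level means one round-trip per path component rather than one per tree; a second pass, not shown, would then fetch only the blobs actually needed:

```rust
use std::collections::HashMap;

#[derive(Clone)]
enum Object {
    /// children as (oid, is_tree) — real tree entries carry a mode,
    /// so the walker knows a child's kind without fetching it
    Tree(Vec<(String, bool)>),
    Blob,
}

struct Store {
    local: HashMap<String, Object>,
    /// stand-in for what the promisor remote would return on a batched fetch
    remote: HashMap<String, Object>,
}

impl Store {
    /// Pass 1 of the two-pass idea: fetch all missing trees reachable from
    /// `root`, one batched request per tree depth. Returns the number of
    /// fetch rounds needed; panics if the remote lacks a referenced tree.
    fn fetch_missing_trees(&mut self, root: &str) -> usize {
        let mut rounds = 0;
        let mut frontier = vec![root.to_string()];
        while !frontier.is_empty() {
            let missing: Vec<String> = frontier
                .iter()
                .filter(|oid| !self.local.contains_key(*oid))
                .cloned()
                .collect();
            if !missing.is_empty() {
                rounds += 1; // one batched round-trip for this depth level
                for oid in missing {
                    let obj = self.remote[&oid].clone();
                    self.local.insert(oid, obj);
                }
            }
            let mut next = Vec::new();
            for oid in &frontier {
                if let Object::Tree(children) = &self.local[oid] {
                    for (child, is_tree) in children {
                        if *is_tree {
                            next.push(child.clone()); // blobs wait for pass 2
                        }
                    }
                }
            }
            frontier = next;
        }
        rounds
    }
}
```

For the `path/to/foo.rs` example this costs three rounds (`root`, `path`, `to`) even with batching, which is why treeless repos remain the painful case.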
If that proves to be too complicated, then I can fall back to a per-thread (`Handle`) set of promisor objects.
I just repeat this here to say that I think we share the understanding that the ODB won't ever auto-fetch promised objects, but that there may be an API (not necessarily on the ODB though) to fetch a set of objects and deal with promisor-related things as needed.
That way, it will only work with promisor-aware algorithms, but that's fine, while I'd expect 'the machinery' to upgrade all `object not found` errors to also include a hint in case there are promisor packs present. Again, this is based on the assumption that fetching objects one at a time is a non-starter anyway, while seriously complicating the already complex lock-free ODB implementation we have today.
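What such an upgraded error might look like can be sketched as follows (hypothetical types, not gitoxide API; the real error machinery lives elsewhere, this only illustrates the hint):

```rust
use std::collections::HashMap;

/// Hypothetical lookup result: the ODB never auto-fetches on a miss, but a
/// miss carries enough context to render a promisor-related hint.
#[derive(Debug, PartialEq)]
enum Lookup {
    Found(Vec<u8>),
    NotFound { oid: String, maybe_promised: bool },
}

fn lookup(odb: &HashMap<String, Vec<u8>>, oid: &str, has_promisor_packs: bool) -> Lookup {
    match odb.get(oid) {
        Some(data) => Lookup::Found(data.clone()),
        None => Lookup::NotFound {
            oid: oid.to_string(),
            // the hint only claims the object *may* be promised; we assume
            // every referenced-but-missing object is, as discussed above
            maybe_promised: has_promisor_packs,
        },
    }
}

fn describe(result: &Lookup) -> String {
    match result {
        Lookup::Found(_) => "ok".to_string(),
        Lookup::NotFound { oid, maybe_promised: true } => {
            format!("object {oid} not found; it may be available from a promisor remote")
        }
        Lookup::NotFound { oid, maybe_promised: false } => {
            format!("object {oid} not found")
        }
    }
}
```

The hint costs nothing on the lookup hot path since the promisor-pack check can be done once at repository-open time.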
I'll update the task list in the original issue description now, but I'm also looking forward to continuing to iterate on ideas!
I took a look (and updated the formatting a little) and would like to point out that I don't think any time should be spent on obtaining a set of promised objects, as it seems sufficient to assume every missing (but referenced) object is promised. I further edited the task list by striking through what I think should change, with annotations, and kindly ask you to acknowledge it by removing these edited portions or substituting them with something that fits better. It's really just an attempt to collaborate on the task list without me repeating/quoting everything here. If it doesn't work, please let me know; otherwise, great :).
Thanks again @Byron. I fully understand how much effort it takes to ramp up a new (prospective) contributor to a software project, and I'll do my best to make sure you haven't wasted your time!
That is great to hear. You are dedicated to the feature, and I hope that you (and the company backing you) will be able to support partial clones from here on out :). On the bright side, even if that doesn't happen (nobody can predict the future), I hope we get to the point where we can:
With the building blocks above it will be possible for each algorithm to eventually pre-fetch objects as needed, and until then, to fail relatively gracefully with 'object '
From there, in piecemeal fashion, one can add promisor support wherever it is needed.
Edit: I can imagine that we'd want to add a function to `gix` to collect statistics about promised objects, essentially doing a connectivity check while listing all missing objects by type and count (i.e. 42 trees, 500 blobs), just for development and informational purposes. I don't think that this would be used for anything except, maybe, also adding a way to fetch all missing trees and blobs on the command line. That would then allow users to make any algorithm work after it failed due to missing objects, at the cost of possibly fetching too much. Such a tool could even provide hints at how big that pack would be when fetched.
@Byron Thanks for the (extensive) help on getting #1081 merged!
I've started work on getting a minimal blob-filtered clone going - it was surprisingly not too difficult to get a (rough) first-pass functioning as expected. `gix-protocol` already supported this at the low level, so the only changes I've needed so far are to the `gitoxide-core`, `gix`, and `gitoxide` crates.
My WIP branch is available here, as an FYI. There is still much to do on my end, and it isn't yet ready for review.
what is missing @cesfahani ? this would most likely be the feature with the most value to speed up git clones in cargo, see https://github.com/rust-lang/cargo/issues/11165.
@soloturn I have this all working in a fork, with some caveats:
The fork I've been most recently working out of is hosted in my company's private git service, but there's no reason I can't push it up to my Github fork. I've been meaning to do that - just haven't gotten around to it yet. I'm happy to get back into this now.
I have a couple PR's to make for simple bug fixes, then I'll need some input on how best to expose the API to allow an explicit set of objects to be fetched (without attempting any negotiation rounds, etc). The way I did this in my fork was a total hack (duplicated a bunch of code), but I really just wanted to get something working.
Expect to see some progress from me this week!
not sure if i understand correctly. the most practical flag, at least for rust's cargo, is `--filter=tree:0`, as it would make it easier to check out different versions. is this supported?
Summary 💡
Tracking a minimal implementation of partial clones and their corresponding promisor objects/remotes, as discussed in #1041.
What is a partial clone?
Partial clone is a recent(ish) feature of `git` that allows the client to fetch a subset of objects from a remote repository based on criteria (i.e. a "filter") of its choosing. The missing objects are referred to as "promisor objects", and are expected to be provided by "promisor remotes" on demand after the clone, as needed.
The most common use-cases of partial clone are where the client requests a clone with either no historical blobs (e.g. `--filter=blob:none`), or only historical blobs under some size threshold (e.g. `--filter=blob:limit=512k`). Tree objects can also be filtered by a partial clone, however that use-case is far less common.
Lessons learned from `git`
Because partial clone was retrofitted into `git`, there are several performance gaps that have not yet been resolved. Operations like `fetch` and `checkout` behave exactly as one would expect: the missing objects are fetched in a single transaction with the remote. Other operations, such as `blame` and `rebase`, do not do this, and instead end up lazily fetching missing objects one at a time (each with a separate transaction to the remote), which significantly slows things down.
To implement partial clones efficiently, operations that traverse history and require inspecting the contents of blobs and trees need to determine their working set of missing objects up front and fetch them from the promisor remote in a single batched transaction.
That said, this feature does not aim to implement the optimized approach to partial clones across the board. However, we would like to see APIs designed to facilitate the optimized approach, and possibly one implementation of it to serve as a reference and proof that things can be made to work as expected.
Tasks
- Implement a connectivity check (e.g. `gix fsck connectivity`)
- Set `remote.<name>.partialclonefilter`, and set `remote.<name>.promisor` to `true`, if a partial clone filter was provided
- Expose partial-clone filters in the `gix` CLI the way `git` does, verifying that `partialclonefilter` and `promisor` are set appropriately
- Honor the `promisor` and `partialclonefilter` config settings on the remotes
- Mark packs fetched from a promisor remote as `promisor` packs