Byron opened this issue 3 years ago
In pursuit of better control over pack generation, and also to pave the way for improved async integration, I figured having an `Iterator` interface would be a good idea.
Now it's possible to step through parallel computations.
However, the respective implementation has to expose `unsafe` due to the use of a scoped thread, which exposes its join handle that can thus be leaked. Depending on where this is exposed, `unsafe` might bubble up even further: after all, anything that holds the `SteppedReduce` can also leak it.

My intuition is to stop bubbling this up beyond `git-features`, just to keep it practical, even though technically it's incorrect. What do you think, @joshtriplett?
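To illustrate the soundness concern in a minimal, self-contained way (names and shapes here are invented for illustration, not gitoxide's actual API): a guard-style scoped-thread API hands the caller a leakable join handle, which is exactly what forces `unsafe`, while a closure-based scope like std's `thread::scope` (stabilized later, in Rust 1.63) joins unconditionally before returning, so nothing leakable escapes:

```rust
use std::thread;

// Sketch: with a closure-based scope, the spawned thread may borrow stack
// data because the scope joins all threads before returning. A guard-style
// API would instead return a handle that `mem::forget` could leak, letting
// borrows of stack data outlive the frame - hence the `unsafe`.
fn summed_in_scope(data: &[u32]) -> u32 {
    let mut sum = 0;
    thread::scope(|s| {
        // the spawned thread borrows `data` and `sum` from this stack frame
        s.spawn(|| sum = data.iter().sum());
    }); // all threads are joined here, unconditionally: nothing to leak
    sum
}

fn main() {
    assert_eq!(summed_in_scope(&[1, 2, 3]), 6);
}
```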
It's a CPU-intensive operation; my first instinct would be to run it normally and use `blocking`/`unblock` or similar to run it on a blocking thread.
Trying to structure the computation so that it happens incrementally seems incredibly painful. And in particular, trying to adapt an operation that happens in a thread to happen incrementally seems like it's incurring all the pain of async without any language support for async.
I would suggest building the initial MVP in a synchronous fashion, on the theory that it can still be run in a background thread and controlled via an async mechanism.
I definitely don't think it's OK to use a scoped thread and hide the unsafe, if the unsafe isn't truly encapsulated to the point that you can't do anything unsound with the interface.
One other thought on that front: compared to the cost of generating a pack, one or two allocations to set up things like channels or Arc will not be an issue.
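The suggested shape could look roughly like this sketch (all names hypothetical, using only std): run the computation synchronously on its own thread, stream results through a bounded channel whose capacity provides backpressure, and expose the receiver as an ordinary `Iterator` that an async caller could drive via `blocking`/`spawn_blocking`:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for one unit of pack-generation work.
fn generate_pack_entry(id: u32) -> u32 {
    id.wrapping_mul(2) // placeholder for CPU-intensive compression
}

// Run the whole computation synchronously on a background thread and
// stream results through a bounded channel; the channel's capacity
// gives natural backpressure.
fn run_in_background(count: u32) -> impl Iterator<Item = u32> {
    let (tx, rx) = mpsc::sync_channel(16);
    thread::spawn(move || {
        for id in 0..count {
            if tx.send(generate_pack_entry(id)).is_err() {
                break; // receiver dropped: stop early
            }
        }
    });
    rx.into_iter()
}

fn main() {
    // here we simply drain the iterator synchronously
    let sum: u32 = run_in_background(100).sum();
    assert_eq!(sum, 9900);
}
```

The one or two allocations for the channel are, as noted above, negligible next to the cost of generating a pack.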
Thanks for sharing. The main motivator for using scoped threads is to allow standard stack-based operation without any wrapping - it's not at all about allocations, merely about usability and the least surprising behaviour. Truth be told, I cannot currently imagine how traversal will play into `'static` threads when `Arc`s are involved, especially along with traits representing an Object (an attempt to allow things like traversing directory trees).
What I take away is the following:

* no `unsafe` unless it's encapsulated
* let's not make step-wise computation or extreme async friendliness a requirement for the MVP if something like `blocking` would work, too.

I hope to overcome my 'writer's block' and just write the missing bits to be able to see through the whole operation and play with the parts more until I find a version of the API that feels right.
The `unsafe` is now legitimately gone due to the usage of standard `'static` threads. I could assure myself that the mechanism still works even with `Arc`s involved, despite being a little more difficult to use at the call site. Callers will need to prepare a little more to start the procedure, which is probably acceptable given how long it runs and how 'important' it is.
> let's not make step-wise computation or extreme async friendliness a requirement for the MVP if something like `blocking` would work, too.
This capability probably doesn't have to be removed just yet, as the machinery itself is exactly the same as the one already used in `in_parallel()`, except that now there is more control at the call site. This comes at the cost of having to deal with `Arc` for the object database, and of course the API now has yet another way to be called. Those who don't need fine-grained control will not get the best experience that way.
However, it's possible to eventually provide a non-static variant of pack generation too, which works similarly to pack verification (it uses the non-static version of the machinery), by factoring out the parts that are similar.
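The `Arc` cost at the call site mentioned above can be sketched as follows (the `ObjectDb` type and function names are illustrative only, not gitoxide's actual types): with `'static` worker threads, the object database can no longer be borrowed from the stack, so callers prepare an `Arc` up front and hand each worker a clone:

```rust
use std::sync::Arc;
use std::thread;

// Illustrative stand-in for the object database that 'static worker
// threads must now share via `Arc` instead of borrowing.
struct ObjectDb {
    kinds: Vec<&'static str>,
}

fn total_kind_len(db: Arc<ObjectDb>) -> usize {
    let handles: Vec<_> = (0..db.kinds.len())
        .map(|i| {
            // one cheap clone per worker - negligible next to pack-generation cost
            let db = Arc::clone(&db);
            thread::spawn(move || db.kinds[i].len())
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let db = Arc::new(ObjectDb {
        kinds: vec!["commit", "tree", "blob"],
    });
    assert_eq!(total_kind_len(db), 14); // "commit" + "tree" + "blob"
}
```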
Another argument for trying hard to make pack generation play well in an async context is certainly that it commonly happens as part of network interactions, like uploading a pack. Right now much of this is a little hypothetical, as actual code to prove it works nicely doesn't exist yet, but I am confident it will work as envisioned.
Finally, since both machineries, static and non-static, are the same at their core, it should always be possible to return to the non-static one at very low cost should everything else fail.
On another note: I am also thinking about backpressure and back-communication. Backpressure is already present, as threads will block once the results channel is full. Back-communication should also be possible if the handed-in closures get access to another synchronized channel of sorts that tells them when to deliver the pack entries they have been working on. Such an algorithm would work continuously (probably until it can't meaningfully improve the deltas) until it is told to deliver what's there right now before continuing. Such a message could then be delivered the moment somebody actually calls `.next()` on the iterator, which in turn will be based on how fast data can be written to the output (sync or async).
Even though the MVP will not do back-communication, I don't see why it shouldn't be possible to implement it. What's neat is that no matter how the machinery operates, the moment the iterator is dropped it will stop working automatically (potentially with some delay).
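A toy sketch of this back-communication scheme, with invented names and channel layout: the consumer asks for the next entry on every `.next()`-like step over a rendezvous channel, the producer keeps refining its work until asked to deliver, and dropping the request side stops the producer automatically:

```rust
use std::sync::mpsc::{sync_channel, TryRecvError};
use std::thread;

// Hypothetical sketch: the consumer requests delivery per step; the
// producer otherwise keeps 'improving deltas' (here: incrementing a number).
fn stepped_results(steps: usize) -> Vec<u64> {
    let (ask_tx, ask_rx) = sync_channel::<()>(0); // delivery requests (rendezvous)
    let (out_tx, out_rx) = sync_channel::<u64>(1); // delivered entries

    let worker = thread::spawn(move || {
        let mut quality = 0u64;
        loop {
            quality += 1; // stand-in for continuously improving deltas
            match ask_rx.try_recv() {
                Ok(()) => {
                    if out_tx.send(quality).is_err() {
                        break; // consumer gone
                    }
                }
                Err(TryRecvError::Empty) => continue, // keep improving
                Err(TryRecvError::Disconnected) => break, // 'iterator' dropped
            }
        }
    });

    let mut results = Vec::with_capacity(steps);
    for _ in 0..steps {
        ask_tx.send(()).expect("worker alive");
        results.push(out_rx.recv().expect("worker delivers"));
    }
    drop(ask_tx); // dropping the request side stops the worker automatically
    worker.join().expect("worker shut down cleanly");
    results
}

fn main() {
    let results = stepped_results(3);
    assert_eq!(results.len(), 3);
    // each delivered entry had more time to improve than the one before
    assert!(results.windows(2).all(|w| w[0] < w[1]));
}
```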
The first break-through: pack files (base object only) can now be written from object ids.
@pascalkuthe I have created a quick profile from running `cargo build --release --no-default-features --features max,cache-efficiency-debug --bin gix && /usr/bin/time -lp ./target/release/gix -v free pack create -r ../../../git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux --statistics --thin -e tree-traversal --pack-cache-size-mb 200 --object-cache-size-mb 100 HEAD` (single-threaded), and here is the result:
My takeaways are as follows:

* `memchr` seems particularly hot. It's all about finding a null-byte here, and I wonder if this can be any faster, though.
* `hashbrown::set::HashSet::insert()`
This is just a quick summary, and right now I am missing a dataset comparing `git` with `gix` across various repos of different sizes to understand the size of the performance gap in single-threaded mode. From there it might be possible to figure out what to focus on.
While at it, profiling `git` might also be useful, which (I think) I did in the past as well. Unfortunately my memory (as well as my notes about this here) is spotty.
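For context on the `memchr` hotspot: object headers have the form `"type size\0"`, so finding that NUL sits on the hot path. A minimal std-only sketch of the lookup (the real code relies on the SIMD-accelerated `memchr` crate for the same result):

```rust
// Find the NUL terminating a textual object header like "blob 1234\0...".
// `Iterator::position` is the plain-Rust equivalent of `memchr::memchr(0, data)`.
fn find_null(data: &[u8]) -> Option<usize> {
    data.iter().position(|&b| b == 0)
}

fn main() {
    let header = b"blob 1234\0<object data>";
    assert_eq!(find_null(header), Some(9));
    // everything before the NUL is the textual header
    assert_eq!(&header[..9], b"blob 1234");
}
```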
### gen-pack like plumbing command

Generate a pack using some selection of commits or possibly objects. Drives different kinds of iteration as well as ways of building a pack.

#### Progress

* `gixp pack-create` - Easy\*
* `hash_hasher` for single-threaded counting, to see how it performs compared to the standard hash-set with hasher override.
* `prodash` will cause slowdown.
* `ignore_replacements`. It's correct to ignore replacements during reachability traversal, i.e. `pack` / `pack-index-from-data`.
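The hasher-override idea can be sketched as follows (the types here are illustrative, not gitoxide's actual ones): object ids are already cryptographic hashes, so re-hashing them inside the set is wasted work, and a pass-through hasher in the spirit of `hash_hasher` just reinterprets part of the id as the hash value:

```rust
use std::collections::HashSet;
use std::hash::{BuildHasherDefault, Hasher};

// Pass-through hasher: keep the bytes of the most recent `write` call.
// For `[u8; 20]` keys the final call carries the id bytes themselves,
// which are already uniformly distributed.
#[derive(Default)]
struct PassThrough(u64);

impl Hasher for PassThrough {
    fn write(&mut self, bytes: &[u8]) {
        let mut buf = [0u8; 8];
        let n = bytes.len().min(8);
        buf[..n].copy_from_slice(&bytes[..n]);
        self.0 = u64::from_ne_bytes(buf);
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

// Illustrative alias: a set of SHA-1-sized object ids with the override.
type OidSet = HashSet<[u8; 20], BuildHasherDefault<PassThrough>>;

fn main() {
    let mut seen: OidSet = OidSet::default();
    let oid = [0xab_u8; 20]; // stand-in for a SHA-1 object id
    assert!(seen.insert(oid));
    assert!(!seen.insert(oid)); // duplicates are still detected via Eq
}
```

Equality is still checked byte-for-byte by the set, so correctness does not depend on the hasher; only collision behaviour and speed do.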
#### Command-lines
* create a full pack fast, like `clone`
* create a partial pack for fetches
* create a pack with `git`
#### Out of scope

* reuse existing deltas - doing so would allow saving a lot of time and avoid the need to implement actual delta compression. One would only have to get the order of entries right to assure consistency. Without that it's not really usable.
* `gixp` subcommand to make the functionality available for stress testing and performance testing

\*Machinery to produce object deltas.

#### User Stories
Interesting
Archive
### Progress

* [x] stream pack entries (base objects only) from an iterator of input objects
* [x] write pack entries to a V2 pack
* [x] **clone-like** - the other side has no objects
  * write all objects based on unique objects visible in trees using tree-
* [x] **fetch-like** - the other side has some objects
  * write only changed objects based on another party which has some of our objects, but not all, employing tree-diff capabilities.
* [x] Handle tag objects - their pointee must be added to the set as well.
* [x] Make it so that the total amount of objects to be written is known in advance
  * if the header of the pack wouldn't contain the amount of objects or it would just be bogus, this wouldn't be necessary and allow for a much faster start of the operation. Does git allow invalid counts?
* Statistics about re-used objects and repacked ones, maybe more. This is critical to eventually avoid repacking most.
  * [x] for counts (note about 4x more decoded objects of diffs compared to traversals)
  * [x] for entries
* [x] pack-create program writes properly named pack files and outputs statistics with `--statistics`
* [x] restore chunk-ordering to allow multi-threaded generators to yield the same pack all the time (chunks may be out of order)
* [x] **figure out why the pack hash in parallel mode is still not reproducible**
* ~~write object entries directly, without forcing to copy their compressed data into an `output::Entry`~~
  * Let's keep copying it as it simplifies the writer and avoids it having to do pack lookups on a single thread. This also means that more memory is used while chunks of objects are in flight and that more memory allocation is done. Probably worth it.
* [x] re-use delta objects as well and write their offsets correctly.
* **`gixp pack-create`**
  * [x] figure out why git with one thread is faster when enumerating objects than us with 4 :D (with the Rust repo in particular)
    * `dashmap::insert()` costs 41% of the runtime apparently.
    * A single-threaded counter might be beneficial for using non-threadsafe types.
    * Using a much bigger pack cache helps a lot with 60s lower counting time, making us just 20s slower than git itself.
    * pack-locations are looked up in single-threaded mode, but that's fast and not doing so doesn't significantly speed things up
    * ⚠️ Note that thus far we were unable to improve counting performance
* [x] can't pack the Rust repository as it can't find an object that it apparently traverses and that does not exist (221483ebaf45df5c956adffee2d4024e7c3b96b8, 6c4f4e1990b76be8a07bde1956d2e3452fd55ee4, 7bda1161a37ff51f254ff0a7862abe6dc54fdb36) (it's many objects actually with non-deterministic counting)
* [x] find a way to not have to collect input objects beforehand, it's something about error handling and types here rather than technical requirements.
* [x] a list of objects from stdin (like created by `git rev-list`) (AsIs traversal is default here)
* [x] a set of tips for internal rev-list traversal of the commit graph with settings for traversal or diff based expansion, with AsIs being an option too
* [x] tui support
* [x] A way to place and control object data caches for 4x and more speedups of tree-diffs
* [x] https://github.com/servo/uluru/pull/22 - when landed, can bring back 8d499762b74c08437d901bb98806e0a1fc6f93bb for reduction of allocation pressure.
* [x] #167
* [x] #170

#### Performance Opportunities

* Need to decompress entries to find their length - ~~it would save a lot of time if we would know where the next entry begins.~~
  * The index has that information but only if we build a sorted vec of all offsets. How would that fit in? Maybe a high-performance trait with more complex call signatures to allow building caches?
  * No, this doesn't make a difference at all interestingly, maybe 1.5s of 48 at most.
Implemented in ad6d00701d28befda006f41f85bbbb6fc3508e1e https://github.com/Byron/gitoxide/blob/34b6a2e94949b24bf0bbaeb169b4baa0fa45c965/git-odb/src/linked/find.rs#L52

### Notes

#### packs in general

* [on demand loading of packs](https://github.com/git/git/blob/master/object-store.h#L165:L165). Same story for **[libgit2](https://github.com/libgit2/libgit2/blob/34b9a04c77235dfc9401cb7ef8efc6c4ee7e99dd/src/odb_pack.c#L120:L124)**
* alternates should be expanded to a list of object repositories that don't know alternates. Related to #66.
* As they keep the amount of stored objects in the header, immediate streaming is limited by knowing that in time. https://github.com/Byron/gitoxide/blob/ad04ad3b8ac54e78bee307dd44c85c1389edced2/git-odb/src/pack/data/init.rs#L32-L33. Streaming can only start once all objects to send have been discovered.

#### Pack generation

* They use a list of objects to handle along with [workstealing among threads](https://github.com/git/git/blob/master/builtin/pack-objects.c#L2763:L2779).
* ~~These threads regularly pause their work to allow the stealing to happen safely, which might be why that doesn't scale?~~ (actually that one does seem to scale at least with the amount of cores I have, it's pack resolution that doesn't scale)
* probably a good idea to not try to find deltas for 'big' objects and make the threshold configurable.