GitoxideLabs / gitoxide

An idiomatic, lean, fast & safe pure Rust implementation of Git
Apache License 2.0

[gixp pack-receive] The first proper fetch to a bare repository #104

Open Byron opened 3 years ago

Byron commented 3 years ago

Do what's needed to fetch as well as git does (on a bare repository, one without a working tree). This particularly includes proper ref handling as well as safety in the light of concurrent repository access.

Tasks

Archive

### Research

#### Reflog Handling

* entirely disabled in bare repos
* forward iterators could be bstr::lines()
* reverse-iterators could be bstr::SplitReverse with a VecDeque for refilling a read buffer from the end of a file with seeks.
* line parsing is [here](https://github.com/git/git/blob/master/refs/files-backend.c#L1892:L1919)
* expiry is done by rewriting the entire file based on a filter, writing is literally [here](https://github.com/git/git/blob/master/refs/files-backend.c#L3028:L3028)

#### Refs Writing

* You can turn a symbolic ref into a peeled one (i.e. detach a HEAD) with transactions but you cannot turn it back into a symbolic one with that. All that happens directly and outside of transactions.
* Writing symbolic references like HEAD [splits the ref update](https://github.com/git/git/blob/seen/refs/files-backend.c#L2263:L2263) transparently and across any amount of refs.
* You cannot [delete ref logs](https://github.com/git/git/blob/seen/refs/files-backend.c#L2817:L2817) using `REF_LOG_ONLY` but they are deleted with the owning reference.
* [ref transactions](https://github.com/git/git/blob/master/refs/refs-internal.h#L197:L197)
  * there is a transaction hook which gets all transaction data without flags, that is old and new oid and refname, along with the 'action' indicating what happened to the transaction.
  * probably it should be possible to introspect transactions as they are executing, but theoretically this can also happen outside of the method itself.
* [git file lock](https://github.com/git/git/blob/master/lockfile.c#L73:L86)
  * it looks like they are creating a tempfile with a specified name for locks (exclusive and all using atomic FS ops) which can then potentially be written in the same moment. Definitely good for loose refs that don't exist.
* loose refs writing intricately knows [packed refs](https://github.com/git/git/blob/master/refs/files-backend.c#L711:L711), which makes sense in order to keep them consistent.

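The reflog line format referred to under Reflog Handling can be illustrated with a small parser. This is only a sketch, not gitoxide's implementation: it assumes SHA-1 (40 hex digits) and UTF-8 input, whereas real code would likely work on bytes (e.g. via `bstr`). The names `ReflogLine` and `parse_reflog_line` are hypothetical.

```rust
/// One parsed reflog entry, borrowing from the input line (hypothetical type).
#[derive(Debug, PartialEq)]
struct ReflogLine<'a> {
    old_oid: &'a str,
    new_oid: &'a str,
    /// Everything between the new oid and the tab: `Name <email> <seconds> <tz>`.
    signature: &'a str,
    /// The free-form message after the tab; empty if there is none.
    message: &'a str,
}

/// Parse `<old-oid> SP <new-oid> SP <signature> TAB <message>`.
/// Returns `None` for malformed lines. Assumes SHA-1 object ids.
fn parse_reflog_line(line: &str) -> Option<ReflogLine<'_>> {
    let line = line.strip_suffix('\n').unwrap_or(line);
    // Without a tab, the whole line is the header and the message is empty.
    let (header, message) = match line.split_once('\t') {
        Some((header, message)) => (header, message),
        None => (line, ""),
    };
    let mut parts = header.splitn(3, ' ');
    let old_oid = parts.next()?;
    let new_oid = parts.next()?;
    let signature = parts.next()?;
    let is_hex_oid = |s: &str| s.len() == 40 && s.bytes().all(|b| b.is_ascii_hexdigit());
    if !is_hex_oid(old_oid) || !is_hex_oid(new_oid) {
        return None;
    }
    Some(ReflogLine { old_oid, new_oid, signature, message })
}
```

Forward iteration over a whole log then becomes `text.lines().filter_map(parse_reflog_line)`; reverse iteration would feed the same parser with lines read from the end of the file.
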
#### File Locking

* investigate [tempfile](https://docs.rs/tempfile/3.2.0) to conclude that it's certainly great as reference but won't be exactly what git does. Let's see if it's needed after all to do it exactly like that. Git definitely sets up signal handlers to delete tempfiles so probably these will have to be threadsafe or interned objects.
* If directories are involved, use [raceproof file creation](https://github.com/git/git/blob/master/object-file.c#L417:L417)
* [lockfile.c](https://github.com/git/git/blob/master/lockfile.c#L1:L1) holds the entire blocking implementation, including backoff. Looks like that's `git-lock`.

#### Reflogs

* The file is read line by line and entries are handled on the fly using iterators, easiest to use bstr::lines() there.
* reverse iterators use a buffer of 1024 bytes to seek lines backwards
* parsing is [here](https://github.com/git/git/blob/master/refs/files-backend.c#L1892:L1919)
* for expiry the file is rewritten based on iteration
* for new reflogs, these are appended (only)

#### Refs Writing

* [git file lock](https://github.com/git/git/blob/master/lockfile.c#L73:L86)
* `cargo` uses [flock](https://github.com/rust-lang/cargo/blob/master/src/cargo/util/flock.rs#L384:L392) for comparison with different semantics.
* [fslock](https://docs.rs/fslock/0.1.6/fslock/) seems a bit newer and has a few tests
* [fs2](https://github.com/danburkert/fs2-rs) does not compile anymore and seems unmaintained for years now. Can do more than we need, too.
* [file-lock](https://crates.io/crates/file-lock) is posix only but uses fcntl under the hood.

#### Signal-Hook

* The use of mutexes is unsafe as the current thread might be interrupted while holding the mutex. When trying to obtain a lock in the handler the thread will inevitably deadlock.
* Memory allocation and deallocation is not allowed! So inside a handler we have to do what we do and call `std::mem::forget` to implement it correctly.

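The git-style lock file described under File Locking (an exclusively created `<name>.lock` that is later renamed over the target) might look roughly like the following with only the standard library. This is a sketch under assumptions, not the `git-lock` API: `LockFile`, `acquire` and `commit` are hypothetical names, the backoff has no randomization, and signal-handler cleanup is left out entirely.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};
use std::thread;
use std::time::Duration;

/// Holds `<resource>.lock`; the lock file is removed on drop unless committed.
struct LockFile {
    file: Option<File>,
    lock_path: PathBuf,
    resource_path: PathBuf,
    committed: bool,
}

impl LockFile {
    /// Create `<resource>.lock` exclusively, retrying with exponential backoff.
    fn acquire(resource: &Path, max_attempts: u32) -> io::Result<LockFile> {
        let mut lock_name = resource.as_os_str().to_owned();
        lock_name.push(".lock");
        let lock_path = PathBuf::from(lock_name);
        let mut delay = Duration::from_millis(1);
        let mut attempt = 1;
        loop {
            // `create_new` fails with `AlreadyExists` if someone else holds the lock.
            match OpenOptions::new().write(true).create_new(true).open(&lock_path) {
                Ok(file) => {
                    return Ok(LockFile {
                        file: Some(file),
                        lock_path,
                        resource_path: resource.to_owned(),
                        committed: false,
                    })
                }
                Err(err) if err.kind() == io::ErrorKind::AlreadyExists && attempt < max_attempts => {
                    thread::sleep(delay);
                    delay *= 2; // double the wait before the next attempt
                    attempt += 1;
                }
                Err(err) => return Err(err),
            }
        }
    }

    /// Write new content for the resource into the lock file.
    fn write_all(&mut self, data: &[u8]) -> io::Result<()> {
        match self.file.as_mut() {
            Some(file) => file.write_all(data),
            None => Err(io::Error::new(io::ErrorKind::Other, "lock file already closed")),
        }
    }

    /// Flush, close, and rename the lock file over the resource.
    fn commit(mut self) -> io::Result<()> {
        if let Some(file) = self.file.take() {
            file.sync_all()?; // the file is closed when it goes out of scope here
        }
        fs::rename(&self.lock_path, &self.resource_path)?;
        self.committed = true;
        Ok(())
    }
}

impl Drop for LockFile {
    fn drop(&mut self) {
        // Only clean up a lock we still own; after a commit the path is not ours anymore.
        if !self.committed {
            let _ = fs::remove_file(&self.lock_path);
        }
    }
}
```

A failed `acquire` never constructs a `LockFile`, so it cannot accidentally delete a lock held by another process.
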
### Done Tasks

* **prodash**
  * replace usage of ctrlc that starts yet another thread with the signal-hook iterator to process pending events from time to time as part of the ticker thread. Saves a thread and enables proper handler chaining.
* **git-features**
  * Replace `ctrlc` usage with signal-hook (i.e. current atexit handler for interrupts)
  * don't use stdout in interrupt handler as it does use a mutex under the hood. Instead allow aborting after the second interrupt in case the application is not responding. It would be great to have a lock-free version of stderr though…
  * Integrate 'git-tempfile' behind feature toggle to allow interrupt handlers to be tempfile handler aware and not interfere.
  * replace existing usage of git_features::interrupt::is_interrupted() with versions of it that are local to the method or function.
  * move `git-features::interrupt` into `git-repository` as this kind of utility is for application usage only. There the `git-tempfile` integration makes sense, too.
* **git-tempfile**
  * registered [tempfile] support to allow deletion on exit (and other signals). Use dashmap as storage.
  * Make sure pid is recorded to assure [forking works as expected][tempfile-fork].
  * docs
  * fix windows build
  * a test validating default handlers are installed
  * release
  * race-proof creation of directories leading to the tempfile
    * a way to use the above for actual tempfiles
  * race-proof deletion of empty directories that conflict with the filename
    * a way to use the above for actual tempfiles
  * differentiate between closed and writable tempfiles in the typesystem to make choice permanent
  * a way to not install any handlers so that git-repository interrupt can run the tempfile removal itself right before aborting.
  * Make `with_mut` less cumbersome to use by assuming the interrupt handler will indeed abort.
* **git-lock** - a crate providing [git-style] lock files.
  * lock file for update
  * marker for holding a lock
  * exponential backoff
    * the above with randomization
  * actual retries with blocking sleep
  * test for the above
* **git-refs**
  * sketch transaction type
  * figure out whether or not to 'extend' the API to include changes from Symbolic refs to peeled ones in transactions
  * git signature parsing code is shared and moved to git-actor
  * git-object uses git-actor
  * git-object: unify nom error handling everywhere (to reuse the nom error handling machinery instead of re-inventing it)
  * git-object can use verbose errors and `()` - unit errors per feature toggle.
  * parse ref log line
  * reflog forward iteration
  * reflog backward iteration
  * file reflog writing
  * git-tempfile close (Handler -> Handle)
  * git-lock File close and Marker persist
  * an API to access ref logs for a reference
  * create single symbolic ref without reflog
  * split refs and reusable edit preprocessing
  * delete refs with reflog handling
  * handle parent links for 'old' oid in the log of parent refs
  * handle parent links for error messages of reference names (for lock errors at least)
  * Figure out how to deal with 'previous-value' ambiguity with create-or-update modes.
  * git-lock `commit()` is recoverable
  * commit()'ing onto empty directories can delete the directory in `git-ref`
  * internal reflog writing or appending for locked refs
  * persisting lock file onto an empty directory deletes the empty directory and tries again
  * create or update refs with reflog handling
  * research different mmap implementation but ultimately stick to fast-and-simple `filebuffer`
  * packed-refs iteration - important for being able to read all refs during packfile negotiation
  * iter packed refs from separately loaded buffer
  * iter loose refs with prefix
  * packed-refs lookup with binary search (full-paths)
  * packed-refs lookup with binary search (partial-paths), following lookup rules
  * re-add perf test of sorts, see script to generate big pack file
  * ~6.2mio/s in iteration and 720k/s for lookups/finds using full paths
  * use binary search to find start point for packed prefix iteration
  * iterate all refs (including packed ones)
    * the above, with prefix filtering
  * find_one uses packed-refs if available (use appropriate strategy for [reading in full or mapping])
  * remove and test remaining todos
  * packed-refs writing and integration with transaction (_must be_) - deletions have to be propagated, updates only go to refs (I think, check)
  * #138
  * #139
  * #140
  * #152
  * Make sure broken/invalid loose refs don't break ref iteration and have a way to find them

[reading in full or mapping]: https://github.com/git/git/blob/master/refs/packed-backend.c#L467:L473
[tempfile]: https://docs.rs/tempfile/3.2.0/temp
[tempfile-fork]: https://github.com/git/git/blob/master/tempfile.c#L47:L50

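The "packed-refs lookup with binary search (full-paths)" item can be sketched directly on a byte buffer, as it would be when the file is read in full or memory-mapped. This is an illustration under assumptions rather than the `git-ref` code: it expects a well-formed buffer sorted bytewise by ref name with `\n` line endings, and `find_packed_ref` and its helpers are hypothetical names.

```rust
use std::cmp::Ordering;

/// Walk back from `pos` to the start of the record containing it.
/// Peeled lines (`^<oid>`) belong to the record on the preceding line.
fn record_start(buf: &[u8], mut pos: usize) -> usize {
    while pos > 0 && (buf[pos - 1] != b'\n' || buf[pos] == b'^') {
        pos -= 1;
    }
    pos
}

/// Given the index of a record's newline, return where the next record starts,
/// skipping any peeled lines that follow.
fn next_record_start(buf: &[u8], line_end: usize) -> usize {
    let mut pos = (line_end + 1).min(buf.len());
    while pos < buf.len() && buf[pos] == b'^' {
        pos = buf[pos..]
            .iter()
            .position(|&b| b == b'\n')
            .map_or(buf.len(), |i| pos + i + 1);
    }
    pos
}

/// Look up the full ref `name` and return its hex object id, if present.
fn find_packed_ref<'a>(buf: &'a [u8], name: &[u8]) -> Option<&'a [u8]> {
    // Skip the optional `# pack-refs with: ...` header line.
    let mut lo = if buf.first() == Some(&b'#') {
        buf.iter().position(|&b| b == b'\n').map_or(buf.len(), |i| i + 1)
    } else {
        0
    };
    let mut hi = buf.len();
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let start = record_start(buf, mid);
        let end = buf[start..]
            .iter()
            .position(|&b| b == b'\n')
            .map_or(buf.len(), |i| start + i);
        let line = &buf[start..end];
        // A record is `<hex-oid> SP <refname>`; give up on anything else.
        let space = line.iter().position(|&b| b == b' ')?;
        let (oid, ref_name) = (&line[..space], &line[space + 1..]);
        match ref_name.cmp(name) {
            Ordering::Equal => return Some(oid),
            Ordering::Less => lo = next_record_start(buf, end),
            Ordering::Greater => hi = start,
        }
    }
    None
}
```

Partial-name lookups and prefix iteration would reuse the same search to find a starting point and then scan forward.
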
Nytelife26 commented 3 years ago

What's the status on gixp clone? I'm very much interested in helping out on that front.

Byron commented 3 years ago

gixp clone as it's seen here would only clone bare repositories. The biggest requirement for achieving work tree checkouts is to implement git-index. Doing so requires a serious investment in time and great attention to detail. There may be smaller tasks on the way but ultimately, git-index is what's needed to clone a repository with a work tree.

If this outlook isn't too frightening for you, I'd be happy to get you involved in some capacity.

Nytelife26 commented 3 years ago

I have never contributed to gitoxide so I'm not too familiar with it yet, but I learn things quickly - nothing frightens me :) so yes, I'm more than happy to try things out if you give me some pointers in the right direction.

Byron commented 3 years ago

Have you had a chance to check out the backlog here? https://github.com/Byron/gitoxide/projects/1

A good way to get acquainted with gitoxide would probably be to use it by further oxidizing some crates that are using git2 ATM but could already use gitoxide. This would inevitably lead to some features being implemented or improved along the way.

Speaking of features, one that I think is desperately needed is commit ancestor traversal sorted by commit time.

A way forward would be for you to find something you are comfortable getting started with, then we could even kick it off in a 1:1.

Just let me know.

PS: I connected to you on keybase, a way to reach out to me in a more realtime and private fashion, as needed.

pwnorbitals commented 2 years ago

@Nytelife26 @Byron Have you had the chance to make progress on this one? :)

Byron commented 2 years ago

All building blocks for a bare clone exist, they haven't been put into a cohesive package though.

A non-bare clone is in the works which will include the bare one by its very nature.