GitoxideLabs / gitoxide

An idiomatic, lean, fast & safe pure Rust implementation of Git
Apache License 2.0
9.17k stars 315 forks source link

[Azure Devops Repos] clone failure as packs use REF_DELTA objects: fix with git-style delta resolution #1025

Open bittrance opened 1 year ago

bittrance commented 1 year ago

Current behavior 😯

It seems that Azure Devops (at least over SSH) has some form of pagination when reading packs. Given the code below, fetching a trivial repo (one file, one commit) will work correctly produce the following output:

created handshake
created negotiate (round 1)
created receiving pack
created remote
created read pack
created remote: Azure Repos
Found 3 objects to send. (20 ms)
Some("read pack") incremented to Some(759)
created create index file
created indexing
created decompressing
Some("decompressing") incremented to Some(985)
Some("indexing") incremented to Some(1)
Some("decompressing") incremented to Some(1022)
Some("indexing") incremented to Some(2)
Some("decompressing") incremented to Some(1214)
Some("indexing") incremented to Some(3)
Some("read pack") Info done 759 items in 0.00s (223952 items/s)
Some("indexing") Info done 3 items in 0.00s (2480 items/s)
Some("decompressing") Info done 1214 items in 0.00s (994769 items/s)
Some("create index file") incremented to Some(1)
created Resolving
created Decoding
created thread 0
created thread 1
created thread 3
created thread 2
created thread 4
created thread 5
created thread 7
created thread 6
Some("Resolving") Info done 0 items in 0.05s (0 items/s)
Some("Decoding") Info done 0 items in 0.05s (0 items/s)
Some("create index file") incremented to Some(2)
created sorting by id
Some("create index file") incremented to Some(3)
created writing index file
created writing fan-out table
Some("writing index file") incremented to Some(1)
created writing ids
Some("writing index file") incremented to Some(2)
created writing crc32
Some("writing index file") incremented to Some(3)
created writing offsets
Some("writing index file") incremented to Some(4)
Some("writing index file") Success done 1.2KB in 0.00s (5.0MB/s)
Some("create index file") Success done 3.0 objects in 0.05s (57.0 objects/s)

When fed the URL for a a non-trivial repo, output will look like this. Note the suspicious 8192 on read pack progress. I've tested a few different repos and they all get 8192 "pack steps".

created handshake
created negotiate (round 1)
created receiving pack
created remote
created read pack
created remote: Azure Repos
Found 679 objects to send. (200 ms)
Some("read pack") incremented to Some(8192)
created create index file
created indexing
created decompressing
Some("decompressing") incremented to Some(985)
Some("indexing") incremented to Some(1)
Some("decompressing") incremented to Some(1022)
Some("indexing") incremented to Some(2)
Some("decompressing") incremented to Some(1214)
Some("indexing") incremented to Some(3)
Some("decompressing") incremented to Some(1488)
Some("indexing") incremented to Some(4)
Some("decompressing") incremented to Some(1770)
Some("indexing") incremented to Some(5)
Some("decompressing") incremented to Some(2259)
Some("indexing") incremented to Some(6)
Some("decompressing") incremented to Some(2997)
Some("indexing") incremented to Some(7)
Some("decompressing") incremented to Some(3803)
Some("indexing") incremented to Some(8)
Some("decompressing") incremented to Some(3838)
Some("indexing") incremented to Some(9)
Some("decompressing") incremented to Some(5320)
Some("indexing") incremented to Some(10)
Some("read pack") incremented to Some(16384)
Some("read pack") incremented to Some(24576)
Some("decompressing") incremented to Some(7051)
Some("indexing") incremented to Some(11)
Some("decompressing") incremented to Some(61187)
Some("indexing") incremented to Some(12)
Some("decompressing") incremented to Some(61223)
Some("indexing") incremented to Some(13)
Some("decompressing") incremented to Some(61432)
Some("indexing") incremented to Some(14)
Some("decompressing") incremented to Some(62040)
Some("indexing") incremented to Some(15)
Some("decompressing") incremented to Some(62671)
Some("indexing") incremented to Some(16)
Some("decompressing") incremented to Some(62788)
Some("indexing") incremented to Some(17)
Some("decompressing") incremented to Some(64644)
Some("indexing") incremented to Some(18)
Some("decompressing") incremented to Some(64789)
Some("indexing") incremented to Some(19)
Info done 24576 items in 0.23s (106798 items/s)

At this point it hangs. Eventually, the timeout occurs and the following error is raised:

Error: Fetch(WritePack(IndexWrite(PackEntryDecode(NotFound { object_id: Sha1(cdd16b2c111f7de5457e4c3182ed4d781a062a2a) }))))

Tho I suppose this is just a consequence of us not having received the object in question.

I'll try to dig deeper tomorrow.

Expected behavior 🤔

If it works for small Azure Devops repos, surely it should work for large :) The same code has no problem fetching a GitHub repo of non-trivial size.

Steps to reproduce 🕹

Code to reproduce the problem:

#[derive(Default)]
struct P {
    max: Option<Step>,
    step: Option<Step>,
    unit: Option<Unit>,
    name: Option<String>,
}

impl Progress for P {
    type SubProgress = P;

    fn add_child(&mut self, name: impl Into<String>) -> Self::SubProgress {
        let mut p = P::default();
        p.set_name(name.into());
        p
    }

    fn add_child_with_id(&mut self, name: impl Into<String>, _id: Id) -> Self::SubProgress {
        let mut p = P::default();
        p.set_name(name.into());
        p
    }

    fn init(&mut self, max: Option<Step>, unit: Option<Unit>) {
        self.max = max;
        self.unit = unit;
    }

    fn set(&mut self, step: Step) {
        self.step = Some(step);
        println!("{:?} start at {:?}", self.name, self.step);
    }

    fn step(&self) -> Step {
        self.step.clone().unwrap_or_default()
    }

    fn inc_by(&mut self, step: Step) {
        *self.step.get_or_insert(Step::default()) += step;
        println!("{:?} incremented to {:?}", self.name, self.step);
    }

    fn set_name(&mut self, name: impl Into<String>) {
        let name = name.into();
        println!("created {}", name);
        self.name = Some(name);
    }

    fn name(&self) -> Option<String> {
        self.name.clone()
    }

    fn id(&self) -> Id {
        todo!()
    }

    fn message(&self, level: MessageLevel, message: impl Into<String>) {
        println!("{:?} {:?} {}", self.name, level, message.into());
    }
}

fn clone_repo(
    config: &GitConfig,
    deadline: Instant,
    target: &Path,
) -> Result<Repository, Error> {
    let should_interrupt = AtomicBool::new(false);
    let cancel = AtomicBool::new(false);
    scope(|s| {
        s.spawn(|| {
            while Instant::now() < deadline && !cancel.load(Ordering::Relaxed) {
                sleep(Duration::from_secs(1));
            }
            should_interrupt.store(true, Ordering::Relaxed);
        });
        let p = P::default();
        let (repo, _outcome) = gix::prepare_clone(config.url.clone(), target)
            .unwrap()
            .fetch_only(p, &should_interrupt)?;
        cancel.store(true, Ordering::Relaxed);
        Ok(repo)
    })
}
Byron commented 1 year ago

Thanks for reporting. The effort to provide a reproducer is appreciated, even though I'd hope for a public repo URL that can be downloaded with it.

Also, which version of gix are you using? If it's not the latest, did you try to include gix as git-dependency so the latest main can be used? Further, you can try setting fetch.negotiationAlgorithm to different values to see if that resolves the issue or alters the outcome. I'd recommend just using gix clone (CLI) for testing. I'd expect that it runs in to the same issue.

When fed the URL for a a non-trivial repo, output will look like this. Note the suspicious 8192 on read pack progress. I've tested a few different repos and they all get 8192 "pack steps".

A BufReader is used internally and 8192 is the default buffer size.

Error: Fetch(WritePack(IndexWrite(PackEntryDecode(NotFound { object_id: Sha1(cdd16b2c111f7de5457e4c3182ed4d781a062a2a) }))))

This means that the remote sent a thin pack, which referenced a local object that is supposed to exist, for use as a base object. The only reason I imagine this could happen is if the negotiation step leaves the server incorrectly thinking that this object exists locally, but since gix now negotiates just like git does, I'd not expect that to happen. Older versions of gix used a naive negotiation algorithm which might have had such side-effects, particularly with custom server implementations that might have bugs on their own. After all, even in older versions of gix I have never heard of this happening.

bittrance commented 1 year ago

Contrary to my understanding, it is possible to have public repos on Azure DevOps with an MSDN subscription. I created https://dev.azure.com/bittrance/public/_git/gitoxide / git@ssh.dev.azure.com:v3/bittrance/public/gitoxide if you want to perform your own tests. The behaviour seems to be the same on HTTPS.

@Byron As you correctly noted, the example code was not on gitoxide main branch (it was release 0.51.0 if it matters). I updated my code to a git ref and re-ran my code on above repo. The read pack stage now seems to run to completion (Some("read pack") Info done 8459627 items in 6.57s (1287093 items/s)) but it still fails with Fetch(WritePack(IndexWrite(PackEntryDecode(NotFound { object_id: Sha1(daa3b520fbbb247e8057fed0e7eda8320974af77) })))).

The hanging behaviour is a property of my code in that the ? case in the scope() will block until the watchdog thread hits the deadline before returning. In fact fetch_only() fails immediately after having read the packs.

I'll be happy to dive deeper, but it may be some days before I have the time.

bittrance commented 1 year ago

I added a trivial repository https://dev.azure.com/bittrance/public/_git/die-kleine that can be used to verify that very small repos can be cloned successfully.

Byron commented 1 year ago

That's perfect, thank you! I will take a closer look but believe to have everything I need for a fix.

First of all, gix clone https://dev.azure.com/bittrance/public/_git/gitoxide does reproduce the issue - it stops fetching the pack after 8.5MB and then can't find the object daa… during pack-resolution. git clone https://dev.azure.com/bittrance/public/_git/gitoxide works perfectly though.

If you are keen to investigate before me, I have a couple of things to finish until the 25th (at most), you could use gix -c http.proxy <proxy> clone … and compare what you see with git -c http.proxy <proxy> clone ….

Please note that I couldn't reproduce it with the SSH url as it doesn't know me, but that won't be necessary as you validated it's the same behaviour. That's great as it means it's unrelated to the underlying transport.

Byron commented 1 year ago

Investigation

After a short moment of investigation it is clear that this is the code that cancels the operation: the pack, which is supposed to be complete, contains a ref-delta object, which means in order to apply this delta, it needs the named base object. Since the object doesn't exist - after all, the local repo is being populated - the lookup fails and everything unwinds.

I think what's happening here is the server is old enough to encode delta-bases with full hash instead of just the offset to a prior object. Maybe this is done because they don't sort their objects so that the base actually appears before the object referencing it, or maybe there is another reason.

git can do it because apparently it doesn't assume that base object references are used only to refer to locally existing objects, whereas all other delta-bases must be self-referential within a properly ordered pack.

Diving deeper, it turns out that the other end is a V1 server, which responds exactly the same for both gix and git:

001e# service=git-upload-pack
000000c114e0763cbf368bda476046a0fd28be230d67b1bd HEAD� multi_ack thin-pack side-band side-band-64k no-progress multi_ack_detailed no-done shallow allow-tip-sha1-in-want symref=HEAD:refs/heads/main
003d14e0763cbf368bda476046a0fd28be230d67b1bd refs/heads/main
0000

From there gix requests a pack like this:

009bwant 14e0763cbf368bda476046a0fd28be230d67b1bd thin-pack side-band-64k shallow multi_ack_detailed allow-tip-sha1-in-want no-done agent=git/oxide-0.53.1
0032want 14e0763cbf368bda476046a0fd28be230d67b1bd
00000009done

and git does so using:

0065want 14e0763cbf368bda476046a0fd28be230d67b1bd multi_ack_detailed no-done side-band-64k thin-pack
00000009done

Note that both implementations ask for Version 2 of the protocol, but only get V1.

gix generally tends to enable all features, and in this case, thin-pack is the one that causes ref-deltas to be allowed to point to known objects. The server should know that we have no known objects, but in any case, git also enables thin-pack so should receive the same kind of pack.

I may repeat the first sentence of the documentation of thin-pack:

A thin pack is one with deltas which reference base objects not contained within the pack (but are known to exist at the receiving end).

If that's to be believed, the server definitely sends an invalid pack as the client never advertised any HAVE line, it has no object.

With a quick-hack I removed the thin-pack capability advertisement when asking for a pack, which should tell the server not to produce packs that refer to objects outside of the pack itself.

0091want 14e0763cbf368bda476046a0fd28be230d67b1bd side-band-64k shallow multi_ack_detailed allow-tip-sha1-in-want no-done agent=git/oxide-0.53.1
0032want 14e0763cbf368bda476046a0fd28be230d67b1bd
00000009done

(The request doesn't mention "thin-pack" anymore)

However, the server still answers with a thin pack, which is another reason for assuming the other side is some custom, none-standard implementation.

Next up was to see if git sees a different pack than gix for other reasons.

❯ /bin/ls -l ~/Desktop/Response_Body_09-21-2023-11-26-48.folder/\[*
-rw-r--r--  1 byron  staff  45633504 Sep 21 11:27 /Users/byron/Desktop/Response_Body_09-21-2023-11-26-48.folder/[20] Response - dev.azure.com_bittrance_public__git_gitoxide_git-upload-pack
-rw-r--r--  1 byron  staff  45633504 Sep 21 11:26 /Users/byron/Desktop/Response_Body_09-21-2023-11-26-48.folder/[70] Response - dev.azure.com_bittrance_public__git_gitoxide_git-upload-pack

gitoxide ( fix-1025) +2 -4 [$!]
❯ sha1sum ~/Desktop/Response_Body_09-21-2023-11-26-48.folder/\[*
536fe9a59ff7a834f1ece763a4709e4092b7a480  /Users/byron/Desktop/Response_Body_09-21-2023-11-26-48.folder/[20] Response - dev.azure.com_bittrance_public__git_gitoxide_git-upload-pack
d022148adf21e2d1ae00f70ac4818e2705fba97d  /Users/byron/Desktop/Response_Body_09-21-2023-11-26-48.folder/[70] Response - dev.azure.com_bittrance_public__git_gitoxide_git-upload-pack

(one file received by gix, the other by git)

After removing the leader of the file which had a varying bit, the amount of milliseconds to ready the pack, I get this:

❯ sha1sum ~/Desktop/Response_Body_09-21-2023-11-26-48.folder/\[*
bbf11ac0f17cdcfe16a6009654396cd91fa96457  /Users/byron/Desktop/Response_Body_09-21-2023-11-26-48.folder/[20] Response - dev.azure.com_bittrance_public__git_gitoxide_git-upload-pack
bbf11ac0f17cdcfe16a6009654396cd91fa96457  /Users/byron/Desktop/Response_Body_09-21-2023-11-26-48.folder/[70] Response - dev.azure.com_bittrance_public__git_gitoxide_git-upload-pack

Indeed, it's the same pack is sent no matter what.

Further Study

Here is what I find amazing: git is able to resolve the pack nonetheless, but gix isn't for good reasons. It's amazing that git can do it as this would mean that while resolving a pack it keeps the index-so-far available for access by the resolver so it can actually figure out the offset to the base object in the pack (assuming it was mentioned already) from its ID. For fun, I 'hacked' the portion of the code that causes the error when streaming the pack but it fails right after with…

Error: Ref delta objects are not supported as there is no way to look them up. Resolve them beforehand.

…just because it's utterly impossible for gix pack resolution engine to deal with ID->offset resolution in the part of the program that is supposed to resolve pack entries to produce said ID->offset association which is stored in a pack index. Resolve them beforehand is not possible as the hashes of these entries aren't known yet.

Reading further, I found the ofs-delta option that indicates that apparently, thin-pack is entirely unrelated to how delta-objects are encoded, but instead it's related of whether or not PACKV2 is supported. As a matter of fact, the streaming reader only supports V2 packs, and V1 never was a thing unless the implementation is very old maybe. Unfortunately, ofs-delta isn't anything the azure server seems to support, at least it's not advertised. gix already sets it in V2 servers, for good reason, but in V1 it's not exactly possible.

I even tried sending ofs-delta along with the V1 pack request in the off-chance that the server now avoids IDs in packs, but to no avail, it still sends the very same pack.

Looking into it further, I realized that I never really studied how git unpacks objects (probably I just forgot) but went right away for the implementation that made most sense to me, which unknowingly made it very incompatible to supporting the kind of packs that this server is sending.

Anyway, I think in git it's here where the magic is happening: It stores the ref-delta, to later lookup the offset of the base object which has already been resolved - thus its ID is known.

Can gix be 'fixed'?

When looking at the reason git can resolve such a pack (which really has no reason to exist as ref-deltas are inefficient compared to a delta-encoded offset) it turns out that git explicitly supports such a case, which, against all odds, seems valid. In a way, gix has been over-fitted to deal with 'modern' packs only.

It also, clearly, is a very old way of building packs, which affects my willingness to try hard to make this possible. Unfortunately, there is no simple fix either, as the "gix pack resolution engine" really is just a pack traversal that leverages a data structure for perfect multi-threading without doing any unnecessary computation. This also leads to that data structure knowing only offsets, but never hashes - it only decodes objects and the hash is created by the index creation routine. And the reason a solution with this implementation is impossible is that the actual tree of deltas is built ahead of time, and each base is referred to only by its offset. What further complicates this is that the data structure itself is optimized for size, and adding to it will make it uncompetitive compared to the memory requirements of git.

Conclusion

At this time, I don't think there can be a fix for this as it's architecturally impossible, and not incentivised as the current engine is incredible fast and efficient while dealing with 'normal' packs. To support packs that use ref-deltas to refer to objects within the pack, one would have to change the resolution engine to something that is similar to the implementation that git uses. It's worth noting that it seemingly can also deal with ref-deltas that refer to base objects anywhere in the pack. Implementing this might happen at some point if it turns out that other features, like promisor objects, can't be support in the way it's currently done, but it's not planned as of now.

bittrance commented 1 year ago

Well, that was disappointing, but not entirely surprising. I had the "Test with azdo" task high on my todo because I know the azdo Git/SSH server is homegrown and quirky. I note that go-git also has problem supporting azdo: https://github.com/go-git/go-git/issues/64 (or the archived https://github.com/src-d/go-git/issues/1058) . go-git does not support mutli_ack so it gets a somewhat different issue.

I respect the decision to kick this can down the road. AzDO is relatively common in the (MS dominated) corporate world so gix may have to confront it at some point, but it is unlikely to be a big thing with early adopters.

I'll leave it up to you to decide whether you want to keep this issue open for discoverability or close it.

Thank you for your detailed problem analysis. :heart:

Byron commented 1 year ago

I respect the decision to kick this can down the road. AzDO is relatively common in the (MS dominated) corporate world so gix may have to confront it at some point, but it is unlikely to be a big thing with early adopters.

I'll leave it up to you to decide whether you want to keep this issue open for discoverability or close it.

I think there is value in having a possibly user-selectable alternative implementation for pack resolution, and think that it could work very similarly to how git currently does it for maximum compatibility. Contributions for this are welcome, even though I don't expect this to actually be contributed as it's a difficult topic for sure.

Maybe one day Microsoft choses to update their server side and there is no demand anymore, or they sponsor this feature, let's wait and see.

LuaKT commented 7 months ago

Hi, I think I'm also hitting this same issue with this repository I discovered, except it's hosted on Github: https://github.com/rodrigocadiz47/DApp

WritePack(IndexWrite(PackEntryDecode(NotFound { object_id: Sha1(3825fe6b335743b7b85eb2ee017fb3eda0529f1e) })))
Error: A pack entry could not be extracted

Caused by:
    The object 3825fe6b335743b7b85eb2ee017fb3eda0529f1e could not be decoded or wasn't found

Is it possible for this issue to affect some repositories on Github too or should I open a new issue for this?

Byron commented 7 months ago

Thanks for letting me know, and for finding just the right issue to put it in!

It's really good to know that this isn't exclusively an Azure DevOps problem. Maybe it has been a good problem a couple of years ago, and Azure never moved on from that. That theory seems to be somewhat justified by seeing that the most recent commit in DApp was two years ago. I am pretty sure that GitHub has this pack cached, and maybe it was created with a version of Git that produced this kind of 'buggy' (as in larger than needed) pack.

To put it to the test, I forked the Repo, reproduced the issue in the new fork, and made minor changes to the Readme. The issue still reproduces perfectly, probably due to Git just doing a pack-to-pack copy so the objects stay exactly the same.

❯ gix clone https://github.com/Byron/DApp-Pack-Testing
 15:49:21 read pack done 8.2KB in 0.03s (276.8KB/s)
Error: A pack entry could not be extracted

Caused by:
    The object 3825fe6b335743b7b85eb2ee017fb3eda0529f1e could not be decoded or wasn't found
NicholasLYang commented 6 days ago

This issue has showed up for me when implementing a toy git server with gitoxide's infrastructure. When pushing to a completely empty repo, my local git (git version 2.35.3) seems to send a pack with ref deltas that reference objects in the same pack.

Byron commented 6 days ago

Thanks a lot for sharing, even though I definitely don't like what I hear. After all, this would mean that for some reason, recent Git also does that even though it really shouldn't make any sense to do - size-wise it's always better to use offset-deltas to refer to previous objects, so that ref-deltas are only used to refer to out-of-pack objects (which don't exist if the remote is empty).