KhronosGroup / Vulkan-Samples

One stop solution for all Vulkan samples
Apache License 2.0

Problems cloning repository submodules #1180

Closed oddhack closed 1 month ago

oddhack commented 1 month ago

Update: the issue is to make a full repository clone, with submodules, work without fatal submodule errors. @gpx1000 volunteered to look at this, although they are not running into it on their own home network, which may make reproduction difficult. However, both @SaschaWillems and I are running into it frequently. The desired outcome is, first, to change the repository so this does not happen; or, if that's not possible, to give better advice in the README on how to recover when it does.

I spun off the issue of how to build just the Vulkan-Samples component of docs.vulkan.org as #1181 as that's orthogonal to the submodule cloning problem.


When I try

git clone --recurse-submodules git@github.com:KhronosGroup/Vulkan-Samples.git

as described in the README, I'm getting errors of the following form in the submodule cloning:

Cloning into 'Vulkan-Samples'...
Submodule 'assets' (https://github.com/KhronosGroup/Vulkan-Samples-Assets) registered for path 'assets'
...
Submodule 'third_party/vulkan' (https://github.com/KhronosGroup/Vulkan-Headers) registered for path 'third_party/vulkan'
...
Cloning into '/home/tree/git/Vulkan-Site/Vulkan-Samples/third_party/fmt'...
error: RPC failed; curl 92 HTTP/2 stream 5 was not closed cleanly: CANCEL (err 8)
error: 7908 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
fatal: clone of 'https://github.com/fmtlib/fmt' into submodule path '/home/tree/git/Vulkan-Site/Vulkan-Samples/third_party/fmt' failed
Failed to clone 'third_party/fmt'. Retry scheduled
Cloning into '/home/tree/git/Vulkan-Site/Vulkan-Samples/third_party/glfw'...
error: RPC failed; curl 92 HTTP/2 stream 5 was not closed cleanly: CANCEL (err 8)
error: 7539 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
fatal: clone of 'https://github.com/glfw/glfw' into submodule path '/home/tree/git/Vulkan-Site/Vulkan-Samples/third_party/glfw' failed
Failed to clone 'third_party/glfw'. Retry scheduled
Cloning into '/home/tree/git/Vulkan-Site/Vulkan-Samples/third_party/glm'...
...
Failed to clone 'third_party/imgui' a second time, aborting

I'll grant this could just be something about my ISP (AT&T Fiber, haven't had any issues in the last 10 months that weren't just temporary outages). But this operation feels very fragile - are there options that can help? If I then go into the clone and

git submodule update

Then this does not complain. But the submodules that failed seem to just contain a .git directory and nothing else afterwards. Then

cmake -H"." -B"build/unix" -DVKB_GENERATE_ANTORA_SITE=ON

gets a small distance into the build and starts failing with

...
-- Plugin `window_options` - BUILD
-- Configuring done
CMake Error at framework/CMakeLists.txt:462 (add_library):
  Cannot find source file:

    /home/tree/git/Vulkan-Site/Vulkan-Samples/third_party/CTPL/ctpl_stl.h

  Tried extensions .c .C .c++ .cc .cpp .cxx .cu .mpp .m .M .mm .ixx .cppm .h
  .hh .h++ .hm .hpp .hxx .in .txx .f .F .for .f77 .f90 .f95 .f03 .hip .ispc

Here the CTPL directory contains just .git, and nothing I've tried with 'git submodule' changes that situation. I can clone the underlying submodule's repository independently of the samples repo's submodule setup, but I cannot convince git that the submodule itself needs to be updated / replaced.

It would be really helpful if the README described these sorts of scenarios and advised how to work around them as this is far outside my minimal knowledge of submodules. I've tried removing Vulkan-Samples and re-cloning - the same sorts of failures (but not the same specific submodules) keep happening. It takes 20-30 minutes for a complete cycle with all the retries to complete and it's not easy to test.

SaschaWillems commented 1 month ago

error: RPC failed; curl 92 HTTP/2 stream 5 was not closed cleanly: CANCEL (err 8)
error: 7908 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF

Looks like GitHub is having a bad moment (again). When your connection is flaky or GitHub has server issues, cloning submodules tends to fail with the above message.

Nothing we can fix, and the only "solution" is to keep fetching the submodules until it works. Our repo is rather prone to this because we use lots of submodules. For submodules that have only been partially fetched, afaik you can cd into that folder and fetch manually there?
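
A minimal sketch of the "keep trying until it works" approach, assuming a POSIX shell. The `retry` helper is hypothetical (it is not part of the repository), and whether repeating the update actually repairs a partially cloned submodule is not established in this thread:

```shell
#!/bin/sh
# retry MAX DELAY CMD...: run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts.
# Returns 0 as soon as CMD succeeds, 1 if every attempt failed.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max failed: $*" >&2
        attempt=$((attempt + 1))
        if [ "$attempt" -le "$max" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Usage, from the top of an existing Vulkan-Samples clone:
#   retry 5 10 git submodule update --init --recursive
```

This avoids deleting the tree and re-cloning from scratch between attempts; the attempt count and delay shown in the usage line are arbitrary.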

oddhack commented 1 month ago

Is there some way to "try fetching submodules" short of cleaning the entire tree and starting from scratch? I can't spend 20-30 minutes on every failure. 'git submodule update' does nothing in the failed submodules, as best I can tell - nor does it complain. The submodule structure is messed up after the clone, but git does not seem to be aware that it's messed up, which is heinous.
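
One way to at least see which submodules were left in that state, sketched under the assumption that a failed submodule shows up as a directory holding only a `.git` entry (as with CTPL above). Both helper names are hypothetical:

```shell
#!/bin/sh
# only_dot_git DIR: succeed if DIR contains a .git entry and nothing else.
only_dot_git() {
    [ -e "$1/.git" ] || return 1
    # List every entry, dotfiles included, then drop the .git entry itself.
    rest=$(ls -A "$1" | grep -v -x '\.git')
    [ -z "$rest" ]
}

# list_hollow_submodules: read submodule paths on stdin (one per line)
# and print those that contain nothing but .git.
list_hollow_submodules() {
    while IFS= read -r path; do
        if only_dot_git "$path"; then
            printf '%s\n' "$path"
        fi
    done
}

# Usage, from the top of the superproject:
#   git config --file .gitmodules --get-regexp '\.path$' \
#       | cut -d' ' -f2- | list_hollow_submodules
```

The script only reports paths; it does not attempt to repair them.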

oddhack commented 1 month ago

I see you edited that while I was typing. Will attempt that but really, git?

SaschaWillems commented 1 month ago

git (and GitHub) being so flaky is the reason I personally only use submodules where I have to. And our samples repo uses so many submodules that it's very prone to failure. I run into the exact same problem you report several times a week and it's always frustrating.

SaschaWillems commented 1 month ago

If you only need to clone for building the documentation, we should find a way to make the Asciidoc part build without having to clone the whole repo. @gpx1000 any ideas on how to achieve that? Building docs doesn't require any submodule, so if they could be skipped things would probably be a lot easier for Jon.

oddhack commented 1 month ago

If you only need to clone for building the documentation, we should find a way to make the Asciidoc part build without having to clone the whole repo. @gpx1000 any ideas on how to achieve that? Building docs doesn't require any submodule, so if they could be skipped things would probably be a lot easier for Jon.

That would be great as it drops the repo cloning time from substantial fractions of an hour to seconds (and frees up 3 GB on my SSD). However, doing the 'cmake -H"." -B"build/unix" -DVKB_GENERATE_ANTORA_SITE=ON' then throws a lot of errors regarding missing CMakeLists files and targets and such, so it sounds like some substantial work on the cmake configuration would be needed.

oddhack commented 1 month ago

Agreed re submodules, they are the spawn of Satan.

SaschaWillems commented 1 month ago

In other projects we did split the compiler project and documentation setup/build processes. I think having a separate cmake/make file or even something simpler for the samples repo is a workable solution.

gpx1000 commented 1 month ago

Have you tried git submodule update --init --recursive?

Submodules are a part of git and are quite mature and stable when used correctly.

gpx1000 commented 1 month ago

I'll come up with a method of building docs without needing any further projects downloaded but this one... Gimme time to think it through.

SaschaWillems commented 1 month ago

Maybe something similar to what I did with the tutorial: https://github.com/KhronosGroup/Vulkan-Tutorial/tree/main/antora?

A separate documentation makefile with some python script to do the heavy lifting (which might not be required for this repo).

gpx1000 commented 1 month ago

Well, if I can, I'd prefer to use CMake and not pollute the project with other build systems. However, I do like what you did with Vulkan-Tutorial.

oddhack commented 1 month ago

Have you tried git submodule update --init --recursive?

Submodules are a part of git and are quite mature and stable when used correctly.

I am just following the recipe in the README. Git isn't responsible for network issues (arguably, at least), but I don't think it's on the user that after failing to download the submodules and throwing a bunch of error messages, git leaves the failed submodules in a broken state with no advice as to how to fix them.

Since this appears to be happening with some regularity to one of the repository admins as well, adding further advice to the README about how to recover from it seems useful.

gpx1000 commented 1 month ago

git config --global http.lowSpeedLimit 0       # Disable low speed limit
git config --global http.lowSpeedTime 999999   # Set low speed time limit to a large value

I think this might stem from lower bandwidth. The above might help. Please report back if issue persists. Once we know what the solution is, we can update the README as appropriate.

oddhack commented 1 month ago

git config --global http.lowSpeedLimit 0       # Disable low speed limit
git config --global http.lowSpeedTime 999999   # Set low speed time limit to a large value

I think this might stem from lower bandwidth. The above might help. Please report back if issue persists. Once we know what the solution is, we can update the README as appropriate.

Currently I'm getting clone speeds of 30-50 KiB/s from GitHub, and 25 MiB/s from Khronos gitlab. Curious if you're seeing anything like that. Maybe GitHub is just extremely congested. It makes no difference whether I'm cloning Vulkan-Samples or another GitHub repo, or whether submodules are involved or not: same dismal download performance.

oddhack commented 1 month ago

The problem with testing is that it takes ca. half an hour of this sluggish performance before the fatal errors occur. If I crank up the lowSpeedTime timeout then it's very possible I would be waiting lowSpeedTime seconds, or until github rebooted their servers, whichever comes first.

@outofcontrol are you seeing this kind of github speed throttling going on at your end? E.g. 'git clone --recurse-submodules git@github.com:KhronosGroup/Vulkan-Samples.git' getting delivered performance in the range of 30-50 KiB/s when it should be hundreds of times that.

gpx1000 commented 1 month ago

I'm getting between 12 and 18MiB/s

running:

time git clone --recurse-submodules git@github.com:KhronosGroup/Vulkan-Samples.git

yields the following results:

real 2m33.818s
user 2m9.345s
sys 0m18.665s

I do see people reporting that the GitHub CLI is faster for them and others posit that this is caused by negotiating the security and that using a personal access token might improve your speed.

outofcontrol commented 1 month ago

Running locally with:

\time git clone --recurse-submodules git@github.com:KhronosGroup/Vulkan-Samples.git
86.00 real        78.62 user        18.70 sys

Total size is 3.4G, which is ~40 MiB/s if I am not mistaken?
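
That estimate can be checked with a one-line calculation. This sketch assumes "3.4G" means GiB and that the on-disk size approximates the bytes transferred, which likely overstates it since a checkout also holds unpacked working trees; `throughput_mib_s` is a hypothetical helper:

```shell
#!/bin/sh
# throughput_mib_s GIB SECONDS: average rate in MiB/s for GIB gibibytes
# moved in SECONDS seconds of wall-clock time.
throughput_mib_s() {
    awk -v gib="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", gib * 1024 / secs }'
}

# 3.4 GiB in 86.00 s of "real" time, as reported above
throughput_mib_s 3.4 86.00   # prints 40.5
```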

oddhack commented 1 month ago

I do see people reporting that the GitHub CLI is faster for them and others posit that this is caused by negotiating the security and that using a personal access token might improve your speed.

I'm not sure what the "GitHub CLI" is? I'm using the command-line git client. Having a hard time imagining what sort of authentication might cause a 500x slowdown.

oddhack commented 1 month ago

I tried using 'gh clone' and am getting (edit:) far worse behavior than with plain 'git'. It took 30 minutes to complete and every single submodule failed to download, though the main repo did. 'gh auth status' says

Possibly the behavior is related to the ISP's network configuration. I tried switching from AT&T's nameserver to the Google 8.8.8.8 but continue to see some of the submodules download at glacial speeds and fail (either way, most download at very reasonable 150-400 Mbps rates - it is only a couple of submodules in each attempt, and not the same ones each time, either).

I have seen suggestions of going through a VPN to avoid ISP routing issues and I might try Sonic's self-hosted VPN service, not having a commercial VPN. Not a solution that would be generally useful if it were to work.

gpx1000 commented 1 month ago

It is possible that a VPN showing you're coming from another country will do the trick? Obviously this isn't something that we can correct with a README or documentation. However, maybe?

KhronosWebservices commented 1 month ago

Is it possible there are some transient issues in a router somewhere, which might be discoverable with mtr or traceroute?

oddhack commented 1 month ago

Is it possible there are some transient issues in a router somewhere, which might be discoverable with mtr or traceroute?

I don't know what to make of it but there is certainly a lot of packet loss reported at sites in the middle by mtr - though not at the destination.

mtr

OTOH if I run mtr against gitlab.com, where I have had zero problems recently, the mtr results look very similar, several sites in the middle with '???' hostname and 100% packet loss reported. If you can suggest a line of approach to get this past an L1 AT&T CSR / "AI" chatbot to someone who might actually be able to do something with their network, I'm all ears.

oddhack commented 1 month ago

Possibly slightly related, for some months now I've been having periodic problems loading github.com PR / Issue pages in Chromium and it is going on with a vengeance at the moment. Reloading in the tab does nothing, opening a new tab with the same URL sometimes works after 3 or 4 tabs. Firefox has no problems at all. It sort of sounds like the same problem with packet loss corrupting sessions although Chromium reports no errors, just sits there and spins forever.

KhronosWebservices commented 1 month ago

Might be worth pinging the OrgTechEmail when doing a whois on 192.205.32.182? Amazingly enough, I've had replies from doing a similar outreach in past years.

A VPN might reroute you around the problem servers?

oddhack commented 1 month ago

Is OrgTechEmail something from the DNS record? Not familiar with the term.

KhronosWebservices commented 1 month ago

In the WhoIs look up, there are several fields for contact, OrgTechEmail is the Organization Technical Email contact for that 192.205.0.0/16.
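
As a rough illustration of pulling that field out of a lookup, assuming a `whois` client whose output uses `Field: value` lines with an `OrgTechEmail` field, as described above. The `org_tech_email` helper name is hypothetical:

```shell
#!/bin/sh
# org_tech_email: read whois output on stdin and print the value of every
# OrgTechEmail field, one per line.
org_tech_email() {
    awk -F': *' '$1 == "OrgTechEmail" { print $2 }'
}

# Usage (needs network access and a whois client):
#   whois 192.205.32.182 | org_tech_email
```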

oddhack commented 1 month ago

In the WhoIs look up, there are several fields for contact, OrgTechEmail is the Organization Technical Email contact for that 192.205.0.0/16.

They never replied, and attempts to pursue this through AT&T's "customer service" work about as well as you might expect (including initial denials that AT&T had either an internal network, or network engineers responsible for maintaining it). Now they want to replace my modem :-(

Running through Sonic's OpenVPN service, downloads are robust - they run at about 1/8th the performance I see from github w/o VPN (when it's not stalled), but at least that's better than running at 1/1000th that performance when it is stalled, and the VPN setup is not failing due to timeouts. Supposedly WireGuard-based VPNs come a lot closer to the connection capacity but that's not an option here.

I think the takeaway here is that AT&T has a problem they are unwilling to diagnose - so my only options seem to be to run a higher performance VPN, or switch ISPs. Fortunately I think Sonic may have finally rolled out fiber in my neighborhood, which I've been waiting for since 2018, so that may be the right option. It seems advisable to make a note of this in the README, since at least @SaschaWillems appears to be suffering similar problems. I would be curious whether using a VPN also improves the situation for you when you're seeing this behavior.

SupinePandora43 commented 1 month ago

git clone git@github.com:KhronosGroup/Vulkan-Samples.git
cd Vulkan-Samples
perl -i -p -e 's|https://(.*?)/|git@\1:|g' .gitmodules
git submodule sync
git submodule update --init

I was able to get it working by using ssh instead
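
To make the substitution concrete, here is a sed equivalent of the perl one-liner as a filter, so its effect on a single URL can be inspected before touching .gitmodules. The `https_to_ssh` name is hypothetical:

```shell
#!/bin/sh
# https_to_ssh: rewrite https://HOST/PATH to git@HOST:PATH on every line
# read from stdin; text without an https:// URL passes through unchanged.
https_to_ssh() {
    sed 's|https://\([^/]*\)/|git@\1:|g'
}

# Example with one of the submodule URLs from the error log:
echo 'https://github.com/fmtlib/fmt' | https_to_ssh   # prints git@github.com:fmtlib/fmt
```

git also has a `url.<base>.insteadOf` configuration setting that rewrites URL prefixes without editing .gitmodules; whether that suits this repository is untested here.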

had troubles with CTPL, fixed it with:

git submodule deinit -f third_party/CTPL
git submodule update --init

oddhack commented 1 month ago

@SupinePandora43 thanks! This does seem to help, for me, but there are issues with ssh protocol that may make it hard to just drop into the repository as the default behavior - see #1206. If you know more about the tradeoffs and could comment, that would be welcome.