FOGProject / fos

FOG Operating System
Erasing GPT/MBR slowness #19

Closed Sebastian-Roth closed 5 years ago

Sebastian-Roth commented 5 years ago

As posted in the forums (1, 2, 3, 4, 5) we seem to have an issue where erasing old partition tables is taking a very long time (4-5 minutes compared to a couple of seconds).

So far we know that kernel 4.15.2 does not have the issue but 4.16.6 does. We started testing and so far it looks like all 4.15.x kernels are good.

Great find by @Quazz: https://www.clonezilla.org/downloads/stable/changelog.php

Clonezilla live 2.5.6-22 ...

  • Downgrade the Linux kernel to 4.16.16-2 due to an issue of Linux kernel 4.17 that accesses local device very slow.

Though the kernel versions don't seem to match it seems like others are seeing similar things as well.

Sebastian-Roth commented 5 years ago

Alright, in our tests it turns out that 4.16.4 introduced the issue. @Quazz tested all the versions and found that 4.15.3 to 4.16.3 were all working nice and fast. 4.16.4 and later kernels show the issue. Working on bisecting the kernel commits now. More tests will follow.

Sebastian-Roth commented 5 years ago

One interesting find by @Quazz is that the issue only triggers when a normal job is scheduled. Neither running sgdisk -Z by hand nor scheduling a debug task shows the same behavior! But it happens reliably on normal tasks using 4.16.4 and newer kernels. So I started building kernel images for all the commits between 4.16.3 and 4.16.4 to figure out what exactly is causing this - changelog.

Quazz commented 5 years ago

Doing this in order of testing

001 - Slow
196 - Slow
100 - Slow
150 - Slow
050 - Slow
195 - Slow
025 - Slow

First commit (196) is this one https://github.com/torvalds/linux/commit/e09070c51b280567695022237e57c428e548b355 I think.

Quazz commented 5 years ago

Since Sebastian told me that he didn't experience slowness on his Oracle VirtualBox VMs, I figured I'd play around with some settings to see if maybe I can narrow down where to look.

When I disable IO-APIC (and thus only use one core), erasing goes fast again!

I thought kernel parameter nosmp would thus have the same result, but no dice. Makes this even stranger to me!

Quazz commented 5 years ago

Interesting stuff: with IO-APIC enabled again, erasing goes slow as expected.

However, if you wait 3-5 seconds after it gets 'stuck' and then press a key such as PageUp several times in a row, the task manages to complete in a timely fashion. Other keys that produce weird output on the console work as well (interestingly enough, Shift+PageUp seems to work too).

I have no idea what causes that; maybe it has something to do with the console? The task only manages to complete when something changes on the console, which could also help explain why debug mode is fine.

Just a thought (unsubstantiated), but perhaps it's not the sgdisk command getting stuck; perhaps it isn't even invoked yet because the console output is 'still going'. I don't have anything to back that up, but it may be worth looking into.

edit: just tested and it is 100% sgdisk getting stuck. And it's sgdisk in general, not specifically sgdisk -Z

edit2: I used postinit scripts to replace sgdisk -Z with wipefs -a which seems to work the same and finishes instantly.

Exit codes might be different of course. Here's the man page anyway, could be interesting:

https://linux.die.net/man/8/wipefs
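The postinit replacement mentioned above might look roughly like the sketch below. This is an illustration only, not the script that was actually used: the function name erase_partition_table and the DRY_RUN switch are made up for this example, and it defaults to printing the command rather than wiping anything.

```shell
#!/bin/sh
# Sketch only: erase partition-table signatures with wipefs instead of sgdisk -Z.
# erase_partition_table and DRY_RUN are hypothetical names used for illustration.
DRY_RUN=${DRY_RUN:-1}   # default: only print the command; set to 0 to really wipe

erase_partition_table() {
    disk=$1
    if [ -z "$disk" ]; then
        echo "usage: erase_partition_table <device>" >&2
        return 2
    fi
    if [ "$DRY_RUN" = "1" ]; then
        # Show what would be executed without touching the disk.
        echo "wipefs -a $disk"
        return 0
    fi
    # -a erases all signatures wipefs can detect on the device (destructive!).
    wipefs -a "$disk"
}

erase_partition_table /dev/sda
```

Whether wipefs -a leaves the disk in exactly the same state as sgdisk -Z is not verified here, so the result and the exit code should be checked before relying on it.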

Sebastian-Roth commented 5 years ago

I got something horribly wrong when building those 196 kernel binaries. Not sure what exactly, but sorry for that!

Talking to @Quazz about it I had the idea that this could be related to the kernel random number generator. It uses different types of user input as sources of randomness. Keyboard input is one of them. You can hit pretty much any key on the keyboard and after 10-15 key presses it has gathered enough and finishes straight away. This is also why we don't see the issue in debug mode. The random number generator is already filled enough by the time we type in the commands, I suppose.

Now testing step by step...

26696cdda301830a16511391a3b1515c9b3b17fb (nr. 196) = fast
ab5860f5ce700bc4becc4d6abf01cc380c7ffe85 (nr. 035) = fast
1d0d9058215e75533f01fbb3db93621f142e1a3d (nr. 025) = slow
6efa23d5851f1702a3cddbdde63607ea6588b665 (nr. 032) = slow
89b59f050347d376c2ace8b1ceb908a218cfdc2e (nr. 033) = slow
cd8d7a5778a4abf76ee8fe8f1bfcf78976029f8d (nr. 034) = slow

So we found which kernel commit is causing this: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=cd8d7a5778a4abf76ee8fe8f1bfcf78976029f8d
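The narrowing down above can be framed as a binary search over the numbered builds (196 = oldest, 001 = newest). The sketch below is only an illustration: is_slow is a stand-in for booting a test kernel and timing the erase step by hand, stubbed here with the results reported in this thread, and it assumes that every build after the offending commit is slow.

```shell
#!/bin/sh
# Hypothetical predicate: in practice each check meant booting a test kernel
# and timing the erase step. Here it is stubbed from the results in this thread
# (build 035 was fast, build 034 and all newer builds were slow).
is_slow() {
    [ "$1" -le 34 ]
}

# Builds are numbered from 196 (oldest) down to 1 (newest).
# Invariant: build $fast is known fast, build $slow is known slow, fast > slow.
find_first_slow() {
    fast=$1
    slow=$2
    while [ $((fast - slow)) -gt 1 ]; do
        mid=$(( (fast + slow) / 2 ))
        if is_slow "$mid"; then
            slow=$mid    # the offending commit is $mid or older
        else
            fast=$mid    # the offending commit is newer than $mid
        fi
    done
    echo "$slow"
}

first_slow=$(find_first_slow 196 1)
echo "first slow build: nr. $first_slow"
```

With a kernel git checkout, git bisect start v4.16.4 v4.16.3 followed by git bisect good / git bisect bad after each test performs the same halving without numbering the builds by hand.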

leshik commented 5 years ago

Does any workaround exist for this issue when using new kernels?

Quazz commented 5 years ago

@leshik The hang is caused by a lack of available entropy for the rng which is used by the erasing step.

Currently you should either use kernel 4.15.2 (or older) or if you have to use the new kernels, you can generate entropy manually by pressing keys and moving the mouse at the station in question.

While we did find the kernel commit that causes this to occur, I think it is best to patch the affected packages (or wait for an update to them). Preliminary testing looks good anyway.
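To watch the entropy shortage described above while pressing keys, the kernel's own estimate can be read from /proc/sys/kernel/random/entropy_avail. The sketch below is a rough indicator only: the helper names and the 128-bit threshold are assumptions made for this example, and the estimate does not tell you exactly when a blocked call will return.

```shell
#!/bin/sh
# Sketch: report the kernel's entropy estimate and compare it to a threshold.
# entropy_bits, has_enough_entropy and the 128-bit default are illustrative
# assumptions, not values taken from FOG or the kernel.
ENTROPY_FILE=${ENTROPY_FILE:-/proc/sys/kernel/random/entropy_avail}

entropy_bits() {
    # Prints the current estimate in bits; fails if the file cannot be read.
    cat "${1:-$ENTROPY_FILE}"
}

has_enough_entropy() {
    # $1 = minimum number of bits, $2 = optional file to read instead of /proc
    min_bits=${1:-128}
    bits=$(entropy_bits "$2") || return 2
    [ "$bits" -ge "$min_bits" ]
}

if has_enough_entropy 128; then
    echo "entropy pool looks sufficiently filled"
else
    echo "entropy is low (or unreadable); calls that wait for randomness may block"
fi
```

Running entropy_bits repeatedly while typing on the affected machine should show the number going up.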

leshik commented 5 years ago

@Quazz does the patch exist already? I was unable to find any PRs related to crng in FOG.

Sebastian-Roth commented 5 years ago

@leshik It's not actually an RNG (random number generator) issue caused by FOG code but a combination of the kernel change (as posted above) and the buildroot toolchain. We just figured out that newer buildroot versions fixed that issue for us (ref - search for Util-linux: Fix blocking on getrandom()). @Quazz and I already did test builds and the slowness issue is indeed fixed! But updating to the latest buildroot version comes with some minor hurdles that we need to fix properly before releasing new buildroot init files for FOG.

Coming soon!

Sebastian-Roth commented 5 years ago

Had to fix a couple of things when updating to the latest Buildroot environment that took me a little while. Latest inits and kernels are now uploaded. To use those run:

sudo -i
cd /var/www/fog/service/ipxe
mv bzImage bzImage.orig
wget https://fogproject.org/kernels/bzImage
mv bzImage32 bzImage32.orig
wget https://fogproject.org/kernels/bzImage32
mv init.xz init.xz.orig
wget https://fogproject.org/inits/init.xz
mv init_32.xz init_32.xz.orig
wget https://fogproject.org/inits/init_32.xz

Closing this issue as fixed now!

Sebastian-Roth commented 5 years ago

@Quazz Haha, seems like we are running into the same kind of issue with 2019.02.1 but this time it's not util-linux's libuuid call but openssh ssh-keygen hanging!!

Seems like this commit in openssh later made it into Buildroot and now we see the same hang as we earlier had on "Erasing GPT/MBR ..." (hanging on uuid generation).

Sebastian-Roth commented 5 years ago

After an extensive debugging session I was able to find a nice solution to this problem. Fixed in 61abe486.

I tracked it down to where ssh-keygen calls RAND_status, and that call seems to block on some machines, maybe just virtual machines. Adding haveged works great to fill the entropy pool on bootup, so ssh key generation no longer hangs and the user does not have to add entropy by using the keyboard either.