fosslinux / live-bootstrap

Use of a Linux initramfs to fully automate the bootstrapping process

Kernel Panic when building with QEMU #353

Closed: ajherchenroder closed this 7 months ago

ajherchenroder commented 8 months ago

I received a kernel panic when I ran the new version under QEMU. I reran the bootstrap using a chroot and it completed successfully. Here is the error message I received:

gcc-13.1.0: build successful
[20776.937384] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00
[20776.937384]
[20776.937991] CPU: 1 PID: 1 Comm: init Not tainted 4.9.10-gnu_1 #1
[20776.938451] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch4
[20776.939174] Call Trace:
[20776.939527] Kernel Offset: 0xc000000 from 0xc1000000 (relocation range: 0xc0)
[20776.940466] ---[ end Kernel panic - not syncing: Attempted to kill init! exi0
[20776.940466]

The command line was:

./rootfs.py -p --update-checksums --build-kernels -q

At this point I'm not sure if this is a "me" problem or a genuine issue. Can someone please attempt to duplicate the issue? In addition, I can't seem to find the new auto-generated bash scripts to even begin troubleshooting. Any help would be greatly appreciated.

fosslinux commented 8 months ago

This is a "not-a-bug but bad behaviour" kind of thing.

This is fully expected behaviour right now! Just to clarify what is going on here:

  1. The new kernel runs /init script
  2. This calls the build script which builds all the packages
  3. After gcc-13.1.0 is built, the build script ends, and then the init script ends
  4. This causes the kernel to panic "Attempted to kill init!", as init ended (which should never really happen on a proper Linux system)

I'll re-add the bash prompt at the end of the build that we had before the new version.
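The sequence above, and the idea of ending on a prompt instead of returning, can be sketched roughly as follows. This is only an illustration, not live-bootstrap's actual /init: the function name `run_as_init` and the example paths are hypothetical.

```shell
# Hypothetical init wrapper, not live-bootstrap's actual /init.
# Usage: run_as_init BUILD_CMD FALLBACK_CMD
run_as_init() {
    build_cmd=$1
    fallback_cmd=$2

    # Run the build in a child shell so its exit cannot end this process.
    sh -c "$build_cmd"
    status=$?
    echo "build finished with status $status"

    # Replace this process with the fallback command (an interactive shell
    # in practice). If this process is PID 1, PID 1 now lives as long as
    # that shell does. Unquoted on purpose so "cmd arg" splits into words.
    exec $fallback_cmd
}

# Example use at the end of an init script (both paths are assumptions):
# run_as_init /build.sh /bin/bash
```

Note that if the shell started this way exits, PID 1 exits with it and the kernel would panic in the same way, so the machine still has to be powered off from the prompt.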

ajherchenroder commented 8 months ago

Ok I think I understand what you’re saying. So in effect, when the build script ends, process 1 (init) is terminated causing the kernel panic. This brings up an unintended consequence of the new build system. I run in QEMU almost exclusively. Under the new system the generated scripts are not available in the /tmp directory like they are in a chroot. They are generated inside the image that QEMU has mounted. I am trying to figure out a way to get at them for troubleshooting/auditing purposes. Right now I have to wait for the run to crash and then mount the image to investigate. There doesn’t seem to be a way to look at them before they run.

fosslinux commented 8 months ago

> Ok I think I understand what you’re saying. So in effect, when the build script ends, process 1 (init) is terminated causing the kernel panic. This brings up an unintended consequence of the new build system.

Correct.

> They are generated inside the image that QEMU has mounted. I am trying to figure out a way to get at them for troubleshooting/auditing purposes. Right now I have to wait for the run to crash and then mount the image to investigate. There doesn’t seem to be a way to look at them before they run.

They are generated within a few minutes of the bootstrap starting, but at that point it's really hard to get to files. The easiest way is to wait until the Linux transition occurs and then mount the disk (they'll already be there; you don't have to wait for it to finish completely). Alternatively, inject a shell into an earlier stage and observe them there.
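As a rough sketch of the "mount the disk" route on the host: the helper below attaches an image read-only through a loop device and mounts one partition. It assumes a raw disk image with a partition table; the function name `inspect_image`, the image path, the partition number and the mount point are all hypothetical, and the real commands need root. Setting `DRY_RUN=1` only prints what would be run.

```shell
# Hypothetical helper: attach a raw disk image read-only and mount one
# partition. Image path, partition number and mount point are assumptions.
# Set DRY_RUN=1 to print the commands instead of running them.
inspect_image() {
    image=$1
    mountpoint=$2
    part=${3:-1}

    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Placeholder device name; losetup picks the real one.
        loopdev=/dev/loopN
        echo "losetup --find --show --partscan --read-only $image"
        echo "mount -o ro ${loopdev}p${part} $mountpoint"
        return 0
    fi

    # --show prints the loop device that was allocated for the image.
    loopdev=$(losetup --find --show --partscan --read-only "$image") || return 1
    mount -o ro "${loopdev}p${part}" "$mountpoint"
}

# Example (hypothetical paths): DRY_RUN=1 inspect_image disk.img /mnt/lb 1
```

Mounting read-only is a deliberately cautious choice here, so that looking at the generated scripts does not change the image.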

Why do you want to do that, though? If you want to look at the generated scripts, chroot/bwrap mode would make a lot more sense. If you then also want the QEMU scripts, modify steps/bootstrap.cfg within chroot to match a QEMU build and re-run script-generator; then the scripts will be the same as QEMU mode.

ajherchenroder commented 8 months ago

> Why do you want to do that, though? If you want to look at the generated scripts, chroot/bwrap mode would make a lot more sense. If you then also want the QEMU scripts, modify steps/bootstrap.cfg within chroot to match a QEMU build and re-run script-generator; then the scripts will be the same as QEMU mode.

I respectfully remind you that you wrote the new system so that methodology is obvious to you. Until you wrote your response that methodology was not obvious to me. In general I prefer to test using QEMU because it exercises the bootstrapped kernel and I get a better feel for how it would operate on bare metal. I would like to operate purely on bare metal but I don't have any machines left that will bios boot. QEMU is the closest I can get. (I'm a hardware/systems guy by trade so I have a bias toward the bare metal.) My end goal is to be able to host a Gentoo prefix on the bootstrapped system. Unfortunately there is a wide gulf between the end state of the bootstrap process and having a system capable of hosting a Gentoo prefix. To get there, I need to add a bunch of packages to the end of the bootstrap to get the system to a state where I can apply the prefix. Those are the packages I usually find myself troubleshooting. Anyway, thanks for the explanation; I think I have a path forward.

fosslinux commented 8 months ago

> I respectfully remind you that you wrote the new system so that methodology is obvious to you. Until you wrote your response that methodology was not obvious to me.

That is totally fine, I'm here to help - I hope I did not come off as abrasive; sorry if I did. I was more wondering if there was a specific problem you were running into that made you specifically want to use QEMU, rather than attacking your methodology (although, I do see how it could have been read that way!)

> In general I prefer to test using QEMU because it exercises the bootstrapped kernel and I get a better feel for how it would operate on bare metal. I would like to operate purely on bare metal but I don't have any machines left that will bios boot. QEMU is the closest I can get.

I respect this, and I think it makes a lot of sense, particularly for the end stages of the bootstrap. stikonas and I tend to do most of our debugging/testing in chroot/bwrap for the early stages because it's so much easier to inspect the state of the system. Not to say it's impossible with QEMU - just harder.

> Unfortunately there is a wide gulf between the end state of the bootstrap process and having a system capable of hosting a Gentoo prefix.

I agree.

> My end goal is to be able to host a Gentoo prefix on the bootstrapped system.

I'd love to hear about progress on this, as a number of people have been interested in it!

Thank you for engaging with live-bootstrap and for your interest in it.

Googulator commented 8 months ago

See https://github.com/fosslinux/live-bootstrap/pull/389 for a fix - note that the clean shutdown here doesn't quite work yet, so just use magic-SysRq to shut down from the prompt.
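For reference, one way to issue magic-SysRq actions from a shell prompt is to write the key letters to `/proc/sysrq-trigger`. The sketch below is a hypothetical helper (`sysrq_poweroff` is not part of live-bootstrap) and assumes the running kernel was built with `CONFIG_MAGIC_SYSRQ`, that `/proc` is mounted, and that it is run as root. The trigger path and delay are parameters only so that the function can be tried against an ordinary file first.

```shell
# Hypothetical helper: sync, remount read-only, then power off via SysRq.
# Assumes CONFIG_MAGIC_SYSRQ is enabled and /proc is mounted.
# Usage: sysrq_poweroff [TRIGGER_PATH] [DELAY_SECONDS]
sysrq_poweroff() {
    trigger=${1:-/proc/sysrq-trigger}
    delay=${2:-1}

    # s = sync filesystems, u = remount them read-only, o = power off
    for key in s u o; do
        echo "sending SysRq '$key'"
        echo "$key" > "$trigger" || return 1
        # Give the kernel a moment to act before the next key.
        sleep "$delay"
    done
}

# Harmless trial against a scratch file (hypothetical path):
# sysrq_poweroff /tmp/fake-trigger 0
```

Sending `s` and `u` before `o` is a precaution so that pending writes reach the disk image before the power-off.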

stikonas commented 8 months ago

> Unfortunately there is a wide gulf between the end state of the bootstrap process and having a system capable of hosting a Gentoo prefix.

Do you know what is actually missing for hosting Gentoo? I would think that we have most of the stuff ready and just need to install more Python packages for emerge to work... And probably a few small packages such as wget.

ajherchenroder commented 8 months ago

> See #389 for a fix - note that the clean shutdown here doesn't quite work yet, so just use magic-SysRq to shut down from the prompt.

I saw that; it sounds like a good path forward. It will let me pull out the temporary script I stuck on the end of the bootstrap to call bash.

> Unfortunately there is a wide gulf between the end state of the bootstrap process and having a system capable of hosting a Gentoo prefix.

> Do you know what is actually missing for hosting Gentoo? I would think that we have most of the stuff ready, just need to install more python packages for emerge to work... And probably a few small packages such as wget.

The answer to that question requires me to parse out my goals a little more. The Gentoo prefix setup is just a bash script that automates a tailored version of the old three stage bootstrap process. My goal is to be able to execute the stock prefix script without having to change anything in the script itself. All modifications to things like make.conf need to be able to be made in the bootstrap environment without outside tools.

To configure the bootstrap you need a semi-decent editor (I picked nano). That brings in the requirement for things like ncurses and its requirements. Getting those to build under the old SYS A-C system necessitated rebuilding things like the Linux headers that were not being properly passed through to SYS C. Hopefully with the new layout I can pull those rebuilds out.

Once I had a minimum set of tools to work with, I ran into problems with the prefix script itself. The system is reliant on multi-user privilege separation in order to function properly. That means adding the necessary tools for that (shadow, agetty, libcap, etc.). Even when I broke my own rules and commented out the relevant lockouts to see what would happen, I ran into the next layer of issues. The prefix script really prefers glibc to musl. The absence of glibc pushes us out of RAP mode and into an older methodology for the bootstrap that appears to have been originally intended for things like Solaris. I had kernel crashes and linking issues left and right.

Right now I suspect that the 4.9.10 kernel we are using is too old or not properly configured to work with the prefix. After a couple of months of screwing with it I decided to cut my losses, and I am working on adding a Linux 6.4.12 kernel and a trimmed down sysVinit based on the LFS version to the end of the bootstrap. The intent is to kexec into a new kernel with proper support for users and running daemons, and then run it the way it was intended to be run.

Since I am clearly in over my head, if I get frustrated enough, I will build LFS on top of the bootstrap and run the prefix on that. I know I can make that work. :)

ajherchenroder commented 7 months ago

As far as I’m concerned we can close this issue.