crc-org / vfkit

Apache License 2.0

Cannot use kernel newer than kernel-5.18.11-200.fc36 #11

Open cfergeau opened 1 year ago

cfergeau commented 1 year ago

https://koji.fedoraproject.org/koji/buildinfo?buildID=2000811 has been used in the latest crc's podman bundle, and this kernel was working fine. However, I've been unable to boot anything newer than this on my Mac M1 using Code-Hex/vz; vfkit would have the same issue. I tried kernel-5.18.13-200.fc36, https://koji.fedoraproject.org/koji/buildinfo?buildID=2020964, several newer 5.19.x versions, kernel-6.0.0-54.fc38, ... and none of these seemed to be able to boot :(

cfergeau commented 1 year ago

Might be similar to https://github.com/Code-Hex/vz/issues/51#issuecomment-1258348997

cfergeau commented 1 year ago

The description is inaccurate: the 4.2.0 podman bundle actually uses /Users/teuf/.crc/cache/crc_podman_vfkit_4.2.0_arm64/vmlinuz-5.18.18-200.fc36.aarch64, and this version works correctly. I need to understand what's going on!

cfergeau commented 1 year ago

5.19 kernels don't seem to boot with vfkit/Code-Hex/vz. I tried the 5.19.4 and 5.19.13 Fedora kernels. 5.18 kernels have been OK in my testing.

cfergeau commented 1 year ago

x86_64 is similarly impacted. I'm trying to get early logs from the virtualization framework, but it's not that easy :-/

cfergeau commented 1 year ago

Did some more testing. I tried https://github.com/evansm7/vftool, which also uses Apple's virtualization framework but is implemented in Objective-C. It fails in the same way. One thing I had not noticed earlier is that with these failing kernels, the virtual machine reports an error state. I haven't found details about what caused this error state, though. I also tried Debian kernels, and both kernels I tried (5.10.0-18 and 6.0.0-1) failed with VZVirtualMachineStateError.

I tried qemu, and I was able to boot all the kernels I tried, for example:

qemu-system-aarch64 -nographic -append "console=ttyAMA0 debug" -cpu max -accel hvf --machine virt -kernel ~/dev/beaker-kernels/vmlinux-6.0.0-54.fc38.aarch6 -initrd ~/dev/beaker-kernels/initramfs-6.0.0-54.fc38.aarch64.img -m 1024

cfergeau commented 1 year ago

On x86_64, 5.19 kernels from f36 boot fine. 6.0 f36 kernels fail to boot: the VM goes into an error state.

cfergeau commented 1 year ago

I filed https://bugzilla.redhat.com/show_bug.cgi?id=2137803

cfergeau commented 1 year ago

I tried Ubuntu 6.0 kernels from https://launchpad.net/~tuxinvader on an x86_64 MacBook, and they fail with the same problem, both on macOS 11 and 12. I also rebuilt Fedora 5.19 and 6.1 kernels on a RHEL 8 machine and tested them on the x86_64 macOS 11 MacBook: 5.19 works, and 6.1 fails in the same way.

Last but not least, I upgraded the M1 machine from macOS 12 to 13, and 5.19+ kernels are now working fine! The latest macOS 12 release (12.6.1) still had the issue.

cfergeau commented 1 year ago

I managed to find the problematic commit(s) for kernel 5.19 / 6.x. This was tested on my macOS 11 / x86_64 MacBook.

Early 5.19 kernels fail to boot until https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4b1c742407571eff58b6de9881889f7ca7c4b4dc is applied (5.19.6 and newer should be fine).

6.x kernels need https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6cd514e58f12b211d638dbf6f791fa18d854f09c to be reverted, and then they work fine (only tested on x86_64 so far).
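
The version boundaries above can be captured in a small triage helper. This is a sketch under stated assumptions, not something vfkit provides: `classify_kernel` and the `VZ_X86_*` values are hypothetical names, it assumes Fedora-style release strings, and it only encodes what was bisected here on macOS 11/12 x86_64 hosts. Treating every 6.x and later kernel as needing the revert is an extrapolation from the 6.0 and 6.1 results.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical status values for the two x86_64 regressions bisected
 * above. They only describe macOS 11/12 hosts: 6.x kernels were
 * reported to boot fine on macOS 13. */
enum vz_x86_status {
    VZ_X86_PARSE_ERROR,
    VZ_X86_NOT_AFFECTED,         /* neither bisected commit applies */
    VZ_X86_NEEDS_4B1C742,        /* 5.19.0 to 5.19.5: fix not yet included */
    VZ_X86_NEEDS_6CD514E_REVERT, /* 6.x and newer (assumed): revert 6cd514e58f12 */
};

/* Classify a release string such as "5.19.13-200.fc36.x86_64". */
enum vz_x86_status classify_kernel(const char *release)
{
    int major = 0, minor = 0, patch = 0;

    assert(release != NULL);
    /* sscanf stops at the first character that does not match, so a
     * distro suffix like "-200.fc36.x86_64" is ignored and "5.19-rc1"
     * parses as 5.19 with patch level 0. */
    int n = sscanf(release, "%d.%d.%d", &major, &minor, &patch);
    if (n < 2)
        return VZ_X86_PARSE_ERROR;

    if (major >= 6)
        return VZ_X86_NEEDS_6CD514E_REVERT;
    if (major == 5 && minor == 19 && patch < 6)
        return VZ_X86_NEEDS_4B1C742;
    return VZ_X86_NOT_AFFECTED;
}
```

For example, `classify_kernel("5.19.4-200.fc36.x86_64")` returns `VZ_X86_NEEDS_4B1C742`, while `classify_kernel("5.19.13-200.fc36.x86_64")` returns `VZ_X86_NOT_AFFECTED`. "Not affected" only means neither bisected commit is involved; other problems (such as the separate M1 failure) may still apply.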

cfergeau commented 1 year ago

I mentioned the breakage in https://bugzilla.kernel.org/show_bug.cgi?id=215989, and it was forwarded to LKML: https://lkml.org/lkml/2022/11/4/780

cfergeau commented 1 year ago

The M1 boot failure is apparently a different problem, as a kernel patched as described above still does not boot there.

cfergeau commented 1 year ago

I've bisected the M1 failure to commit 5e64b862c482 (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5e64b862c4823ab53aac028042abd918c2f27041). With just the change below, I can boot the latest Fedora kernel on a macOS 12 M1:

diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c
index 28d4f442b0bc..a17c876696ee 100644
--- a/arch/arm64/kernel/cpuinfo.c
+++ b/arch/arm64/kernel/cpuinfo.c
@@ -432,7 +432,9 @@ static void __cpuinfo_store_cpu(struct cpuinfo_arm64 *info)
        info->reg_id_aa64pfr0 = read_cpuid(ID_AA64PFR0_EL1);
        info->reg_id_aa64pfr1 = read_cpuid(ID_AA64PFR1_EL1);
        info->reg_id_aa64zfr0 = read_cpuid(ID_AA64ZFR0_EL1);
+#if 0
        info->reg_id_aa64smfr0 = read_cpuid(ID_AA64SMFR0_EL1);
+#endif

        if (id_aa64pfr1_mte(info->reg_id_aa64pfr1))
                info->reg_gmid = read_cpuid(GMID_EL1);

cfergeau commented 1 year ago

I filed an issue with Apple for the M1 problem to ask them if they can fix this for macOS 12.

cfergeau commented 1 year ago

On Intel with macOS 13, I can also successfully boot 6.x kernels from Fedora!

cfergeau commented 1 year ago

I've sent http://lists.infradead.org/pipermail/linux-arm-kernel/2022-November/788096.html about this.

cfergeau commented 1 year ago

Judging by http://lists.infradead.org/pipermail/linux-arm-kernel/2022-November/788450.html, a kernel workaround for the M1 issue is going to be a hard sell. Let's wait and hope that Apple says something about this problem.