coreos / bugs

Issue tracker for CoreOS Container Linux
https://coreos.com/os/eol/

Intel microcodes rev 0x2d cause frequent crashes #2412

Open DrMurx opened 6 years ago

DrMurx commented 6 years ago

Issue Report

Bug

Since I upgraded to the CoreOS release that includes the new Intel microcode rev 0x2d, which is supposed to mitigate Spectre & Meltdown, my server reboots occasionally. Apparently, the issues that led Intel to revoke the first iteration of those microcode updates in January still exist on older CPUs.

Using the kernel parameters noibrs noibpb seems to help a bit and reduces the frequency of the reboots, but it doesn't eliminate them completely.

Therefore I would prefer to stick with the previous microcode version. Is there a simple way to do this?

Container Linux Version

$ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1722.2.0
VERSION_ID=1722.2.0
BUILD_ID=2018-03-29-0338
PRETTY_NAME="Container Linux by CoreOS 1722.2.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

Expected Behavior

No crashes :)

Actual Behavior

Server reboots occasionally.

Reproduction Steps

-/-

Other Information

bgilbert commented 6 years ago

You can disable boot-time microcode updates by adding the following line to /usr/share/oem/grub.cfg:

set linux_append="$linux_append dis_ucode_ldr"

DrMurx commented 6 years ago

@bgilbert Thanks for that hint. I'll give it a try.

Apart from disabling all microcode updates, is there a way to load the previous microcode release? I didn't dive too deep into early microcode loading in the kernel/GRUB, but it must somehow be able to deal with multiple microcode blobs, if only to support both Intel and AMD, so maybe it supports multiple releases too?

bgilbert commented 6 years ago

The microcode loader doesn't natively support multiple releases. Because of the search order and the fact that we build the microcode directly into the kernel, there's no way to replace it at runtime other than by rebuilding the kernel.

lorenz commented 6 years ago

Assuming that this is a Sandy Bridge CPU, this would be very bad. 0x2d is the production-tested version recommended by Intel.

DrMurx commented 6 years ago

I've been using dis_ucode_ldr for the past 12 days, falling back to the microcode version provided by the BIOS (which is 0x28). The system runs smoothly without crashes. Evidently the microcode version recommended by Intel still has issues.
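
For anyone wanting to confirm the same state on their own machine, here is a minimal sketch of the two checks involved: whether dis_ucode_ldr actually reached the kernel command line, and which microcode revision the running CPU reports. The helper names ucode_rev and has_param are made up for illustration and are not part of Container Linux. The sketch assumes an x86 Linux system where /proc/cpuinfo has a "microcode" field.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from CoreOS tooling).

# Print the microcode revision of the first CPU listed in a
# cpuinfo-style file (defaults to /proc/cpuinfo).
ucode_rev() {
    awk -F': *' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Succeed if the given parameter appears as a whole word on the
# kernel command line (defaults to /proc/cmdline).
has_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qx -- "$1"
}

# Example use after rebooting with the grub.cfg change:
#   has_param dis_ucode_ldr && echo "early microcode loader disabled"
#   ucode_rev
```

Both helpers accept an optional file argument so they can be tried against saved copies of /proc/cpuinfo and /proc/cmdline from another host.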

lorenz commented 6 years ago

I sadly don't have a Sandy Bridge system to test on, but if that's the case for all Sandy Bridge CPUs we should probably outright pull it from CoreOS and not just offer more versions.

DrMurx commented 6 years ago

@lorenz I've now pulled it and moved my workload to a newer machine. I've cancelled the old box for the 13th of May; until then I could give you access if you want to conduct some tests. Just drop me an email.