coreos / bugs

Issue tracker for CoreOS Container Linux
https://coreos.com/os/eol/

Bare-Metal Block Device Failures HP #2390

Open jordy25519 opened 6 years ago

jordy25519 commented 6 years ago

Issue Report

Bug

Block device operations are failing with the following errors in the journal logs:

NMI: PCI system error (SERR) for reason b1 on CPU 0.
Mar 27 20:53:20 localhost kernel: Dazed and confused, but trying to continue
Mar 27 20:53:20 localhost kernel: DMAR: DRHD: handling fault status reg 2
Mar 27 20:53:20 localhost kernel: DMAR: [DMA Read] Request device [04:00.0] fault addr ffdda000 [fault reason 06] PTE Read access is not set

However, lsblk still lists the disks:

loop0    7:0    0 295.7M  0 loop /usr
sdb      8:16   0 279.4G  0 disk 
sdc      8:32   0 279.4G  0 disk 
`-sdc1   8:33   0 279.4G  0 part 
sdd      8:48   0 558.9G  0 disk 
`-sdd1   8:49   0 558.9G  0 part 
sde      8:64   0 931.5G  0 disk 
`-sde1   8:65   0 931.5G  0 part 

I suspect it may be related to the hardware or the RAID array drivers. I was able to successfully install Ubuntu on this server with all disks in operation.
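
To narrow down which piece of hardware is involved, the DMAR fault line in the journal output above can be parsed for the PCI address of the requesting device. The following is a minimal sketch in Python, assuming the fault lines use exactly the format shown in this report (other kernel versions may print additional fields). The names `DMAR_FAULT` and `parse_dmar_faults` are hypothetical helpers, not part of any existing tool.

```python
import re

# Matches the DMAR fault line format seen in this report only; this is an
# assumption, since other kernel versions may word the message differently.
DMAR_FAULT = re.compile(
    r"DMAR: \[DMA (?P<op>Read|Write)\] "
    r"Request device \[(?P<bdf>[0-9a-fA-F:.]+)\] "
    r"fault addr (?P<addr>[0-9a-fA-F]+) "
    r"\[fault reason (?P<reason>[0-9a-fA-F]+)\] (?P<text>.*)"
)


def parse_dmar_faults(journal_text):
    """Return one dict per DMAR fault line found in journal_text."""
    faults = []
    for line in journal_text.splitlines():
        match = DMAR_FAULT.search(line)
        if match:
            faults.append(match.groupdict())
    return faults
```

The `bdf` value (here `04:00.0`) can then be passed to `lspci -s` to see which device issued the faulting DMA request, for example to check whether it is the RAID controller.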

Container Linux Version

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1632.3.0
VERSION_ID=1632.3.0
BUILD_ID=2018-02-14-0338
PRETTY_NAME="Container Linux by CoreOS 1632.3.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

Bare Metal, HP ProLiant Server (DL380G6)

Expected Behavior

Block devices are mapped and accessible.

Reproduction Steps

PXE boot a bare-metal HP DL380 G6 server using Matchbox.

mwthink commented 6 years ago

I'm currently experiencing this issue and have been debugging it for a while now. I'm pretty sure I've narrowed it down to the RAID controller firmware. By any chance, are you using a P410 card for your disks?

Also, can you check the output of fdisk -l? For me, the devices appear in lsblk, but they don't show up in fdisk -l.
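
The lsblk versus fdisk -l comparison can be sketched as a small Python helper that works on saved command output. This is only an illustration: it assumes the default lsblk column layout shown in the original report (NAME, MAJ:MIN, RM, SIZE, RO, TYPE, MOUNTPOINT) and the usual `Disk /dev/<name>: ...` header lines from fdisk -l. The function names are hypothetical.

```python
import re


def lsblk_disks(lsblk_text):
    """Names of rows whose TYPE column is 'disk' in default lsblk output."""
    disks = []
    for line in lsblk_text.splitlines():
        fields = line.split()
        # Assumed columns: NAME MAJ:MIN RM SIZE RO TYPE [MOUNTPOINT]
        if len(fields) >= 6 and fields[5] == "disk":
            disks.append(fields[0])
    return disks


def fdisk_disks(fdisk_text):
    """Device names taken from 'Disk /dev/<name>: ...' lines of fdisk -l."""
    return re.findall(r"^Disk /dev/(\S+?):", fdisk_text, flags=re.MULTILINE)


def missing_from_fdisk(lsblk_text, fdisk_text):
    """Disks that lsblk lists but fdisk -l does not report."""
    seen = set(fdisk_disks(fdisk_text))
    return [disk for disk in lsblk_disks(lsblk_text) if disk not in seen]
```

Feeding it the captured output of both commands gives the list of disks the kernel has registered but fdisk cannot read, which is the symptom described here.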

jordy25519 commented 6 years ago

I believe it was a 410i card. As far as I remember, fdisk -l showed no disks. I've since left the company, so I can't be of much more help. In the past I've been able to install CoreOS by trying older versions and channels at random and then updating to the current release once the install succeeded.