Closed treed closed 8 years ago
I just tested and I can do a bare-metal install of 675.0.0, so it's specifically the upgrade.
Is this the first time you've used CoreOS on the R420? I'm willing to bet this is related to your other bug (https://github.com/coreos/bugs/issues/340).
Following my workaround for #340, this was my first working install on this hardware, yes.
Given the workaround (disabling VT in the BIOS), I am able to use the system normally, use coreos-install to put CoreOS on the HD, and then boot into that HD-based install. It comes up just fine and is able to use the disks.
It's certainly possible that this is related to that bug, but I'd rate it as fairly unlikely for a system that has already booted off of the disk in question.
I guess I'd probably start debugging this by enumerating the differences in how coreos-install interacts with the disk compared with how the upgrade process interacts with the disk.
This is still happening as of 735.0.0 -> 738.1.0.
Here's a copy of the update_engine log: http://sprunge.us/eWLf
There are three errors in the log, all in OmahaResponseHandlerAction. I don't know how important those errors might be, but they seem worth pointing out.
I'm kind of at a loss to explain how the upgrade installer can hose the GPT as it does while other disk access works just fine.
@treed: Is the disk controller a RAID card? We've recently gotten two independent reports of the GPT getting corrupted when using add-on RAID cards. We haven't been able to dig into this in depth yet and still need to acquire some problematic hardware ourselves to figure out exactly when the corruption occurs, but one theory is that the BIOS calls to write to the GPT in the bootloader do not work correctly on some devices. (A counter on the USR partition about to be booted is decremented to indicate that a boot was attempted.)
If it is that write in the bootloader that is at fault, it may be worth trying to boot the system in UEFI mode if you are not already, just in case writing via UEFI APIs does work properly. Beyond that I am going to need to get personal with some real hardware to figure out what is going on.
/cc @brianredbeard
@treed also, in at least one other case only the primary GPT was bad, so recovering from the backup worked. Booting to a CoreOS ISO or PXE image and running cgpt repair /dev/sda or similar should do that.
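A cautious way to apply this is to inspect the table first and repair only when cgpt actually flags something. The sketch below assumes the INVALID markers that cgpt show prints for a damaged primary header or table, as seen in the output posted in this thread; the helper name gpt_needs_repair is made up for illustration, and /dev/sda is only an example device.

```shell
# Hypothetical helper: succeeds (exit 0) when the cgpt show output
# supplied on stdin contains a line flagged INVALID.
gpt_needs_repair() {
    grep -q 'INVALID'
}

# Usage from a CoreOS ISO or PXE environment, as root (assumes /dev/sda):
#   if cgpt show /dev/sda | gpt_needs_repair; then
#       cgpt repair /dev/sda
#   fi
```

Running cgpt show again after the repair is a quick way to confirm that the INVALID markers are gone before rebooting.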
It is using a PERC card of some kind. I can look up the exact model tomorrow. I'll also check to see what the BIOS settings are and play around with that some.
If you, or someone else who'd like to, are in the SF Bay Area, I might be able to arrange a coordinated debugging session with the hardware, or similar hardware. Our offices are in Mountain View/SF, and the servers are in Oakland.
@treed We had the same problem, very annoying to debug. After some trying we found a relatively simple workaround that worked for us. CoreOS was installed on multiple machines to /dev/sda, which was a RAID 5 of sometimes ~2.5 TB and sometimes ~5.5 TB in size. Whenever it restarted after an upgrade, we ran into "invalid GPT signature". What solved it for us was changing our RAID config and going with a smaller set for CoreOS (2 disks in RAID 1, less than 1 TB in size). The problem disappeared on all machines.
Okay, so a few results:
The card is a PERC H310 configured with a 5.5TB RAID 5.
Running cgpt repair claimed to fix things:
Primary Header is updated.
Primary Entries is updated.
But attempting to boot after that didn't even make it to GRUB. Now it just sits there after "PXE-M0F: Exiting Broadcom PXE ROM" after being instructed to boot from local disk.
It was previously configured for BIOS over UEFI. I've switched it to UEFI and am trying to test with that configuration.
@MikeRoetgers Thanks for that. If I can't get it working with UEFI I might give that a shot. Unfortunate, though.
@treed ok, at this point BIOS mode may not be working if the MBR boot code also got clobbered. UEFI mode should work though.
nod I'm currently struggling to even get this thing booting via UEFI. :(
I've got ipxe.lkrn being served up via pxe, with embedded instructions to boot CoreOS with a cloud-config that runs coreos-install and then reboots.
I found that I had to use a different syslinux for efi, but now it boots, pulls syslinux.efi and gets an IP and then just... does nothing. Trying to figure out if I need a different ipxe file or something.
treed: since there currently isn't a backup copy of the MBR code in the image (and no grub-install to generate one either) you can try to boot again via MBR by writing this to the disk:
wget https://storage.googleapis.com/users.developer.core-os.net/marineam/mbr.bin
dd if=mbr.bin of=/dev/sda
But if it is legacy bios mode that triggers the bug you'll get stuck again once a new update comes. So trying to boot via UEFI mode is still worth a shot.
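Since that dd overwrites the start of the disk, it may be worth saving the first sectors beforehand. This is a minimal sketch, assuming 512-byte sectors and the layout shown by the cgpt show output in this thread (sector 0, the primary GPT header at sector 1, and 32 sectors of entries starting at sector 2, so 34 sectors in total). The helper name backup_disk_head and the backup file name are hypothetical.

```shell
# Hypothetical helper: copy the first 34 sectors of a disk (or disk image)
# to a backup file. $1 = source device or image, $2 = backup file.
# Assumes 512-byte sectors.
backup_disk_head() {
    dd if="$1" of="$2" bs=512 count=34 2>/dev/null
}

# Usage, as root, before writing mbr.bin:
#   backup_disk_head /dev/sda sda-head.bak
```

The backup only covers the primary structures at the front of the disk; the secondary GPT at the end of the disk is not touched by either command.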
I still haven't been able to get PXE working with UEFI, but I can verify that upgrading works fine if I make the root volume a single 2TB drive.
I am experiencing pretty much the same issue as @treed on a Dell PowerEdge R630 with a PERC H730 Mini (Embedded) Integrated RAID Controller with RAID 0 (don't ask) and a single 3.7 TB disk. No UEFI. PXE boot and install to disk of 766.4.0 works without problems. Update to 766.5.0 (update-engine or manual) breaks it.
sudo cgpt show /dev/sda
       start        size    part  contents
           0           1          Unknown
           1           1          INVALID Pri GPT header
           2          32          INVALID Pri GPT table
  7807959007          32          Sec GPT table
        4096      262144       1  Label: "EFI-SYSTEM"
                                  Type: EFI System Partition
                                  UUID: 826ED773-DC1E-4214-AE22-95F37F00BA41
                                  Attr: Legacy BIOS Bootable
      266240        4096       2  Label: "BIOS-BOOT"
                                  Type: BIOS Boot Partition
                                  UUID: ACA99593-EC92-47C9-B513-E0E323A7D0B2
      270336     2097152       3  Label: "USR-A"
                                  Type: Alias for coreos-rootfs
                                  UUID: 7130C94A-213A-4E5A-8E26-6CCE9662F132
                                  Attr: priority=1 tries=0 successful=1
     2367488     2097152       4  Label: "USR-B"
                                  Type: Alias for coreos-rootfs
                                  UUID: E03DD35C-7C2D-4A47-B3FE-27F15780A57C
                                  Attr: priority=2 tries=1 successful=0
     4464640      262144       6  Label: "OEM"
                                  Type: Alias for linux-data
                                  UUID: 2B61E089-03FF-4CE7-A9DE-06560DD3A323
     4726784      131072       7  Label: "OEM-CONFIG"
                                  Type: CoreOS reserved
                                  UUID: 2AD50847-5108-439A-81FE-4A3EF33977DD
     4857856  7803101151       9  Label: "ROOT"
                                  Type: CoreOS auto-resize
                                  UUID: 2BA5EFDF-B69C-45AC-9DD6-6C55BC5D9941
  7807959039           1          Sec GPT header
WARNING: one of the GPT header/entries is invalid, please run 'cgpt repair'
Booting PXE and repairing doesn't help. Current "workaround" is to disable updates and reboots.
835.8.0 with UEFI works like a charm.
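One number in that cgpt table stands out when set against the reports that volumes of about 2 TB or less upgrade cleanly: the backup GPT header sits at sector 7807959039, which is beyond what a 32-bit sector address can reach. The arithmetic below checks only that. It assumes 512-byte sectors, and whether a 32-bit limit in the BIOS disk calls is the actual cause is a guess that nobody in this thread has confirmed.

```shell
# Last sector of the disk, taken from the "Sec GPT header" row of cgpt show.
last_lba=7807959039
# Largest sector number that fits in a 32-bit LBA.
max_lba32=$(( (1 << 32) - 1 ))
# Assumption: 512-byte logical sectors.
sector_bytes=512

echo "disk size in bytes:    $(( (last_lba + 1) * sector_bytes ))"
echo "32-bit reach in bytes: $(( (max_lba32 + 1) * sector_bytes ))"

if [ "$last_lba" -gt "$max_lba32" ]; then
    echo "backup GPT lies beyond the 32-bit LBA range"
fi
```

The same comparison can be run against any other disk by substituting the start value of its own Sec GPT header row.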
@treed were you ever able to get UEFI working? I'm curious if that is enough to allow larger RAID arrays.
It's been a while but I don't think so. I ended up redoing my RAID so that I had a single disk root volume and a 3-disk RAID-5 for /var/lib/docker
OK. I'm going to close this one since it looks like UEFI works. Feel free to re-open it if you run into trouble again.
I just ran into this issue on a Dell R720 after an upgrade to 899.17.0 (Stable) with a 4.7 TB HW RAID partition and legacy BIOS. This issue doesn't seem to be recoverable. Luckily I was still in testing, so I just rebuilt with UEFI turned on and also repartitioned to have a smaller main volume just in case. I recommend that this be listed in the known issues on the bare-metal install instructions somewhere.
Thanks,
Eric
I installed 668.2.0 to baremetal earlier this week; starting yesterday, it's been trying to automatically upgrade and hosing the install.
After the automatic reboot, GRUB gives three options:
CoreOS default says:
CoreOS USR-A and CoreOS USR-B just give the latter message. This is on a Dell PowerEdge R420. Let me know if you need to know anything else about it.