coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/
263 stars 60 forks source link

Ignition fails to wipe disk if partition table is corrupted #1596

Closed edcdavid closed 11 months ago

edcdavid commented 11 months ago

Bug coreos/ignition#1728

Operating System Version

Fedora CoreOS 38 (38.20230722.3.0)

Ignition Version

Ignition 2.16.2

Environment

What hardware/cloud provider/hypervisor is being used to run Ignition? CoreOS running on libvirt VM

Expected Behavior

When "wipe_table: true" is specified the disk device should be wiped even if its partition table is corrupted

Actual Behavior

Ignition fails to wipe disk with sgdisk error: [ 35.521754] ignition[855]: disks: createPartitions: op(2): op(3): [failed] deleting 0 partitions and creating 0 partitions on "/run/ignition/dev_aliases/dev/vdb": exit status 2: Cmd: "sgdisk" "--zap-all" "/run/ignition/dev_aliases/dev/vdb" Stdout: "GPT data structures destroyed! You may now partition the disk using fdisk or\nother utilities.\n" Stderr: "\aCaution: invalid main GPT header, but valid backup; regenerating main header\nfrom backup!\n\n\aWarning: Invalid CRC on main header data; loaded backup partition table.\nWarning! Main and backup partition tables differ! Use the 'c' and 'e' options\non the recovery & transformation menu to examine the two tables.\n\n\aWarning! Main partition table CRC mismatch! Loaded backup partition table\ninstead of main partition table!\n\nWarning! One or more CRCs don't match. You should repair the disk!\nMain header: ERROR\nBackup header: OK\nMain partition table: ERROR\nBackup partition table: OK\n\nInvalid partition data!\n"

Reproduction Steps

  1. Create a disk with corrupted partition table, for instance: sgdisk --new=2:0:0 -i=2 /dev/vdb dd if=/dev/urandom of=/dev/vdb bs=512 count=3
  2. create a VM using this disk. Use ignition to wipe disk (wipe_table: true) and create a partition:
    storage:
    disks:
    - device: /dev/vdb
      wipe_table: true
      partitions:
        - label: data
          number: 1
          size_mib: 1024
          start_mib: 0

Other Information

Seem to be due to sgdisk returning an error when the disk partition table is corrupted. In this case ignition seems to returns an error on commit fail: https://github.com/coreos/ignition/blob/14151500635e8a0b020897fb2fd68706d8056fbb/internal/exec/stages/disks/partitions.go#L327 Should we ignore/log this error instead of returning?

travier commented 11 months ago

Initially filed in Ignition but moving to the FCOS tracker as this is about FCOS.

jlebon commented 11 months ago

Hmm, that seems like a behavioural bug in sgdisk. That said, it looks like wipefs -at gpt also works, so we could consider switching to that. I'd rather we try to get sgdisk fixed though.

edcdavid commented 11 months ago

After looking in sgdisk repo, it looks like sgdisk is always returning exit code 2 when the partition table cannot be read, see https://sourceforge.net/p/gptfdisk/code/ci/master/tree/gptcl.cc#l490. So the call to wipe partitions is called in two different places:

imho, sgdisk should only fail if the function wiping the GPT/MBR is returning an error, right now it is only printing to stderr, for example here: https://sourceforge.net/p/gptfdisk/code/ci/master/tree/gpt.cc#l1529

edcdavid commented 11 months ago

The issue seems to have been introduced by this PR to fix a linter error: https://github.com/coreos/ignition/pull/1149 The previous behavior was to assume disk wipe is always successful. If actually not successful, the error would be caught by the next sgdisk calls to create partitions, which also calls the same function to discover partitions. Why not just revert to the old behavior and whitelist the linter warning? I created the following PR for sgdisk but I am afraid this new logic will break other tools: https://sourceforge.net/p/gptfdisk/code/merge-requests/35/

jlebon commented 11 months ago

The issue seems to have been introduced by this PR to fix a linter error: coreos/ignition#1149 The previous behavior was to assume disk wipe is always successful. If actually not successful, the error would be caught by the next sgdisk calls to create partitions, which also calls the same function to discover partitions. Why not just revert to the old behavior and whitelist the linter warning?

Nice find! Doing some git investigations let me to https://github.com/coreos/ignition/pull/544#discussion_r189388532. Aha! We did use to retry this operation in the past, but it was lost during that PR because it wasn't clear why it was added in the first place (and tests still passed). Hmm, was it perhaps for a similar issue as here? Digging further reveals that yes, that's exactly it: https://github.com/coreos/ignition/pull/162.

Restored this logic in https://github.com/coreos/ignition/pull/1735, now with a comment and a test to make sure we don't regress on this again.

I created the following PR for sgdisk but I am afraid this new logic will break other tools: sourceforge.net/p/gptfdisk/code/merge-requests/35

Thanks! If this merges, we would in theory be able to eventually drop the retry logic in Ignition.

edcdavid commented 11 months ago

Great, sounds good! Thanks for the fix and test.