flatcar / Flatcar

Flatcar project repository for issue tracking, project documentation, etc.
https://www.flatcar.org/
Apache License 2.0

Intermittent AWS AMI corruptions #840

Closed dongsupark closed 1 year ago

dongsupark commented 2 years ago

Description

During the recent release process, we encountered an unknown issue: os/kola/aws failed, and only for AWS arm64 of Stable 3227.2.2.

Console log of the Kola test says:

[    4.245983] systemd-fsck[680]: ROOT contains a file system with errors, check forced.
ROOT: fsck 0.0% complete...
[    4.340402] device-mapper: verity: sha256 using implementation "sha256-ce"
ROOT: fsck 81.4% complete...
[    4.316076] systemd-fsck[680]: ROOT: Directory inode 7252, block #0, offset 0: directory corrupted
[    4.317332] systemd-fsck[680]: ROOT: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
[    4.317526] systemd-fsck[680]: (i.e., without -a or -p options)
[    4.324259] systemd-fsck[674]: fsck failed with exit status 4.
[FAILED] Failed to start File Syste…ck on /dev/disk/by-label/ROOT.

Tried rerunning the specific kola tests, no luck. Tried rerunning the whole vm-matrix to regenerate the AMIs and running the kola tests again; still no luck. Naturally, it is also not possible to manually launch an EC2 instance from the problematic AMI.
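
Roughly, re-running the failing tests by hand against a given AMI looks like the following; the AMI ID, region, instance type, and test pattern are placeholders, and the kola flag names may vary between mantle versions:

kola run --platform=aws --board=arm64-usr \
  --aws-ami=ami-0123456789abcdef0 --aws-region=us-east-2 \
  --aws-type=m6g.large 'cl.basic'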

Impact

AWS kola tests for arm64 cannot run at all.

Environment and steps to reproduce

There is no simple way to reproduce this issue. It happens only in this specific case: not in other channels and not on other architectures. We have seen a similar issue earlier this year, but not in Stable and not on arm64.
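
One local sanity check is to attach the raw image to a loop device and run a read-only fsck on the ROOT partition; if that comes back clean while the AMI still fails, the corruption is being introduced somewhere in the upload/import path. The partition number is an assumption about the Flatcar disk layout:

bunzip2 -k flatcar_production_ami_image.bin.bz2
LOOP=$(sudo losetup -fP --show flatcar_production_ami_image.bin)
sudo e2fsck -n "${LOOP}p9"    # ROOT is expected to be partition 9 (assumption)
sudo losetup -d "$LOOP"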

sdlarsen commented 2 years ago

This is not intermittent for me. All my arm64 instances fail on stable; they run fine on beta.

jepio commented 2 years ago

@dongsupark if the AMI is corrupt we should take it down

dongsupark commented 2 years ago

@jepio As described in the comment, the corrupt AMI was set to private.

dghubble commented 2 years ago

I've seen this corruption on stable 3227.2.2 arm64 AMIs in us-east-2 in the last few days. I'd suspect the issue affects the whole set rather than just one AMI (since they're regional).
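
Since AMI IDs differ per region, something like the following can list the candidates in one region for spot checks; the name filter is a guess at Flatcar's AMI naming and may need adjusting:

aws ec2 describe-images --region us-east-2 \
  --filters "Name=name,Values=Flatcar-stable-3227.2.2*" "Name=architecture,Values=arm64" \
  --query 'Images[].[ImageId,Name,CreationDate]' --output table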

pothos commented 2 years ago

I've traced it down to AWS EC2 imports when the VMDK format is used. With the plain format everything works, but after conversion to VMDK the same image becomes a corrupted AMI, while locally the file still boots fine with QEMU. As a workaround I've prepared a change that will create the Flatcar AMIs from plain image uploads: https://github.com/flatcar/mantle/pull/391
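
The mantle change does this inside the release tooling; as a rough standalone sketch of the same mechanism (bucket, key, snapshot ID, and image name are placeholders), a plain-format import through the EC2 VM import API looks roughly like this:

aws s3 cp flatcar_production_ami_image.bin s3://example-upload-bucket/flatcar.bin
aws ec2 import-snapshot --disk-container \
  "Format=RAW,UserBucket={S3Bucket=example-upload-bucket,S3Key=flatcar.bin}"
# poll `aws ec2 describe-import-snapshot-tasks` until the snapshot is ready, then:
aws ec2 register-image --name flatcar-arm64-test --architecture arm64 \
  --virtualization-type hvm --ena-support --root-device-name /dev/xvda \
  --boot-mode uefi \
  --block-device-mappings "DeviceName=/dev/xvda,Ebs={SnapshotId=snap-0123456789abcdef0}"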

pothos commented 2 years ago

For the record: instead of going through vmdk-convert as we do now, I've also tried using qemu-img directly to create the streamOptimized VMDK, but it didn't help (qemu-img convert -O vmdk -o subformat=streamOptimized,adapter_type=lsilogic flatcar_production_ami_image.bin flatcar_production_ami_image.vmdk).
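
One way to double-check such a conversion locally is to let qemu-img compare the VMDK against the raw source; if the contents are identical on disk, that again points at the corruption happening on the AWS import side rather than during conversion:

qemu-img compare -f raw -F vmdk flatcar_production_ami_image.bin flatcar_production_ami_image.vmdk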

pothos commented 1 year ago

Since Oct 31 we have been using the raw flatcar_production_ami_image.bin.bz2 image instead of the VMDK, and this works around the issue. We tried to reach out to AWS but need to do so again to get the broken VMDK handling resolved.

dongsupark commented 1 year ago

The workaround is in place. We have not seen the issue recently.

pothos commented 1 year ago

My report to AWS wasn't acted on, so maybe we should just warn users about using the AMI images, since we still publish them on the release server for download and mention them in the docs.

dongsupark commented 1 year ago

PR https://github.com/flatcar/flatcar-docs/pull/334