Closed dongsupark closed 1 year ago
This is not intermittent to me. All my arm64 instances fail on stable. They run fine on beta.
@dongsupark if the AMI is corrupt we should take it down
@jepio As describe in the comment, the corrupt AMI was set to private
.
I've seen this corruption on stable 3227.2.2 arm64 AMIs in us-east-2 in the last few days. I'd suspect the issue affects the whole set, rather than just one AMI (since they're regional)
I've traced it down to AWS EC2 imports when the VMDK format is used. With the plain format everything works but after conversion to VMDK the same image becomes a corrupted AMI while locally it's still ok to boot the file with QEMU. As workaround I've prepared a change that will create the Flatcar AMIs from plain image uploads: https://github.com/flatcar/mantle/pull/391
For the record, instead of going through vmdk-convert
as we do now I've also tried to use qemu-img
directly to create the streamOptimized VMDK but it didn't help (qemu-img convert -O vmdk -o subformat=streamOptimized,adapter_type=lsilogic flatcar_production_ami_image.bin flatcar_production_ami_image.vmdk
).
For a while now (Oct 31) we switched to using the raw flatcar_production_ami_image.bin.bz2
image instead of VMDK and this worked around the issue. We tried to reach out to AWS but need to do so again to get the broken VMDK handling resolved.
Workaround is in place. Have not seen the issue recently.
My report to AWS didn't get acted on, so maybe we can just warn users about using the AMI images because we still publish them on the release server for download and mention them in the docs.
Description
During the recent release process, we encountered an unknown issue that
os/kola/aws
failed only with AWS arm64 of Stable 3227.2.2.Console log of the Kola test says:
Tried rerunning the specific kola tests, no luck. Tried rerunning the whole vm-matrix to regenerate the AMIs, and running kola tests again. Still no luck. It is obviously not possible to manually initiate an EC2 instance from the problematic AMI.
Impact
AWS kola tests for arm64 cannot run at all.
Environment and steps to reproduce
There is no simple way to reproduce this issue. That issue happens only in the specific case, not in other channels, not in other archs. We have seen a similar issue in this year, but not in Stable, not arm64.