NixOS / nixpkgs

Nix Packages collection & NixOS

Dynamically-sized EC2 image exceeds output size limit on Hydra #121354

Open lukegb opened 3 years ago

lukegb commented 3 years ago

Describe the bug
For aarch64 (only, it seems), the amazonImage exceeds the output size limit on Hydra: https://hydra.nixos.org/build/142425973

This is channel-blocking.

Notify maintainers @samueldr @grahamc

Metadata
Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
# a list of nixos modules affected by the problem
module:
samueldr commented 3 years ago

Some crumbs from #nixos-infra

[12:15:03] <samueldr> same build
[12:15:03] <samueldr> https://gist.github.com/samueldr/67534945d56489a6747acdbd4223a5be
[12:15:06] <samueldr> ran 20 times
[12:15:22] <samueldr> only difference is a comment in the script with #${toString builtins.currentTime}
[12:15:51] <samueldr> we have three builds that look more "normal" comparing with x86_64 equivalent builds

[12:20:52] <lukegb> samueldr: does the same thing happen with x86_64?
[12:21:08] <samueldr> I didn't run in loop
[12:21:10] <samueldr> but out of 5 local builds
[12:21:21] <samueldr> looks basically the same, few bytes difference in the result
[12:21:31] <samueldr> not half a gigabyte
[12:22:18] <samueldr> so I'm really thinking that something on aarch64 acts just different enough to sometimes cause... weirdness?
[12:24:48] <samueldr> that inconsistency is troubling
[12:29:00] <gchristensen> very
[12:29:00] <samueldr> restarted with discard param
[12:29:39] <samueldr> also, it only resized well 4 out of 20 times, so that's around 80% of the time doing the "weird" thing
[12:29:53] <samueldr> at the very least it's easier to get a good feeling that things are going right

TLDR:

On x86_64 the output of qemu-img (the vhd file) is always "pretty well resized".

On aarch64, about 20% of the time it resizes "well", as we expect, but the rest of the time it barely shrinks at all.

This was tested on the community builder by adding a # ${toString builtins.currentTime} comment in the postVM attribute, i.e. right after the VM has run; normally this shouldn't affect the build in any major way, but it forces a fresh rebuild on every evaluation.
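
For context, the rebuild trick looks roughly like this (a minimal sketch; the surrounding attribute set and the placeholder comment are illustrative, not the actual make-disk-image code):

```nix
{
  # Hypothetical sketch: interpolating builtins.currentTime into postVM
  # changes the derivation on every evaluation, so each nix-build rebuilds
  # the image from scratch while leaving its behaviour unchanged.
  postVM = ''
    # ... the usual post-VM conversion steps (qemu-img etc.) ...
    # ${toString builtins.currentTime}
  '';
}
```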

It has yet to be explored why qemu-img sometimes fails to sparsify the image. It does not look like a filesystem discard problem, since setting the proper discard configuration for the VM and running e2fsck to discard unused blocks didn't help.
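
For reference, the discard attempt might look something like the following (hypothetical commands; the device names and the presence of a partition table are assumptions):

```
# Expose the raw image as a block device; device names are examples only.
losetup --find --show --partscan nixos.img   # prints e.g. /dev/loop0
# Discard unused blocks on the image's root filesystem.
e2fsck -f -E discard /dev/loop0p1
losetup -d /dev/loop0
```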

If I were to look at the issue right now, I would grab a raw disk image that fails to sparsify well and manually run qemu-img on it a couple of times to see whether the results change.
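
A sketch of that manual check (hypothetical file names; the exact qemu-img output options used by the image build are an assumption, though the vpc/VHD target matches the .vhd file mentioned above):

```
# Convert the same raw image twice and compare the resulting VHDs.
qemu-img convert -f raw -O vpc -o subformat=dynamic nixos.img try1.vhd
qemu-img convert -f raw -O vpc -o subformat=dynamic nixos.img try2.vhd

ls -l try1.vhd try2.vhd   # logical file size
du -h try1.vhd try2.vhd   # space actually allocated on disk
qemu-img info try1.vhd    # reported virtual size vs. disk size
```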