NixOS / nixpkgs

Nix Packages collection & NixOS

Dynamically-sized EC2 image exceeds output size limit on Hydra #121354

Open lukegb opened 3 years ago

lukegb commented 3 years ago

Describe the bug
For aarch64 (only, it seems), the amazonImage exceeds the output size limit on Hydra: https://hydra.nixos.org/build/142425973

This is channel-blocking.

Notify maintainers @samueldr @grahamc

Metadata
Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
# a list of nixos modules affected by the problem
module:
samueldr commented 3 years ago

Some crumbs from #nixos-infra

[12:15:03] <samueldr> same build
[12:15:03] <samueldr> https://gist.github.com/samueldr/67534945d56489a6747acdbd4223a5be
[12:15:06] <samueldr> ran 20 times
[12:15:22] <samueldr> only difference is a comment in the script with #${toString builtins.currentTime}
[12:15:51] <samueldr> we have three builds that look more "normal" comparing with x86_64 equivalent builds

[12:20:52] <lukegb> samueldr: does the same thing happen with x86_64?
[12:21:08] <samueldr> I didn't run in loop
[12:21:10] <samueldr> but out of 5 local builds
[12:21:21] <samueldr> looks basically the same, few bytes difference in the result
[12:21:31] <samueldr> not half a gigabyte
[12:22:18] <samueldr> so I'm really thinking that something on aarch64 acts just different enough to sometimes cause... weirdness?
[12:24:48] <samueldr> that inconsistency is troubling
[12:29:00] <gchristensen> very
[12:29:00] <samueldr> restarted with discard param
[12:29:39] <samueldr> also, it only resized well 4 out of 20 times, so that's around 80% of the time doing the "weird" thing
[12:29:53] <samueldr> at the very least it's easier to get a good feeling that things are going right

TLDR:

On x86_64 the output of qemu-img (the vhd file) is always "pretty well resized".

On aarch64, about 20% of the time it resizes "well", as we expect, but the rest of the time it barely shrinks at all.

This was tested on the community builder by adding a # ${toString builtins.currentTime} comment in the postVM attribute, i.e. right after the VM has run; normally this shouldn't affect the build in any major way, but it forces a fresh rebuild on every evaluation.
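
For context, the rebuild trick looks roughly like this (a minimal sketch; the surrounding attribute set and the placeholder comment are illustrative, not the actual make-disk-image code):

```nix
{
  # Hypothetical sketch: interpolating builtins.currentTime into postVM
  # changes the derivation on every evaluation, so each nix-build rebuilds
  # the image from scratch while leaving its behaviour unchanged.
  postVM = ''
    # ... the usual post-VM conversion steps (qemu-img etc.) ...
    # ${toString builtins.currentTime}
  '';
}
```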

It has yet to be explored why qemu-img sometimes fails to sparsify the image. It does not look like a filesystem discard problem, since setting the proper discard configuration for the VM and running e2fsck to discard unused blocks didn't help.
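
For reference, the discard attempt might look something like the following (hypothetical commands; the device names and the presence of a partition table are assumptions):

```
# Expose the raw image as a block device; device names are examples only.
losetup --find --show --partscan nixos.img   # prints e.g. /dev/loop0
# Discard unused blocks on the image's root filesystem.
e2fsck -f -E discard /dev/loop0p1
losetup -d /dev/loop0
```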

If I were to look at the issue right now, I would grab a raw disk image that fails to sparsify well and manually run qemu-img on it a couple of times to see whether the results change.
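
A sketch of that manual check (hypothetical file names; the exact qemu-img output options used by the image build are an assumption, though the vpc/VHD target matches the .vhd file mentioned above):

```
# Convert the same raw image twice and compare the resulting VHDs.
qemu-img convert -f raw -O vpc -o subformat=dynamic nixos.img try1.vhd
qemu-img convert -f raw -O vpc -o subformat=dynamic nixos.img try2.vhd

ls -l try1.vhd try2.vhd   # logical file size
du -h try1.vhd try2.vhd   # space actually allocated on disk
qemu-img info try1.vhd    # reported virtual size vs. disk size
```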