NixOS / nixpkgs

Nix Packages collection & NixOS

`azure-image` doesn't build with nixos-unstable-small #39581

Closed colemickens closed 6 years ago

colemickens commented 6 years ago

Issue description

I can't get `azure-image` to build with a branch of nixos-unstable-small. The build hangs right after printing `starting stage 2 (/nix/store/lp6sd4r785qhy77pxvfbyvfhk0cp5739-vm-run-stage2)`.

Steps to reproduce

git clone git@github.com:colemickens/nixpkgs.git
cd nixpkgs
git checkout bugreport
# note there's only one commit, bumping the version of qemu-img, this shouldn't impact the image build
cd nixos/maintainers/scripts/azure
./create-azure.sh

Output:

...
copying path '/nix/store/4vlpqqlw76r2yms47pra6m4dysv666gv-unit-polkit.service' to 'local'...
copying path '/nix/store/rbb67vh2llp8nkx9mr23l3cbw96lczz6-unit-dbus.service' to 'local'...
copying path '/nix/store/i2ks6i9k91gw1cgv56ag3081dxzabbrm-unit-systemd-fsck-.service' to 'local'...
copying path '/nix/store/9hc671d6sawnhrjpnm11csvk412aymzk-user-units' to 'local'...
copying path '/nix/store/p22imb8wql19hhsbf9ajw3sc91zsssgs-system-units' to 'local'...
copying path '/nix/store/69g0a9l3wqd2dabm08a89mdwllrdaxsz-etc' to 'local'...
copying path '/nix/store/761gcizvigvl8dm04jwaap56l9jbxmni-nixos-system-unnamed-18.09.git.03991bb' to 'local'...
copying channel...
installation finished!
copying staging root to image...
[    0.000000] Linux version 4.15.0 (nixbld@) (gcc version 7.3.0 (GCC)) #1 Tue Apr 10 04:20:30 UTC 2018
[    0.000000] bootmem address range: 0x7fffe9c00000 - 0x7fffeffff000
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 25249
[    0.000000] Kernel command line: mem=100M virtio_mmio.device=292@0x1000000:1
[    0.000000] Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
[    0.000000] Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
[    0.000000] Memory available: 100760k/102396k RAM
[    0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] NR_IRQS: 4096
[    0.000000] lkl: irqs initialized
[    0.000000] clocksource: lkl: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.000000] lkl: time and timers initialized (irq2)
[    0.000005] pid_max: default: 4096 minimum: 301
[    0.000033] Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
[    0.000039] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes)
[    0.006030] console [lkl_console0] enabled
[    0.006064] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    0.006071] xor: automatically using best checksumming function   8regs
[    0.006189] random: get_random_u32 called from bucket_table_alloc+0x80/0x2d0 with crng_init=0
[    0.006260] NET: Registered protocol family 16
[    0.172122] raid6: int64x1  gen()  3437 MB/s
[    0.344114] raid6: int64x1  xor()  1927 MB/s
[    0.516077] raid6: int64x2  gen()  4026 MB/s
[    0.691532] raid6: int64x2  xor()  1770 MB/s
[    0.888004] raid6: int64x4  gen()  1116 MB/s
[    1.073002] raid6: int64x4  xor()   899 MB/s
[    1.247861] raid6: int64x8  gen()  2005 MB/s
[    1.338777] random: fast init done
[    1.419744] raid6: int64x8  xor()  2342 MB/s
[    1.419758] raid6: using algorithm int64x2 gen() 4026 MB/s
[    1.419763] raid6: .... xor() 1770 MB/s, rmw enabled
[    1.419794] raid6: using intx1 recovery algorithm
[    1.419976] clocksource: Switched to clocksource lkl
[    1.420146] NET: Registered protocol family 2
[    1.420365] TCP established hash table entries: 1024 (order: 1, 8192 bytes)
[    1.420426] TCP bind hash table entries: 1024 (order: 1, 8192 bytes)
[    1.420443] TCP: Hash tables configured (established 1024 bind 1024)
[    1.420574] UDP hash table entries: 128 (order: 0, 4096 bytes)
[    1.420593] UDP-Lite hash table entries: 128 (order: 0, 4096 bytes)
[    1.420720] virtio-mmio: Registering device virtio-mmio.0 at 0x1000000-0x1000123, IRQ 1.
[    1.421313] workingset: timestamp_bits=62 max_order=15 bucket_order=0
[    1.422581] SGI XFS with ACLs, security attributes, no debug enabled
[    1.428708] jitterentropy: Initialization failed with host not compliant with requirements: 2
[    1.428732] io scheduler noop registered
[    1.428737] io scheduler deadline registered
[    1.428863] io scheduler cfq registered (default)
[    1.428875] io scheduler mq-deadline registered
[    1.428879] io scheduler kyber registered
[    1.428904] virtio-mmio virtio-mmio.0: Failed to enable 64-bit or 32-bit DMA.  Trying to continue, but this might not work.
[    1.433736]  vda: vda1
[    1.434121] NET: Registered protocol family 10
[    1.434655] Segment Routing with IPv6
[    1.434689] sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
[    1.435247] Btrfs loaded, crc32c=crc32c-generic
[    1.435499] Warning: unable to open an initial console.
[    1.435534] This architecture does not have kernel memory protection.
[    1.438122] EXT4-fs (vda1): mounted filesystem with ordered data mode. Opts:
[    3.219318] random: crng init done
[   12.394270] reboot: Restarting system
WARNING: Image format was not specified for 'nixos.raw' and probing guessed raw.
         Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
         Specify the 'raw' format explicitly to remove the restrictions.
SeaBIOS (version rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org)

iPXE (http://ipxe.org) 00:03.0 C980 PCI2.10 PnP PMM+3FF912F0+3FEF12F0 C980

Booting from ROM...
Probing EDD (edd=off to disable)... ok
loading kernel modules...
mounting Nix store...
mounting host's temporary directory...
starting stage 2 (/nix/store/lp6sd4r785qhy77pxvfbyvfhk0cp5739-vm-run-stage2)
^Cerror: interrupted by the user

FULL OUTPUT: https://gist.github.com/colemickens/ee6030926248ee2d0d305c4fd120506d

Technical details

(the above should cover it)

Please run nix-shell -p nix-info --run "nix-info -m" and paste the results.

 - system: `"x86_64-linux"`
 - host os: `Linux 4.14.36, NixOS, 18.09.git.fa39203 (Jellyfish)`
 - multi-user?: `yes`
 - sandbox: `no`
 - version: `nix-env (Nix) 2.0.1`
 - channels(root): `"nixos"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos/nixpkgs`

(I'm not sure this will be applicable, since the ./create-azure.sh script references its own nixpkgs enlistment via a local path...)

colemickens commented 6 years ago

cc: @copumpkin I think you refactored some stuff in make-disk-image.nix, probably since the last time anyone tried to build an Azure image. Do you have any pointers?

cc: @jbgi I saw you make an azure-image PR, have you seen this? What branch of nixpkgs do you build with?

colemickens commented 6 years ago

It seems that switch_root hangs? I added a bit of set -x to the vm-run-stage{1,2} scripts:

...
+ echo 'mounting Nix store...'
mounting Nix store...
+ mkdir -p /fs/nix/store
+ mount -t 9p store /fs/nix/store -o 'trans=virtio,version=9p2000.L,cache=loose'
+ mkdir -p /fs/tmp /fs/run /fs/var
+ mount -t tmpfs -o 'mode=1777' none /fs/tmp
+ mount -t tmpfs -o 'mode=755' none /fs/run
+ ln -sfn /run /fs/var/run
+ echo 'mounting host'"'"'s temporary directory...'
mounting host's temporary directory...
+ mkdir -p /fs/tmp/xchg
+ mount -t 9p xchg /fs/tmp/xchg -o 'trans=virtio,version=9p2000.L,cache=loose'
+ mkdir -p /fs/proc
+ mount -t proc none /fs/proc
+ mkdir -p /fs/sys
+ mount -t sysfs none /fs/sys
+ mkdir -p /fs/etc
+ ln -sf /proc/mounts /fs/etc/mtab
+ echo '127.0.0.1 localhost'
+ echo 'starting stage 2 (/nix/store/w4gk2prvisgmg4f0vr69ci8adxbwl1h0-vm-run-stage2)'
starting stage 2 (/nix/store/w4gk2prvisgmg4f0vr69ci8adxbwl1h0-vm-run-stage2)
+ exec switch_root /fs /nix/store/w4gk2prvisgmg4f0vr69ci8adxbwl1h0-vm-run-stage2 /nix/store/1z347yrqzf9kj00gp553pdyafx5p940p-azure-image

[hangs]

copumpkin commented 6 years ago

I think @edolstra basically rewrote most of the image building code since that refactor

colemickens commented 6 years ago

It looks like this is 9pfs-related. A similar thread: https://lists.gt.net/linux/kernel/2599597

jbgi commented 6 years ago

@colemickens I did not try to build an image. I originally just used the existing 16.09 image and upgraded it from time to time to get the 18.03-based image we use now.

colemickens commented 6 years ago

@jbgi thanks for the information.

It seems that I can build images on one of my local machines. Must be an oddity of running qemu in an Azure VM or something?

I don't know whether to leave this open or closed. I'll leave it up to someone else.

dezgeg commented 6 years ago

@colemickens I think all my patches about 9pfs hangs are merged in recent kernels, that shouldn't be the problem.

dezgeg commented 6 years ago

> It seems that I can build images on one of my local machines. Must be an oddity of running qemu in an Azure VM or something?

That's almost certainly it. Nested hardware accelerated virtualization seems just to be horribly broken in Linux.
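
For anyone trying to tell whether a build machine is in this situation, here is a rough sketch of a check. It assumes an x86 Linux host where guests see the `hypervisor` CPU flag in `/proc/cpuinfo`; the function name `looks_like_nested_kvm` is made up for illustration and is not part of nixpkgs. A positive result only suggests that KVM inside this machine would be nested; it does not prove that nesting is the cause of a hang.

```shell
# Hypothetical helper (not from nixpkgs): succeeds when the machine appears
# to be a VM that also exposes a KVM device, i.e. KVM here would be nested.
# Both paths are parameters so the check can be run against test files.
looks_like_nested_kvm() {
  cpuinfo="${1:-/proc/cpuinfo}"
  kvm_dev="${2:-/dev/kvm}"
  # Most x86 hypervisors set the "hypervisor" CPU flag for their guests.
  grep -m1 '^flags' "$cpuinfo" | grep -qw hypervisor || return 1
  # A KVM device node inside a guest implies nested virtualization.
  [ -e "$kvm_dev" ]
}
```

Usage would be something like `looks_like_nested_kvm && echo "KVM here is nested"`.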

colemickens commented 6 years ago

This had worked in the past, but I think the VM I was building on yesterday is one of the SKUs that specifically has nested virtualization enabled.

I may end up patching the qemu invocation to disable KVM. It will probably be net faster based on how slow my upload is. Thanks for mentioning it, I hadn't really thought about it.
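
A minimal sketch of what falling back to software emulation could look like, assuming a QEMU new enough to accept `-accel` (older versions use `-machine accel=tcg`). The helper `pick_accel` and the `FORCE_TCG` variable are hypothetical names for illustration; this is not how the nixpkgs VM builder actually selects its flags.

```shell
# Hypothetical helper: print the QEMU accelerator to use.
# FORCE_TCG=1 forces software emulation even when a KVM device is usable.
pick_accel() {
  kvm_dev="${1:-/dev/kvm}"
  if [ "${FORCE_TCG:-0}" = "1" ]; then
    echo tcg
  elif [ -r "$kvm_dev" ] && [ -w "$kvm_dev" ]; then
    echo kvm
  else
    echo tcg
  fi
}

# Example use (the remaining qemu arguments are elided here):
#   qemu-system-x86_64 -accel "$(pick_accel)" ...
```

TCG is much slower than KVM, so this only pays off when the alternative is a hang or a slow upload from another machine.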

copumpkin commented 6 years ago

Yeah, I've also found that nested virtualization has basically stopped working entirely for me in the past year or so.

colemickens commented 6 years ago

I think this actually came down to me using a commit from nixos-unstable-small. I had been building images for Azure for a few weeks on an Azure VM without issue, and without having to take steps to disable nested virtualization. Must've been something weird with qemu/kvm/etc at a single point in time... Either way, closing this out.