axsh / wakame-vdc

Datacenter Hypervisor - Open Source Cloud Computing / IaaS
http://wakame-vdc.org

Investigate failed 1box build #756

Open t-iwano opened 9 years ago

t-iwano commented 9 years ago
17:06:41 /tmp/tmp1445414732/boot/vmlinuz-2.6.32-358.el6.x86_64
17:06:41 /tmp/tmp1445414732/boot/initramfs-2.6.32-358.el6.x86_64.img
17:06:41 [INFO] Generating /tmp/vmbuilder-grub/device.map
17:06:41 (hd0) /tmp/vmbuilder-grub/1box-dummy.netfilter.x86_64.raw
17:06:41 [INFO] Installing grub
17:07:19 
17:07:19 
17:07:19     GNU GRUB  version 0.97  (640K lower / 3072K upper memory)
17:07:19 
17:07:19  [ Minimal BASH-like line editing is supported.  For the first word, TAB
17:07:19    lists possible command completions.  Anywhere else TAB lists the possible
17:07:19    completions of a device/filename.]
17:07:19 grub> root (hd0,0)
17:07:19 
17:07:19 Error 21: Selected disk does not exist
17:07:19 grub> setup (hd0)
17:07:19 
17:07:19 Error 12: Invalid device requested
17:07:19 grub> quit
17:07:19 [INFO] Generating /boot/grub/grub.conf
17:07:19 default=0
17:07:19 timeout=5
17:07:19 splashimage=(hd0,0)/boot/grub/splash.xpm.gz
17:07:19 hiddenmenu
17:07:19 title centos-6.4_x86_64 (2.6.32-358.el6.x86_64)
17:07:19         root (hd0,0)
17:07:19         kernel /boot/vmlinuz-2.6.32-358.el6.x86_64 ro root=LABEL=root rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM selinux=0
17:07:19         initrd /boot/initramfs-2.6.32-358.el6.x86_64.img
17:07:19 umount: /tmp/tmp1445414732//tmp/vmbuilder-grub/1box-dummy.netfilter.x86_64.raw: not mounted
17:07:19 [DEBUG] Unmounting /tmp/tmp1445414732/proc
17:07:19 [DEBUG] Unmounting /tmp/tmp1445414732/dev
17:07:19 [DEBUG] Unmounting /tmp/tmp1445414732/sys
17:07:19 [DEBUG] Unmounting /tmp/tmp1445414732
17:07:20 [DEBUG] trap_vm fired
17:07:21 [WARN] still mapped: /var/lib/jenkins/workspace/dummy.1box/1box-dummy.netfilter.x86_64.raw (disk.sh:470)
17:07:21 [DEBUG] Removing parted old map with 'dmsetup remove loop0p1'
17:07:24 
17:07:24 real   1m52.312s
17:07:24 user   0m33.260s
17:07:24 sys    0m10.657s
17:07:24 make: *** [dummy64.netfilter] Error 1
17:07:24 Makefile:63: recipe for target 'dummy64.netfilter' failed
17:07:24 Build step 'Execute managed script' marked build as failure
17:07:25 Finished: FAILURE
t-iwano commented 9 years ago

The target disk cannot be found when running grub install.

t-iwano commented 9 years ago

Relevant code:

https://github.com/axsh/vmbuilder/blob/master/kvm/rhel/6/functions/distro.sh#L1116-L1120

t-iwano commented 9 years ago
[root@ct68 dummy.1box]# kpartx -va 1box-dummy.netfilter.x86_64.raw
add map loop0p1 (253:0): 0 41940930 linear /dev/loop0 63
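
For reference, the kpartx line above is a device-mapper table entry: start sector, length in sectors, target type, backing device, and offset in sectors, all in 512-byte units. A minimal sketch that converts such a line to byte values follows; the function name `kpartx_map_bytes` is made up for illustration and is not part of vmbuilder.

```shell
# Hypothetical helper: convert one "add map ..." line printed by
# `kpartx -va` into the partition's byte offset and byte size.
kpartx_map_bytes() {
  # Split the line on whitespace into positional parameters:
  # $3 = map name, $6 = length in sectors, $9 = offset in sectors.
  set -- $1
  echo "$3 offset=$(( $9 * 512 )) size=$(( $6 * 512 ))"
}

kpartx_map_bytes 'add map loop0p1 (253:0): 0 41940930 linear /dev/loop0 63'
```

For the line in this comment, the partition starts 63 sectors (32256 bytes) into the raw image.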
t-iwano commented 9 years ago
[root@ct68 dummy.1box]# mount /dev/mapper/loop0p1 /tmp/tmp1445414732
[root@ct68 dummy.1box]# mount --bind /proc /tmp/tmp1445414732/proc
[root@ct68 dummy.1box]# mount --bind /dev /tmp/tmp1445414732/dev
[root@ct68 dummy.1box]# mount --bind /sys /tmp/tmp1445414732/sys
t-iwano commented 9 years ago
[root@ct68 dummy.1box]# ls -la /tmp/tmp1445414732/tmp/vmbuilder-grub/device.map
-rw-r--r-- 1 root root 58 Oct 21 17:06 /tmp/tmp1445414732/tmp/vmbuilder-grub/device.map
[root@ct68 dummy.1box]# cat /tmp/tmp1445414732/tmp/vmbuilder-grub/device.map
(hd0) /tmp/vmbuilder-grub/1box-dummy.netfilter.x86_64.raw
t-iwano commented 9 years ago
[root@ct68 dummy.1box]# ls -la /tmp/tmp1445414732/boot/grub/
total 408
drwxr-xr-x 2 root root   4096 Oct 21 17:07 .
dr-xr-xr-x 4 root root   4096 Oct 21 17:06 ..
-rw-r--r-- 1 root root  13380 Feb 22  2013 e2fs_stage1_5
-rw-r--r-- 1 root root  12620 Feb 22  2013 fat_stage1_5
-rw-r--r-- 1 root root  11748 Feb 22  2013 ffs_stage1_5
-rw-r--r-- 1 root root    408 Oct 21 17:07 grub.conf
-rw-r--r-- 1 root root  11756 Feb 22  2013 iso9660_stage1_5
-rw-r--r-- 1 root root  13268 Feb 22  2013 jfs_stage1_5
lrwxrwxrwx 1 root root      9 Oct 21 17:07 menu.lst -> grub.conf
-rw-r--r-- 1 root root  11956 Feb 22  2013 minix_stage1_5
-rw-r--r-- 1 root root  14412 Feb 22  2013 reiserfs_stage1_5
-rw-r--r-- 1 root root   1341 Nov 15  2010 splash.xpm.gz
-rw-r--r-- 1 root root    512 Feb 22  2013 stage1
-rw-r--r-- 1 root root 125992 Feb 22  2013 stage2
-rw-r--r-- 1 root root 125992 Feb 22  2013 stage2_eltorito
-rw-r--r-- 1 root root  12024 Feb 22  2013 ufs2_stage1_5
-rw-r--r-- 1 root root  11364 Feb 22  2013 vstafs_stage1_5
-rw-r--r-- 1 root root  13964 Feb 22  2013 xfs_stage1_5
t-iwano commented 9 years ago
[root@ct68 dummy.1box]# ls -la /tmp/tmp1445414732/usr/share/grub/x86_64-redhat/
total 400
drwxr-xr-x 2 root root   4096 Aug 22 20:38 .
drwxr-xr-x 3 root root   4096 Aug 22 20:38 ..
-rw-r--r-- 1 root root  13380 Feb 22  2013 e2fs_stage1_5
-rw-r--r-- 1 root root  12620 Feb 22  2013 fat_stage1_5
-rw-r--r-- 1 root root  11748 Feb 22  2013 ffs_stage1_5
-rw-r--r-- 1 root root  11756 Feb 22  2013 iso9660_stage1_5
-rw-r--r-- 1 root root  13268 Feb 22  2013 jfs_stage1_5
-rw-r--r-- 1 root root  11956 Feb 22  2013 minix_stage1_5
-rw-r--r-- 1 root root  14412 Feb 22  2013 reiserfs_stage1_5
-rw-r--r-- 1 root root    512 Feb 22  2013 stage1
-rw-r--r-- 1 root root 125992 Feb 22  2013 stage2
-rw-r--r-- 1 root root 125992 Feb 22  2013 stage2_eltorito
-rw-r--r-- 1 root root  12024 Feb 22  2013 ufs2_stage1_5
-rw-r--r-- 1 root root  11364 Feb 22  2013 vstafs_stage1_5
-rw-r--r-- 1 root root  13964 Feb 22  2013 xfs_stage1_5
t-iwano commented 9 years ago
[root@ct68 dummy.1box]# mount --bind 1box-dummy.netfilter.x86_64.raw /tmp/tmp1445414732/tmp/vmbuilder-grub/1box-dummy.netfilter.x86_64.raw
t-iwano commented 9 years ago
[root@ct68 dummy.1box]# chroot /tmp/tmp1445414732 bash -e -c grub --batch --device-map=/tmp/tmp1445414732/tmp/vmbuilder-grub/device.map
Probing devices to guess BIOS drives. This may take a long time.

    GNU GRUB  version 0.97  (640K lower / 3072K upper memory)

 [ Minimal BASH-like line editing is supported.  For the first word, TAB
   lists possible command completions.  Anywhere else TAB lists the possible
   completions of a device/filename.]
grub> root (hd0,0)
root (hd0,0)

Error 21: Selected disk does not exist
grub>
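
A side note on the manual reproduction above: because the command after `bash -e -c` is not quoted, bash most likely treated only `grub` as the command string and took `--batch` and `--device-map=...` as positional parameters. That would explain the "Probing devices" message, which suggests grub was not using a device map. The sketch below only demonstrates the quoting behaviour with `echo`; it does not run grub.

```shell
# Unquoted: bash -c receives only "echo" as the command string.
# "one" and "two" become $0 and $1 and are never printed.
unquoted=$(bash -c echo one two)

# Quoted: the whole string is the command.
quoted=$(bash -c 'echo one two')

printf 'unquoted=[%s] quoted=[%s]\n' "$unquoted" "$quoted"
```

If this reading is right, the grub command line would need to be passed to `bash -c` as one quoted string. The device map path would presumably also need to be given as seen from inside the chroot (`/tmp/vmbuilder-grub/device.map`).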
t-iwano commented 9 years ago

The bind mount of 1box-dummy.netfilter.x86_64.raw does not seem to be working.

t-iwano commented 9 years ago

Do we need to refactor the function here? https://github.com/axsh/vmbuilder/blob/master/kvm/rhel/6/functions/distro.sh#L1047-L1129

triggers commented 8 years ago

This line is not working, but returns no error: https://github.com/axsh/vmbuilder/blob/master/kvm/rhel/6/functions/distro.sh#L1061

     mount --bind ${disk_filename} ${chroot_dir}/${new_filename}
[root@ct68 dummy.1box]# ls -l /var/lib/jenkins/workspace/dummy.1box/1box-dummy.netfilter.x86_64.raw /tmp/tmp1447664384//tmp/vmbuilder-grub/1box-dummy.netfilter.x86_64.raw
-rw-r--r-- 1 root root           0 Nov 16 18:00 /tmp/tmp1447664384//tmp/vmbuilder-grub/1box-dummy.netfilter.x86_64.raw
-rw-r--r-- 1 root root 21474836480 Nov 16 18:01 /var/lib/jenkins/workspace/dummy.1box/1box-dummy.netfilter.x86_64.raw

[root@ct68 dummy.1box]# mount --bind /var/lib/jenkins/workspace/dummy.1box/1box-dummy.netfilter.x86_64.raw /tmp/tmp1447664384//tmp/vmbuilder-grub/1box-dummy.netfilter.x86_64.raw
[root@ct68 dummy.1box]# echo $?
0

[root@ct68 dummy.1box]# ls -l /var/lib/jenkins/workspace/dummy.1box/1box-dummy.netfilter.x86_64.raw /tmp/tmp1447664384//tmp/vmbuilder-grub/1box-dummy.netfilter.x86_64.raw
-rw-r--r-- 1 root root           0 Nov 16 18:00 /tmp/tmp1447664384//tmp/vmbuilder-grub/1box-dummy.netfilter.x86_64.raw
-rw-r--r-- 1 root root 21474836480 Nov 16 18:01 /var/lib/jenkins/workspace/dummy.1box/1box-dummy.netfilter.x86_64.raw
[root@ct68 dummy.1box]# 
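
Since `mount --bind` exits 0 here even though the target still shows the empty placeholder, a build script could check the result itself instead of trusting the exit code. A minimal sketch, assuming a POSIX-style shell whose `test` supports `-ef` (bash and dash do); the function names are invented for illustration and do not exist in vmbuilder.

```shell
# Hypothetical check: after a bind mount, source and target should
# resolve to the same device and inode.
same_file() {
  [ "$1" -ef "$2" ]
}

# Report whether the bind mount from $1 onto $2 is actually in effect.
verify_bind() {
  if same_file "$1" "$2"; then
    echo "bind mount active: $2"
  else
    echo "bind mount NOT active: $2" >&2
    return 1
  fi
}
```

Calling `verify_bind "${disk_filename}" "${chroot_dir}/${new_filename}"` right after the mount line would make this failure visible at the point where it happens, rather than later as a grub error.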
triggers commented 8 years ago

It is possible for mount --bind to work inside LXC:

/ssh:lxc32: #$ cd testbindmount/
/ssh:lxc32: #$ date >ddd
/ssh:lxc32: #$ touch ddcopy 
/ssh:lxc32: #$ ls -l
total 4
-rw-rw-r-- 1 sysope sysope  0 Nov 17 04:06 ddcopy
-rw-rw-r-- 1 sysope sysope 29 Nov 17 04:06 ddd
/ssh:lxc32: #$ sudo mount --bind ddd ddcopy 
/ssh:lxc32: #$ ls -l
total 8
-rw-rw-r-- 1 sysope sysope 29 Nov 17 04:06 ddcopy
-rw-rw-r-- 1 sysope sysope 29 Nov 17 04:06 ddd
/ssh:lxc32: #$ cat /etc/centos-release 
CentOS release 6.7 (Final)

So why is it not working on 2.68?

triggers commented 8 years ago

"mount --bind" works in a new container created with https://github.com/wakameci/wakame-ci-cluster/blob/master/lxc-hosts/bootstrap-fedora-22.sh

But if the container is restarted with https://github.com/wakameci/wakame-ci-cluster/blob/master/lxc-hosts/lxc-stop.sh and https://github.com/wakameci/wakame-ci-cluster/blob/master/lxc-hosts/lxc-start.sh, then "mount --bind" stops working.

What is different between the two states? So far the most interesting thing is that polkitd is only running before stop/start. It is related to authorizations, so this difference could potentially be related. Still investigating....

triggers commented 8 years ago

Should have thought of this: mount does not return an error because it does work. The problem is that something immediately unmounts the "--bind" mount (!):

/ssh:p26j: #$ mount --bind f80 mp ; echo $? ; ls -l mp ; sleep 1 ; ls -l mp
0
-rw-r--r-- 1 root root 29 Nov 18 15:17 mp
-rw-r--r-- 1 root root 0 Nov 18 19:35 mp
/ssh:p26j: #$ 
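
Because the mount succeeds and then disappears about a second later, a single check right after `mount` is not enough. Here is a rough sketch of a check that polls the kernel's mount table a few times. It assumes the mount point is given as a canonical absolute path (no `//`), since that is how it appears in field 5 of `/proc/self/mountinfo`. The function names and the optional file argument are illustrative only.

```shell
# Succeed if the given mount point appears in a mountinfo-format file
# (field 5 is the mount point). Defaults to the live mount table.
mount_listed() {
  awk -v mp="$1" '$5 == mp { found = 1 } END { exit !found }' \
    "${2:-/proc/self/mountinfo}"
}

# Check the mount point $2 times, one second apart; fail as soon as
# it is no longer listed.
mount_stays() {
  local mp="$1" checks="$2" info="${3:-/proc/self/mountinfo}" i=1
  while [ "$i" -le "$checks" ]; do
    if ! mount_listed "$mp" "$info"; then
      echo "mount at $mp missing on check $i" >&2
      return 1
    fi
    [ "$i" -lt "$checks" ] && sleep 1
    i=$((i + 1))
  done
  return 0
}
```

For example, `mount --bind f80 mp && mount_stays "$PWD/mp" 3` would have failed on this host, whereas `echo $?` directly after the mount printed 0.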
triggers commented 8 years ago

Starting polkitd had no effect, BTW.

triggers commented 8 years ago

Overview of debugging path and clues: (I may fill in more detail later)

Traced through by hand and found that mount --bind did not work.
Googled for lxc, mount, etc.
Noticed by luck that a fresh install worked.
Compared a fresh install before and after stop/start; only a few obvious differences.
Tried activating polkitd; it did not help.
Noticed by chaining commands (by luck again) that it really did mount!
Started googling for lxc umount.
Looked at the process list for processes that might be doing the unmount.
Proved that it was systemd by stopping all its processes:
Googled for systemd and mount; found Arch Linux bug report -> systemd bug report -> patch.
Ran yum update systemd; could argue that the patch was not in it, but hard to know.
Decided that going to Fedora 20 or CentOS 6 would soon be the best alternative.
Out of curiosity, looked at the strace output of /sbin/init (systemd):

For example:

stat("/run", {st_mode=S_IFDIR|0755, st_size=560, ...}) = 0
stat("/run/mount/utab", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
stat("/dev/sda1", 0x7fff3be10fc0)       = -1 ENOENT (No such file or directory)
stat("/dev/sda1", 0x7fff3be10fc0)       = -1 ENOENT (No such file or directory)
open("/etc/systemd/system/home-triggers-mp.mount", O_RDONLY|O_NOCTTY|O_NOFOLLOW|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/run/systemd/system/home-triggers-mp.mount", O_RDONLY|O_NOCTTY|O_NOFOLLOW|O_CLOEXEC) = -1 ENOENT (No such file or directory)
Guessed that adding /dev/sda1 with 'mknod /dev/sda1 -m 660 b 8 1' might work. It did.
Put the fix in https://github.com/wakameci/wakame-ci-cluster/blob/master/lxc-hosts/lxc-device.sh
Did stop/start on all 192.168.2.26 containers.
Still don't know why it worked after a fresh bootstrap-fedora-22.sh build (it had no /dev/sda1).
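
The `home-triggers-mp.mount` name in the strace output is the systemd mount unit name derived from the mount point, presumably /home/triggers/mp. Below is a simplified sketch of that mapping. It only handles paths whose components are plain alphanumerics; real systemd escaping also encodes characters such as `-`. The function name is made up for illustration.

```shell
# Simplified mapping from a mount point path to a systemd mount unit
# name: drop the leading and trailing "/", turn the remaining "/"
# into "-", and append ".mount". The root directory maps to "-.mount".
mount_unit_name() {
  local p="${1#/}"
  p="${p%/}"
  if [ -z "$p" ]; then
    echo "-.mount"
    return
  fi
  echo "$(printf '%s' "$p" | tr '/' '-').mount"
}

mount_unit_name /home/triggers/mp
```

On hosts with a recent enough systemd, `systemd-escape --path --suffix=mount /home/triggers/mp` does the full escaping and is the safer choice.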
triggers commented 8 years ago

Current patch:

diff --git a/lxc-hosts/lxc-device.sh b/lxc-hosts/lxc-device.sh
index c23afbf..3a2822e 100755
--- a/lxc-hosts/lxc-device.sh
+++ b/lxc-hosts/lxc-device.sh
@@ -50,6 +50,14 @@ lxc-attach -n ${ctid} -- bash -ex <<-EOS
   [[ -c /dev/ptmx ]] || mknod -m 666 /dev/ptmx c 5 2
 EOS

+# ASSUMES THAT /DEV/SDA IS THE CORRECT DEVICE (true for 192.168.2.26)
+lxc-attach -n ${ctid} -- bash -ex <<-EOS
+  [[ -b /dev/sda ]] || mknod /dev/sda -m 660 b 8 0
+  [[ -b /dev/sda1 ]] || mknod /dev/sda1 -m 660 b 8 1
+  [[ -b /dev/sdd ]] || mknod /dev/sdd -m 660 b 8 48
+  [[ -b /dev/sde ]] || mknod /dev/sde -m 660 b 8 64
+EOS
+
 # /dev/loopX and /dev/dm-X
 for i in {0..127}; do
 lxc-attach -n ${ctid} -- bash -ex <<-EOS

Next step is to push this to github.com/wakameci/wakame-ci-cluster/.

Also want to do a little sanity checking to make sure this makes sense. For example, there really is not any /dev/sda1 inside the container.

Also would be good to understand why a fresh build worked, even though /dev/sda1 did not exist. Maybe systemd was started with a different configuration the first time? Not sure.
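
As a small sanity check on the numbers in the patch: SCSI disks use block major 8, and each whole disk gets a block of 16 minors (the disk itself plus partitions 1 to 15). So sda is 8:0, sda1 is 8:1, sdd is 8:48 and sde is 8:64, which matches the mknod lines. This sketch computes the pair from the device name. It only covers sda through sdp, and the function name is invented for illustration.

```shell
# Print "major minor" for a SCSI disk or partition name such as
# sda, sda1 or sdd. Only sda..sdp (major 8) are handled.
sd_devnum() {
  local name="$1" letter part index
  case "$name" in
    sd[a-p]|sd[a-p][1-9]|sd[a-p]1[0-5]) ;;
    *) echo "unsupported device name: $name" >&2; return 1 ;;
  esac
  letter=$(printf '%s' "$name" | cut -c3)
  part=$(printf '%s' "$name" | cut -c4-)
  # printf with a leading quote yields the character code ("a" is 97).
  index=$(( $(printf '%d' "'$letter") - 97 ))
  echo "8 $(( index * 16 + ${part:-0} ))"
}

sd_devnum sdd
```

Reading `/sys/class/block/sda1/dev` on the host would be another way to get the real major:minor pair rather than assuming the naming scheme.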

triggers commented 8 years ago

https://wiki.archlinux.org/index.php/Linux_Containers#Systemd_considerations_.28required.29