virt-customize installs the kernel, but it's not running the post-transaction scripts

jeremycline commented 5 years ago

The kernel CI test fails early on when it checks the kernel version. I downloaded the qcow2 image, booted it, and found that it's not booting the newly installed kernel. It looks like the kernel's post-transaction script (which runs kernel-install add <kernel version>, generating the initramfs and grub entry) isn't being run.

I used virt-customize locally on the F28 cloud image to install the same kernel and it was properly installed, so it's not immediately obvious to me what's causing it to not happen in the CI environment.

johnbieren commented 5 years ago

@bgoncalv can you take a look at this? I can step in if needed

bgoncalv commented 5 years ago

@johnbieren I might need your help. I tried to reproduce this the way the pipeline does it, but it worked for me.

I ran these steps in a privileged fedora:latest container after installing the required packages...

mkdir /tmp/30423008
cd /tmp/30423008
koji download-task --arch=x86_64 --arch=noarch 30423008
createrepo .
cd

curl -LO http://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20181021.n.0/compose/Cloud/x86_64/images/Fedora-Cloud-Base-Rawhide-20181021.n.0.x86_64.qcow2

LIBGUESTFS_BACKEND=direct virt-copy-in -a Fedora-Cloud-Base-Rawhide-20181021.n.0.x86_64.qcow2 /tmp/30423008 /tmp

LIBGUESTFS_BACKEND=direct virt-customize -v --selinux-relabel --memsize 4096 -a Fedora-Cloud-Base-Rawhide-20181021.n.0.x86_64.qcow2 --run-command "yum install -y --best --allowerasing --nogpgcheck --enablerepo=30423008 --repofrompath=30423008,/tmp/30423008 kernel kernel-core kernel-debug kernel-debug-core kernel-debug-devel kernel-debug-modules kernel-debug-modules-extra kernel-debuginfo-common-x86_64 kernel-devel kernel-modules kernel-modules-extra"

It all worked well. I could see that dracut ran, as the output shows:

yum install -y --best --allowerasing --nogpgcheck --enablerepo=30423008 --repofrompath=30423008,/tmp/30423008 kernel kernel-core kernel-debug kernel-debug-core kernel-debug-devel kernel-debug-modules kernel-debug-modules-extra kernel-debuginfo-common-x86_64 kernel-devel kernel-modules kernel-modules-extra
"
[  242.304267] dracut[31241] No '/dev/log' or 'logger' included for syslog logging
[  242.396189] dracut[31241] Executing: /usr/bin/dracut -f /boot/initramfs-4.19.0-1.fc30.x86_64.img 4.19.0-1.fc30.x86_64
[  242.525169] dracut[31241] dracut module 'modsign' will not be installed, because command 'keyctl' could not be found!
[  242.560393] dracut[31241] dracut module 'busybox' will not be installed, because command 'busybox' could not be found!
[  242.618348] dracut[31241] dracut module 'lvmmerge' will not be installed, because command 'lvm' could not be found!
...

When I boot this image, it boots correctly with 4.19.0-1.fc30.x86_64.

I re-ran the failed build in the pipeline and it failed again with the reported issue. As the console log shows, dracut does not run in the pipeline: https://jenkins-continuous-infra.apps.ci.centos.org/job/fedora-rawhide-build-pipeline/1087/artifact/cloud-image-compose/logs/console.log
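One way to confirm whether dracut ran, without booting the image or scrolling through the console log, is to list /boot in the customized image and look for the initramfs that dracut would have written. The following is a minimal sketch, not what the pipeline does: the helper name has_initramfs is hypothetical, and the file name pattern is taken from the dracut line in the output above.

```shell
#!/bin/sh
# has_initramfs KVER: read a /boot listing on stdin (one file name per line)
# and succeed only if the initramfs for kernel version KVER is present.
# -x matches the whole line, -F treats the dots in the version literally.
has_initramfs() {
    grep -qxF "initramfs-$1.img"
}

# Hypothetical usage against a customized image. It needs the libguestfs
# tools, so it is left commented out here:
#   LIBGUESTFS_BACKEND=direct virt-ls \
#       -a Fedora-Cloud-Base-Rawhide-20181021.n.0.x86_64.qcow2 /boot \
#       | has_initramfs 4.19.0-1.fc30.x86_64 && echo "initramfs present"
```

If the initramfs for the new kernel is missing, that points at kernel-install (and therefore dracut) not having run during the package transaction.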

bgoncalv commented 5 years ago

I just realized the pipeline uses a centos:7 container, so I ran the same steps in it, and again everything worked: the image booted with the correct kernel.

johnbieren commented 5 years ago

I'll check it out hopefully tomorrow

johnbieren commented 5 years ago

@bgoncalv So, I followed your steps inside of the exact OpenShift container that does this. I didn't catch if the dracut lines were there because the output came so quickly, but when I booted the VM:

[root@localhost ~]# rpm -qa kernel
kernel-4.19.0-1.fc30.x86_64
[root@localhost ~]# uname -r
4.19.0-0.rc8.git4.1.fc30.x86_64
[root@localhost ~]#

I don't know much about dracut, but I assume this means it did not run? Do you have any ideas how we can fix this? Since the kernel is a special package, maybe we need an additional virt-customize command after installing it, to reboot or update grub or something? I don't think it would be too bad to add an if statement just for the kernel, given that it is the kernel. What do you think? Unfortunately, I don't know enough about how this works to suggest the best remedy, but it definitely reproduces in the container on OpenShift.
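The check done by hand above (rpm -qa kernel versus uname -r) can be sketched as a small script to run inside the booted guest. This is only an illustration: check_kernel is a hypothetical helper, and sort -V is a rough stand-in for real RPM version comparison.

```shell
#!/bin/sh
# check_kernel INSTALLED RUNNING: report whether the running kernel is the
# one that was installed. Returns 0 on a match, 1 otherwise.
check_kernel() {
    installed=$1
    running=$2
    if [ "$installed" = "$running" ]; then
        echo "match: running $running"
        return 0
    fi
    echo "mismatch: installed $installed, running $running"
    return 1
}

# Hypothetical usage inside the booted guest (commented out because it needs
# rpm). sort -V only approximates RPM's own version ordering:
#   newest=$(rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' kernel | sort -V | tail -n 1)
#   check_kernel "$newest" "$(uname -r)"
```

With the values from the session above, the helper would report a mismatch between 4.19.0-1.fc30.x86_64 and 4.19.0-0.rc8.git4.1.fc30.x86_64.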

jeremycline commented 5 years ago

> I don't know much about dracut, but I assume this means it did not run?

Correct, dracut gets run as part of the kernel-install add script, which is run in the post-transaction scriptlet in the kernel spec file.

I wonder if this problem is actually specific to the kernel, or if no post-transaction scriptlets are being run. That seems a little weird, though.
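To see what the post-transaction scriptlet is supposed to do, the scriptlets can be dumped from the installed package with rpm -q --scripts. The sketch below filters that output down to the posttrans section. The helper name print_posttrans is hypothetical, and kernel-core as the package carrying the scriptlet is an assumption that may not hold for every release.

```shell
#!/bin/sh
# print_posttrans: read `rpm -q --scripts PKG` output on stdin and print only
# the posttrans section (its header line plus the scriptlet body).
# rpm starts each section with a line such as
#   "posttrans scriptlet (using /bin/sh):"  or  "postinstall program: ..."
print_posttrans() {
    awk '
        /^[a-z]+ (scriptlet|program)/ { show = ($1 == "posttrans") }
        show { print }
    '
}

# Hypothetical usage inside the guest (commented out because it needs rpm;
# kernel-core is an assumption about where the scriptlet lives):
#   rpm -q --scripts kernel-core | print_posttrans
```

If the section shows a kernel-install add call but no initramfs exists for that version, the scriptlet either did not run or failed.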

johnbieren commented 5 years ago

I was figuring kernel was the exception since you have to boot a kernel and don't just install a new version and run it immediately (again, I am far from an expert on it).

@bgoncalv Can we do something like: if $package == kernel, run the kernel-install add script after the install? Does that make sense functionally?

jeremycline commented 5 years ago

> I was figuring kernel was the exception since you have to boot a kernel and don't just install a new version and run it immediately (again, I am far from an expert on it).

It works as it should for me locally (not in a container), and also (apparently) in the vanilla centos:7 container. That seems to indicate there's something weird going on due to either OpenShift being involved or something added to the environment here.

I think an if $package == kernel is going to be fragile. What happens when something else is added to the post-transaction scriptlet? In addition to that, I'm concerned that it's just covering up the problem. Why aren't those rpm scripts being run?

As an aside, it'd be nice if there was a way for someone who encounters a problem in a pipeline step to very easily get a reproducer script that creates an identical environment.

bgoncalv commented 5 years ago

I agree, this seems to be a problem with OpenShift. A $package == kernel check could be added for now as a workaround to get the kernel tests running, but I'd like to understand why this is not working as it should on OpenShift.
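For illustration, the $package == kernel workaround discussed above might look roughly like the sketch below. It is not the pipeline's actual code: the function names and the GUEST_CMD variable are hypothetical, and it assumes kernel-core is installed in the image and that the kernel image lives at /lib/modules/<version>/vmlinuz. As noted earlier in the thread, it only papers over the missing scriptlet run.

```shell
#!/bin/sh
# Guest-side command. The version is taken from the RPM database because
# uname -r inside virt-customize reports the libguestfs appliance kernel,
# not the guest's. Assumption: kernel-core is installed and
# /lib/modules/<version>/vmlinuz exists for it.
GUEST_CMD='kver=$(rpm -q --qf "%{VERSION}-%{RELEASE}.%{ARCH}\n" kernel-core | sort -V | tail -n 1) && kernel-install add "$kver" "/lib/modules/$kver/vmlinuz"'

# needs_kernel_install PACKAGE: succeed only for the kernel package.
needs_kernel_install() {
    [ "$1" = kernel ]
}

# apply_workaround PACKAGE IMAGE: rerun kernel-install in the image when the
# package under test is the kernel; do nothing for any other package.
apply_workaround() {
    package=$1
    image=$2
    if needs_kernel_install "$package"; then
        LIBGUESTFS_BACKEND=direct virt-customize -a "$image" --run-command "$GUEST_CMD"
    fi
}

# Hypothetical usage (needs libguestfs tools, so it is commented out):
#   apply_workaround kernel Fedora-Cloud-Base-Rawhide-20181021.n.0.x86_64.qcow2
```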

bgoncalv commented 5 years ago

@jeremycline @johnbieren I'm not sure what has changed, but it seems the kernel now gets installed properly:

https://jenkins-continuous-infra.apps.ci.centos.org/view/Fedora%20All%20Packages%20Pipeline/job/fedora-rawhide-build-pipeline/1703/

CentOS-PaaS-SIG / ci-pipeline

virt-customize installs the kernel, but it's not running the post-transaction scripts #753