cockpit-project / cockpit-machines

Cockpit UI for virtual machines
GNU Lesser General Public License v2.1

coredump in c10s testing farm #1900

Closed by jelly 6 days ago

jelly commented 2 weeks ago

See for example this PR and check the console log:

[  267.171160] coredump: 8261(browser.sh): Unsafe core_pattern used with fs.suid_dumpable=2: pipe handler or fully qualified core dump path required. Set kernel.core_pattern before fs.suid_dumpable.

https://artifacts.dev.testing-farm.io/2790ec3f-7a63-4dcf-b98d-b62519ce604c/work-storagerwkkep52/console-97fac4f4-dcc6-4fb3-a90b-6c420c4f0a6a.log
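The kernel message above says that with fs.suid_dumpable=2 the core pattern must be a pipe handler or a fully qualified path. A minimal sketch for checking this on a reserved VM follows; the helper names `core_pattern_is_safe` and `show_coredump_settings` are made up for illustration and are not part of any existing tool.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when a core_pattern value is acceptable
# together with fs.suid_dumpable=2, i.e. it is a pipe handler ("|...")
# or an absolute path ("/...").
core_pattern_is_safe() {
    case "$1" in
        '|'*|/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: print both live settings (read-only, changes nothing).
show_coredump_settings() {
    sysctl kernel.core_pattern fs.suid_dumpable
}
```

On the test VM this could be used as `core_pattern_is_safe "$(sysctl -n kernel.core_pattern)" || echo "unsafe core_pattern"`, after `show_coredump_settings` to see the current values.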

To debug this, remember that we can reserve a test VM with the testing-farm utility.

jelly commented 1 week ago

The most recent test runs have not shown a coredump, but we still see the "ssh connection closed" issue during a run of the storage tests.

To try to reproduce the issue, I reserved a VM and ran:

dnf install -y podman cockpit-system cockpit-ws cockpit-bridge cockpit-machines virt-install dbus-tools firewalld libvirt-daemon-driver-storage-iscsi libvirt-daemon-driver-storage-logical git
git clone https://github.com/cockpit-project/cockpit-machines.git
cd cockpit-machines

mkdir -p /root/.ssh
curl https://raw.githubusercontent.com/cockpit-project/bots/main/machine/identity.pub >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys

useradd -c Administrator -G wheel admin
echo admin:foobar | chpasswd
echo root:foobar | chpasswd
su -c 'echo foobar | sudo --stdin whoami' - admin

podman pull ghcr.io/cockpit-project/tasks:2024-10-07
podman run --rm --shm-size=1024m --security-opt=label=disable --network=host --volume=/data:/logs:rw,U --env=LOGS=/logs --volume="$(pwd)":/source:rw,U --env=SOURCE=/source --volume=/usr/lib/os-release:/run/host/usr/lib/os-release:ro -ti ghcr.io/cockpit-project/tasks:2024-10-07 bash

TEST_OS=centos-10 TEST_BROWSER=firefox ./test/check-machines-disks -vst TestMachinesDisks.testDisks --machine localhost:22 --browser localhost:9090

To run the tests I have to start virtnetworkd and virtstoraged manually, as for some reason the test does not start them.
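Starting the two daemons by hand could look roughly like the sketch below. It assumes a systemd host where the unit names match the daemon names given above; the function name `start_libvirt_daemons` is hypothetical, and this is not necessarily the exact command that was run.

```shell
#!/bin/sh
# The two modular libvirt daemons named in the comment above.
LIBVIRT_DAEMONS="virtnetworkd virtstoraged"

# Hypothetical helper: start each daemon unless systemd already reports it active.
start_libvirt_daemons() {
    for daemon in $LIBVIRT_DAEMONS; do
        if ! systemctl is-active --quiet "$daemon"; then
            echo "starting $daemon"
            systemctl start "$daemon"
        fi
    done
}
```

Calling `start_libvirt_daemons` as root on the test VM, before running the test, would be the intended use.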

I attempted this three times, and each time the connection was cleanly closed at a random point during the run. I did not record the uptime of the VM.

The fourth time, I started a tmux session that ran curl --head against my own website, to see whether the whole machine dies or only sshd. When my ssh connection was cut, the curl HEAD requests also stopped, so the machine really goes offline. We also can't ping these machines from the outside (as we ssh into them via a jumphost).
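A heartbeat loop along these lines would also capture the VM uptime that was not recorded in the earlier attempts. This is only a sketch: the function name `heartbeat` and its arguments are invented here, and the URL is a placeholder for whatever web server is being watched.

```shell
#!/bin/sh
# Hypothetical helper: heartbeat COUNT INTERVAL [URL]
# Prints one line per iteration with a UTC timestamp, the machine uptime
# and, if a URL is given, whether a HEAD request to it succeeded.
heartbeat() {
    count=$1
    interval=$2
    url=${3:-}
    i=0
    while [ "$i" -lt "$count" ]; do
        # /proc/uptime is Linux-specific; fall back to "unknown" elsewhere.
        up=$(cut -d' ' -f1 /proc/uptime 2>/dev/null || echo unknown)
        if [ -n "$url" ]; then
            if curl --head --silent --max-time 5 --output /dev/null "$url"; then
                net=ok
            else
                net=fail
            fi
        else
            net=skipped
        fi
        printf '%s uptime=%s net=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$up" "$net"
        i=$((i + 1))
        sleep "$interval"
    done
}
```

Run inside tmux as, for example, `heartbeat 100000 5 https://example.com/` (placeholder URL). The last line seen before the disconnect then shows how long the VM had been up, and the web server's access log shows when the HEAD requests stopped.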

Open questions:

jelly commented 1 week ago

I reserved a test machine yesterday and just let it hang around. It wasn't killed after more than an hour, so that theory does not hold up. Also note that the ssh disconnect during tests happens quite "fast": if the logs are to be believed, this failed in 240 seconds.

So I'm currently running out of ideas about what could be causing the machine to go away while running tests.

martinpitt commented 1 week ago

I found a kernel oops in https://github.com/cockpit-project/cockpit-machines/pull/1909#issuecomment-2479150095 -- there's at least a chance that this is the same issue. Would match the symptoms! And I can reproduce it with the centos-10 image rebuild.
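To check whether a given console log contains a kernel oops or the coredump warning from the first comment, a simple grep may be enough. The sketch below uses a hypothetical function name, and the list of markers is an assumption based on common kernel message strings rather than on the oops in the linked comment.

```shell
#!/bin/sh
# Hypothetical helper: print matching lines (with line numbers) from a
# console log. The marker list is an assumption and may need extending.
scan_console_log() {
    grep -n -E 'coredump:|Oops|BUG:|Call Trace|Kernel panic' "$1"
}
```

After downloading a console log from the testing farm artifacts, for example with `curl -sSL -o console.log <log URL>`, `scan_console_log console.log` prints any hits. A non-zero exit status means none of the markers were found.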