Most recent test runs have not shown a coredump, but we still see the "ssh connection closed" issue during a run of the storage tests.
To try to reproduce the issue, I reserved a VM and ran:
dnf install -y podman cockpit-system cockpit-ws cockpit-bridge cockpit-machines virt-install dbus-tools firewalld libvirt-daemon-driver-storage-iscsi libvirt-daemon-driver-storage-logical git
git clone https://github.com/cockpit-project/cockpit-machines.git
cd cockpit-machines
mkdir -p /root/.ssh
curl https://raw.githubusercontent.com/cockpit-project/bots/main/machine/identity.pub >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys
useradd -c Administrator -G wheel admin
echo admin:foobar | chpasswd
echo root:foobar | chpasswd
su -c 'echo foobar | sudo --stdin whoami' - admin
podman pull ghcr.io/cockpit-project/tasks:2024-10-07
podman run --rm --shm-size=1024m --security-opt=label=disable --network=host --volume=/data:/logs:rw,U --env=LOGS=/logs --volume="$(pwd)":/source:rw,U --env=SOURCE=/source --volume=/usr/lib/os-release:/run/host/usr/lib/os-release:ro -ti ghcr.io/cockpit-project/tasks:2024-10-07 bash
TEST_OS=centos-10 TEST_BROWSER=firefox ./test/check-machines-disks -vst TestMachinesDisks.testDisks --machine localhost:22 --browser localhost:9090
To run the tests I have to start virtnetworkd and virtstoraged by hand, as for some reason the test does not start them.
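A minimal sketch of that manual step, run as root on the VM before starting the test. It assumes the systemd units carry the same names as the daemons (virtnetworkd, virtstoraged); the function name start_libvirt_daemons is made up for illustration.

```shell
# Sketch: start the libvirt daemons that the test does not start itself.
# Assumption: the systemd unit names match the daemon names.
start_libvirt_daemons() {
    for unit in virtnetworkd virtstoraged; do
        if ! systemctl start "$unit"; then
            echo "failed to start $unit" >&2
            return 1
        fi
    done
}
```

Calling start_libvirt_daemons once before the check-machines-disks invocation should be enough; it stops at the first unit that fails to start.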
I attempted this three times, and each time the connection was cleanly disconnected at a random point during the run. I did not record the uptime of the VM.
The fourth time, I started a tmux session running curl --head against my own website to see whether the whole machine dies or only sshd dies. When my ssh connection was cut, the curl HEAD requests also stopped, so the machine really goes offline. We also can't ping these machines from the outside (as we ssh into them via a jumphost).
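The curl check can be sketched as a small loop that also prints the VM uptime, which was not recorded in the earlier attempts. This is only an illustration: the function name heartbeat, the 5 second default interval and the output format are assumptions, and the URL is whatever site you control.

```shell
# Sketch: periodic HEAD request with a timestamp and the current uptime.
# Usage: heartbeat URL [INTERVAL_SECONDS] [COUNT]   (COUNT 0 = run forever)
heartbeat() {
    hb_url=$1
    hb_interval=${2:-5}
    hb_count=${3:-0}
    hb_i=0
    while [ "$hb_count" -eq 0 ] || [ "$hb_i" -lt "$hb_count" ]; do
        # --head sends a HEAD request; --max-time bounds a hanging connection
        if curl --head --silent --max-time 5 --output /dev/null "$hb_url"; then
            hb_status=ok
        else
            hb_status=fail
        fi
        # /proc/uptime is Linux-specific; fall back to "unknown" elsewhere
        hb_uptime=$(cut -d' ' -f1 /proc/uptime 2>/dev/null || echo unknown)
        printf '%s uptime=%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$hb_uptime" "$hb_status"
        hb_i=$((hb_i + 1))
        sleep "$hb_interval"
    done
}
```

Run inside tmux on the VM, for example `heartbeat https://example.com 5`, the last line visible before the ssh session drops gives an approximate uptime at the moment the machine went away, and the web server's access log shows when the requests stopped.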
Open questions:
I reserved a test machine yesterday and just let it hang around; it wasn't killed after more than an hour, so that theory does not hold up. Also note that the ssh disconnect during the test happens quite "fast": if the logs are to be believed, this run failed in 240 seconds.
So I'm currently running out of ideas on what could cause the machine to go away while running tests.
I found a kernel oops in https://github.com/cockpit-project/cockpit-machines/pull/1909#issuecomment-2479150095 -- there's at least a chance that this is the same issue. Would match the symptoms! And I can reproduce it with the centos-10 image rebuild.
See for example this PR and check the console log:
https://artifacts.dev.testing-farm.io/2790ec3f-7a63-4dcf-b98d-b62519ce604c/work-storagerwkkep52/console-97fac4f4-dcc6-4fb3-a90b-6c420c4f0a6a.log
To debug this, remember that we can reserve a test VM with the testing-farm utility.