ColinIanKing / stress-ng

This is the stress-ng upstream project git repository. stress-ng will stress test a computer system in various selectable ways. It was designed to exercise various physical subsystems of a computer as well as the various operating system kernel interfaces.
https://github.com/ColinIanKing/stress-ng
GNU General Public License v2.0

dev test will crash a VM with Noble OEM 6.10 kernel #407

Open Cypresslin opened 4 days ago

Cypresslin commented 4 days ago

Hi Colin, I found that the dev stressor smoke test will kill a Noble OEM 6.10 VM.

Steps:

# On a bare-metal machine running Noble
sudo apt install uvtool build-essential -y
sudo uvt-simplestreams-libvirt sync --source http://cloud-images.ubuntu.com/daily release=noble arch=amd64
SSH_KEY="$HOME/.ssh/id_rsa"
ssh-keygen -f "$SSH_KEY" -t rsa -N ''
sudo -u ubuntu uvt-kvm create oem610 release=noble arch=amd64 --memory 2048
sleep 60
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=quiet -i "$SSH_KEY" ubuntu@"$(sudo uvt-kvm ip oem610)"
# Inside the VM
sudo apt-add-repository ppa:canonical-kernel-team/ubuntu/proposed -y
sudo apt install kernel-testing--linux-oem-6.10--full--oem -y
sudo reboot
# Back on the host: wait for the VM to reboot, then reconnect
sleep 60
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=quiet -i "$SSH_KEY" ubuntu@"$(sudo uvt-kvm ip oem610)"
# Inside the VM again
git clone https://github.com/ColinIanKing/stress-ng.git
cd stress-ng; make
sudo ./stress-ng -v -t 5 --dev 4 --dev-ops 3000 --ignite-cpu --syslog --verbose --verify --oomable
# VM will be terminated here, you will have to restart it.
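
The steps above rely on a fixed `sleep 60` before each SSH connection, which can be too short on a slow host. As a rough sketch only, a small retry helper could poll until the connection succeeds. `retry_until` is a hypothetical name introduced here for illustration; it is not part of uvtool or stress-ng, and the timeout it enforces is approximate because it only counts the one-second sleeps between attempts.

```shell
# retry_until: hypothetical helper, not part of uvtool or stress-ng.
# Usage: retry_until <timeout_seconds> <command> [args...]
# Re-runs the command once per second until it succeeds, or gives up
# after roughly <timeout_seconds> seconds of waiting.
retry_until() {
    timeout_secs=$1
    shift
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout_secs" ]; then
            return 1    # gave up: the command never succeeded
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Example use on the host in place of "sleep 60" (assumes SSH_KEY is set
# as in the steps above; not executed here):
# retry_until 120 sh -c 'ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ConnectTimeout=5 -i "$0" ubuntu@"$(sudo uvt-kvm ip oem610)" true' "$SSH_KEY"
```

The example wraps the SSH call in `sh -c` so that `uvt-kvm ip` is re-evaluated on every attempt, since the address may not be available until the VM has booted.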

Test output:

$ sudo ./stress-ng -v -t 5 --dev 4 --dev-ops 3000 --ignite-cpu --syslog --verbose --verify --oomable
stress-ng: debug: [13723] invoked with './stress-ng -v -t 5 --dev 4 --dev-ops 3000 --ignite-cpu --syslog --verbose --verify --oomable' by user 0 'root'
stress-ng: debug: [13723] stress-ng 0.18.00 gb308ea3174f5
stress-ng: debug: [13723] system: Linux oem610 6.10.0-1005-oem #5-Ubuntu SMP PREEMPT_DYNAMIC Tue Jun 18 20:44:06 UTC 2024 x86_64, gcc 13.2.0, glibc 2.39, little endian
stress-ng: debug: [13723] RAM total: 1.9G, RAM free: 1.2G, swap free: 0.0
stress-ng: debug: [13723] temporary file path: '/home/ubuntu/stress-ng', filesystem type: ext2 (678274 blocks available)
stress-ng: debug: [13723] 1 processor online, 1 processor configured
stress-ng: info:  [13723] setting to a 5 secs run per stressor
stress-ng: debug: [13723] CPU data cache: L1: 32K, L2: 4096K, L3: 16384K
stress-ng: debug: [13723] cache allocate: shared cache buffer size: 16384K
stress-ng: info:  [13723] dispatching hogs: 4 dev
stress-ng: debug: [13723] starting stressors
stress-ng: debug: [13724] dev: [13724] started (instance 0 on CPU 0)
stress-ng: debug: [13723] 4 stressors started
stress-ng: debug: [13725] dev: [13725] started (instance 1 on CPU 0)
stress-ng: debug: [13726] dev: [13726] started (instance 2 on CPU 0)
stress-ng: debug: [13727] dev: [13727] started (instance 3 on CPU 0)
(VM terminated)

This issue also reproduces with V0.17.08. The same test passes on bare metal with the same kernel.

ColinIanKing commented 4 days ago

I think this is a kernel bug in the DRI driver. I've narrowed it down to /dev/dri/card1. Try the following; it reproduces the crash for me with the kernel-testing--linux-oem-6.10 kernel:

while true; do sudo ./stress-ng --dev 4 --dev-file /dev/dri/card1 -t 5; done

From my observations, it occurs when the devices are being closed.
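
Because the VM is terminated without warning, it can be hard to tell how many iterations of the loop it took. The following is a minimal sketch of a bounded variant that appends the iteration number to a log file and calls `sync` before each run, so the count has a chance of surviving the crash. `run_counted` and the log path in the example are hypothetical names chosen for illustration, and whether the last log line reaches disk before the VM dies is not guaranteed.

```shell
# run_counted: hypothetical helper for illustration only.
# Usage: run_counted <max_iterations> <logfile> <command> [args...]
# Runs the command up to <max_iterations> times, logging each iteration
# number before the command starts. Stops early if the command fails.
run_counted() {
    max_iters=$1
    logfile=$2
    shift 2
    i=0
    while [ "$i" -lt "$max_iters" ]; do
        i=$((i + 1))
        # Record the iteration first and flush to disk, so the count is
        # more likely to survive if the VM is terminated mid-run.
        echo "iteration $i" >> "$logfile"
        sync
        "$@" || return 1
    done
    return 0
}

# Example use inside the VM (not executed here):
# run_counted 100 "$HOME/dev-loop.log" sudo ./stress-ng --dev 4 --dev-file /dev/dri/card1 -t 5
```

After restarting the VM, the last line of the log file gives the iteration that was in progress when it went down.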

ColinIanKing commented 4 days ago

Can trip it using a single stressor instance too:

while true; do sudo ./stress-ng --dev 1 --dev-file /dev/dri/card1 -t 5; done

ColinIanKing commented 4 days ago

Looks like a race in /dev/dri/card1 open/close. Here is a very simple reproducer, run as root:

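#include <fcntl.h>
#include <unistd.h>

int main(void)
{
     /* Fork once so that two processes open and close the device concurrently */
     pid_t pid = fork();

     (void)pid;     /* both parent and child run the same loop below */

     while (1) {
        int fd;

        /* Open the DRI card device, then close it again immediately */
        fd = openat(AT_FDCWD, "/dev/dri/card1", O_WRONLY | O_NONBLOCK | O_SYNC);
        if (fd >= 0)
           close(fd);
     }
}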

Definitely a kernel bug :-(

Cypresslin commented 3 days ago

OK thanks for the investigation, I will give it a try with the mainline kernel.

Cypresslin commented 3 days ago

Yes, I can reproduce this issue with 6.10.0-061000rc4-generic (there are no debs for v6.10-rc6 amd64).

ColinIanKing commented 3 days ago

If you have an upstream kernel bug number or a Launchpad bug number for this, please add it to this bug report so we can keep things tracked.

Cypresslin commented 3 days ago

Oh, I do have one Launchpad bug report: https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/2071756

ColinIanKing commented 1 day ago

Tested on a vanilla 6.10-rc6 kernel, reported upstream: https://bugzilla.kernel.org/show_bug.cgi?id=219007