ColinIanKing / stress-ng

This is the stress-ng upstream project git repository. stress-ng will stress test a computer system in various selectable ways. It was designed to exercise various physical subsystems of a computer as well as the various operating system kernel interfaces.
https://github.com/ColinIanKing/stress-ng
GNU General Public License v2.0

Left-over process found with V0.14.6 #242

Status: Closed (Cypresslin closed this issue 2 years ago)

Cypresslin commented 2 years ago

Hi Colin,

I found this issue while testing the V0.14.6 update.

This can be spotted on some of our testing nodes with Ubuntu Jammy 5.15.0-53.59, including:

Although the test suite finishes without any errors, some left-over processes remain and prevent our autotest framework process from finishing cleanly:

 Summary:
   Stressors run: 244
   Skipped: 4,  binderfs pci plugin smi
   Failed:  0, 
   Oopsed:  0, 
   Oomed:   0, 
   Passed:  240,  access af-alg affinity aio aiol alarm bad-altstack bad-ioctl bigheap branch brk cache cacheline cap chattr chdir chmod chown chroot clock close context cpu crypt cyclic daemon dccp dekker dentry dev dev-shm dir dirdeep dirmany dnotify dup dynlib enosys env epoll eventfd exit-group fallocate fanotify fault fcntl fiemap fifo file-ioctl filename flock fork fp-error fpunch fstat full funcret futex get getdent getrandom goto gpu handle hash hdd hrtimers icache icmp-flood inode-flags inotify io iomix ioprio io-uring ipsec-mb itimer jpeg judy key kill klog kvm landlock lease link list loadavg locka lockbus lockf lockofd loop madvise malloc mcontend membarrier memfd memhotplug memrate memthrash mergesort mincore misaligned mknod mlock mmap mmapaddr mmapfixed mmapfork mmaphuge mmapmany mprotect mq mremap msg msync msyncmany munmap mutex nanosleep netdev netlink-proc netlink-task nice null open pagemove pageswap personality peterson physpage pidfd ping-sock pipe pipeherd pkey poll prctl prefetch procfs pthread ptrace pty radixsort randlist ramfs rawdev rawpkt rawsock rawudp readahead reboot regs rename resched revio rlimit rmap rseq rtc schedpolicy sctp seal seccomp secretmem seek sem sem-sysv sendfile session set shellsort shm shm-sysv sigabrt sigchld sigfd sigfpe sigio signal signest sigpending sigpipe sigq sigrt sigsegv sigsuspend sigtrap skiplist sleep sock sockabuse sockdiag sockmany softlockup sparsematrix splice stackmmap stream swap switch symlink sync-file syncload sysbadaddr syscall sysfs tee timer timerfd tlb-shootdown tmpfs touch tree tsearch tun udp udp-flood unshare urandom userfaultfd usersyscall utime vdso vecfp vecshuf vecwide verity vfork vm vm-addr vm-rw vm-segv vm-splice wait x86syscall yield zero zombie
   Badret:  0, 

 Tests took 440 seconds to run
$ ps aux | grep stress
root        1615  0.0  0.3  26920 13228 ?        S    09:16   0:00 /usr/bin/python2 -u autotest/client/autotest-local --verbose autotest/client/tests/ubuntu_stress_smoke_test/control
root        1616  0.0  0.3  26920 13228 ?        S    09:16   0:00 /usr/bin/python2 -u autotest/client/autotest-local --verbose autotest/client/tests/ubuntu_stress_smoke_test/control
root      197915  0.0  0.0  20648   476 ?        S    09:23   0:00 stress-ng-syscall [run]
root      197928  0.0  0.0  20648   476 ?        S    09:23   0:00 stress-ng-syscall [run]
ubuntu    240188  0.0  0.0   6472  1892 pts/0    R+   09:29   0:00 grep --color=auto stress

This makes the Jenkins job hang until it is eventually killed by the timeout setting on Jenkins.
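A test harness could automate the `ps aux | grep stress` check above. The following is a minimal sketch, assuming a Linux `/proc` filesystem; the function name `find_leftovers` and its `proc_root` parameter are hypothetical and not part of stress-ng or autotest. It matches on the start of the command line, so the autotest `python2` processes that merely mention `ubuntu_stress_smoke_test` are not reported.

```python
import os


def find_leftovers(prefix="stress-ng", proc_root="/proc"):
    """Return sorted (pid, state, cmdline) tuples for processes whose
    command line starts with `prefix`. Hypothetical helper, not a
    stress-ng or autotest API."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited or is not readable
        # Arguments in cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        if not cmdline.startswith(prefix):
            continue
        state = "?"
        try:
            with open(os.path.join(base, "status")) as f:
                for line in f:
                    if line.startswith("State:"):
                        state = line.split(":", 1)[1].strip()
                        break
        except OSError:
            pass  # keep "?" if the status file is unavailable
        found.append((int(entry), state, cmdline))
    return sorted(found)


if __name__ == "__main__":
    for pid, state, cmdline in find_leftovers():
        print(pid, state, cmdline)
```

An empty result after the test run means no stressor was left behind; a non-empty one gives the PIDs to inspect further.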

It looks like the cause is the syscall stressor, and my bisect result suggests the same:

8d25c65cdc355edd49e895c921f2766a0ee3334d is the first bad commit
commit 8d25c65cdc355edd49e895c921f2766a0ee3334d
Author: Colin Ian King <colin.i.king@gmail.com>
Date:   Thu Sep 1 21:13:37 2022 +0100

    stress-syscall: Add new stressor to hammer a range of system calls

    Signed-off-by: Colin Ian King <colin.i.king@gmail.com>

Thanks

ColinIanKing commented 2 years ago

Thanks for the report, I will investigate this. Can you attach strace to the running stressors to see if there is a specific system call it gets locked up on? e.g. sudo strace -p stress-ng-process-id
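Where attaching strace is inconvenient, Linux also exposes the system call a blocked process is sitting in through `/proc/<pid>/syscall`. The sketch below is only an illustration of reading that file, assuming the kernel provides it and the caller has permission to read it (typically root); the helper name `current_syscall` and its `proc_root` parameter are hypothetical. The returned number is architecture-specific, so it has to be looked up in the syscall table for the machine under test.

```python
import os


def current_syscall(pid, proc_root="/proc"):
    """Return (syscall_number, args) for a process blocked in a system
    call, or None if it is running or blocked outside a system call.
    Hypothetical helper for illustration only."""
    with open(os.path.join(proc_root, str(pid), "syscall")) as f:
        fields = f.read().split()
    # The file holds "running", or "-1 <sp> <pc>" when the process is
    # blocked but not inside a system call.
    if not fields or fields[0] in ("running", "-1"):
        return None
    number = int(fields[0])
    # Up to six argument registers follow, printed in hexadecimal.
    args = [int(value, 16) for value in fields[1:7]]
    return number, args
```

For example, `current_syscall(197915)` could be called as root on one of the left-over PIDs from the report.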

ColinIanKing commented 2 years ago

All processes that are running are blocked on a rt_sigsuspend([], 8...)

ColinIanKing commented 2 years ago

Fix committed:

commit cacea4982733e02af4e29e6d2dbd3d687af2b89b (HEAD -> master)
Author: Colin Ian King <colin.i.king@gmail.com>
Date:   Mon Nov 14 12:54:51 2022 +0000

    stress-syscall: terminate sigsuspend syscall child proceses

ColinIanKing commented 2 years ago

Can you apply this to your repo? It may need a minor amount of wiggling. I was able to reproduce the problem on a 24-thread ARM dev box, and with the fix it no longer occurs.
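The general pattern behind this kind of fix can be sketched as follows: a child that blocks waiting for a signal never exits on its own, so the parent has to terminate it explicitly and then reap it. This is not the actual stress-ng patch, only an illustration under stated assumptions: `signal.pause()` stands in for `sigsuspend(2)`, which Python's `signal` module does not appear to expose, SIGKILL is an assumed choice of signal, and the function names are hypothetical.

```python
import os
import signal


def spawn_blocked_child():
    """Fork a child that waits for signals forever, standing in for a
    child stuck in sigsuspend(). Returns the child's PID to the parent."""
    pid = os.fork()
    if pid == 0:
        try:
            while True:
                signal.pause()  # sleeps until a signal is delivered
        finally:
            os._exit(0)  # never fall back into the parent's code path
    return pid


def terminate_and_reap(pid):
    """Kill the blocked child and collect its exit status so that no
    left-over or zombie process remains."""
    os.kill(pid, signal.SIGKILL)
    _, status = os.waitpid(pid, 0)
    return status


if __name__ == "__main__":
    child = spawn_blocked_child()
    status = terminate_and_reap(child)
    print("child", child, "terminated by signal", os.WTERMSIG(status))
```

Without the `terminate_and_reap` step, the child would stay in the process table after the parent finished, which matches the symptom reported above.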

Cypresslin commented 2 years ago

Hey Colin, that's super fast. I have verified this fix on one of our zVMs and the hang no longer occurs. Thank you! Sam