bakkeby / dwm-flexipatch

A dwm build with preprocessor directives to decide which patches to include during build time
MIT License

dwm terminated with signal SIGSEGV, segmentation fault #436

Open adolfgatonegro opened 2 weeks ago

adolfgatonegro commented 2 weeks ago

Hey, I'm running into an issue with dwm, similar to #324.

SYSTEM: Arch Linux
KERNEL: 6.11.2-arch1-1
NVIDIA DRIVER: 560.35.03-11
XORG X SERVER: 21.1.13-1
DWM VERSION: dwm-6.5 (last commit: 36cbcf53a232818e5d523dd0337bb635556e91ef)

My regular build uses flexipatch, though I'm also seeing the issue with the latest unmodified dwm from upstream.

Issue

Installing after compilation, with sudo make install, causes dwm to crash, dropping me to the TTY. Sometimes it happens right after the install finishes, sometimes it takes a couple of seconds; regardless it crashes every time without further input on my part (not even triggering a restart of dwm myself).

Additional info

At first this happened only on my desktop, which has an NVIDIA GPU, while the same build on my laptop, with AMD graphics, seemed to work correctly. Never mind, it is now happening on both of my systems.

This was not an issue before the kernel 6.11 update. I had been using dwm-flexipatch based on dwm 6.4 since early last year, and everything worked fine. The issue started happening with my 6.4 build, and remains after a fresh build of 6.5.

I can reliably reproduce the issue with an unmodified build of dwm-flexipatch, without any customisation or patching, so it does not seem like an issue with any particular patch I'm using.

I've managed to dig up the following information. I'm not a developer and have no experience debugging software, so I might be missing something obvious.

  1. dmesg log
[  506.820021] dwm[853]: segfault at 54b6 ip 00000000000054b6 sp 00007fff284aa998 error 14 likely on CPU 10 (core 4, socket 0)
[  506.820031] Code: Unable to access opcode bytes at 0x548c.
  2. Debugging coredump with gdb
Core was generated by `dwm'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000000054b6 in XNextEvent@plt ()
(gdb) bt
#0  0x00000000000054b6 in XNextEvent@plt ()
#1  0x0000577c31be88a8 in ?? ()
#2  0x0000000100000001 in ?? ()
#3  0x0001000100000003 in ?? ()
#4  0x00007b070000000e in ?? ()
#5  0x00000000000007a2 in ?? ()
#6  0x0000000000000000 in ?? ()
(gdb) 

This is as far as I've got. I assume XNextEvent is related to Xorg in some way, but I have not been able to find any references to issues like this.

Do let me know if there's anything else I can look at, and apologies if this is not the right place to submit this issue. Seems to affect upstream as well, but maybe something can come out of posting here.

Cheers

rayvermey commented 2 weeks ago

This has been happening since kernel 6.11. Why, I do not know. When you downgrade to kernel 6.10, all is back to normal.

bakkeby commented 2 weeks ago

The only report I have so far is that this happens with Kernel 6.11 and does not happen with Kernel 6.10. This also happens with a bare dwm.

The crash / segmentation violation seems to be in relation to the binary file being overwritten, as manually moving the old file away (from /usr/local/bin) before compiling seemingly mitigates the issue.

I'll let you know once I know more.

adolfgatonegro commented 2 weeks ago

This is indeed an issue with 6.11. Rolling back to 6.10 prevents this from happening, as does using the 6.6 LTS kernel, which is what I'm currently doing.

Thanks for looking into it, mate. Let me know if I can provide any additional info or test anything to help.

gozenka commented 2 weeks ago

Kernel 6.11.1-arch1-1. Happens with plain dwm and with some light patching. Intel iGPU: Intel Corporation HD Graphics 630

Oct 07 03:47:27 zn systemd[1]: Starting /usr/bin/make install...
Oct 07 03:47:27 zn systemd[1]: Started /usr/bin/make install.
Oct 07 03:47:27 zn systemd[1]: run-u44.service: Deactivated successfully.
Oct 07 03:47:27 zn kernel: dwm[728]: segfault at 819e ip 000000000000819e sp 00007ffe5cf6f988 error 14 likely on CPU 2 (core 2, socket 0)
Oct 07 03:47:27 zn kernel: Code: Unable to access opcode bytes at 0x8174.

I get similar output when running make install for aslstatus, which I tried in order to check the behaviour with another application.

Oct 07 04:08:15 zn kernel: temperature[4248]: segfault at 55e7 ip 00000000000055e7 sp 0000765ed57ffd58 error 14 likely on CPU 3 (core 3, socket 0)
Oct 07 04:08:15 zn kernel: Code: Unable to access opcode bytes at 0x55bd.

bakkeby commented 2 weeks ago

Running lsof showed that the process holds a file descriptor of type "mem" pointing to the binary file.

$ sudo lsof | grep -E "COMMAND|/usr/local/bin/dwm"
COMMAND     PID   TID TASKCMD               USER  FD      TYPE             DEVICE  SIZE/OFF       NODE NAME
dwm        2697                         sbakkeby txt       REG               0,27    448376   30882447 /usr/local/bin/dwm
dwm        2697                         sbakkeby mem       REG               0,26             30882447 /usr/local/bin/dwm (path dev=0,27)

I am assuming that this is a new thing in Kernel 6.11.
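
For anyone without lsof, the file-backed mappings of a process can also be read from /proc. This is only a rough sketch, assuming Linux with /proc mounted; it inspects the shell's own PID as a stand-in, so substitute the PID of dwm (2697 in the output above) to look at the real process.

```shell
#!/bin/sh
# Assumption: Linux with /proc mounted.
# $$ (this shell's own PID) is only a stand-in; use the PID of dwm instead.
pid=$$

# Path of the executable the process was started from.
exe=$(readlink "/proc/$pid/exe")
echo "executable: $exe"

# Every matching line is one memory mapping backed by that binary file.
maps=$(grep -F -- "$exe" "/proc/$pid/maps")
echo "$maps"
```

Each line of output corresponds to a region of the binary that is mapped into the process, which is roughly what lsof reports as "txt" and "mem".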

My interpretation of what is happening here is that when we recompile and install dwm, the contents of the binary file (/usr/local/bin/dwm) are overwritten in place, ultimately causing a segmentation fault in the running process that still has that file mapped into memory.
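
The in-place overwrite can be reproduced outside of dwm. This is a minimal sketch, assuming GNU coreutils (for stat -c) and a scratch directory; the file names are hypothetical stand-ins, and an open file descriptor plays the role of the running process. It only shows that the holder and the installer end up sharing one inode, not the segfault itself.

```shell
#!/bin/sh
# Hypothetical stand-in files in a scratch directory; not the real dwm binary.
tmp=$(mktemp -d)
printf 'old build\n' > "$tmp/dwm"
printf 'new build\n' > "$tmp/dwm.new"

# Keep the installed file open, as a running process would.
exec 3< "$tmp/dwm"
ino_before=$(stat -c %i "$tmp/dwm")

# cp -f onto an existing, writable file truncates and rewrites it in place.
cp -f "$tmp/dwm.new" "$tmp/dwm"
ino_after=$(stat -c %i "$tmp/dwm")

# The descriptor opened before the copy now reads the new contents.
seen=$(cat <&3)

echo "inode before: $ino_before, after: $ino_after"
echo "holder sees: $seen"

# Clean up the descriptor and the scratch directory.
exec 3<&-
rm -rf "$tmp"
```

The inode number stays the same across the copy, so whatever the holder reads through its existing handle is the freshly written data rather than the file it originally opened.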

A quick workaround for this issue is to delete the original file before we copy the new file.

diff --git a/Makefile b/Makefile
index ffa69b4..c5e7554 100644
--- a/Makefile
+++ b/Makefile
@@ -32,6 +32,7 @@ dist: clean

 install: all
        mkdir -p ${DESTDIR}${PREFIX}/bin
+       rm -f ${DESTDIR}${PREFIX}/bin/dwm
        cp -f dwm ${DESTDIR}${PREFIX}/bin
        chmod 755 ${DESTDIR}${PREFIX}/bin/dwm
        mkdir -p ${DESTDIR}${MANPREFIX}/man1
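
Why the extra rm helps can be sketched with stand-in files. This assumes GNU coreutils (for stat -c); the file names are hypothetical, and an open file descriptor stands in for the running dwm process.

```shell
#!/bin/sh
# Hypothetical stand-in files in a scratch directory; not the real dwm binary.
tmp=$(mktemp -d)
printf 'old build\n' > "$tmp/dwm"
printf 'new build\n' > "$tmp/dwm.new"

# Keep the installed file open, as a running process would.
exec 3< "$tmp/dwm"
ino_old=$(stat -c %i "$tmp/dwm")

# Same order as the patched install target: unlink first, then copy.
rm -f "$tmp/dwm"
cp -f "$tmp/dwm.new" "$tmp/dwm"
ino_new=$(stat -c %i "$tmp/dwm")

# The holder still reads the unlinked file; the path now names a new one.
held=$(cat <&3)
installed=$(cat "$tmp/dwm")

echo "old inode: $ino_old, new inode: $ino_new"
echo "holder sees: $held"
echo "path contains: $installed"

# Clean up the descriptor and the scratch directory.
exec 3<&-
rm -rf "$tmp"
```

Because the old inode stays allocated for as long as something holds it, the copy has to create a new file, and the holder keeps its original data untouched until it exits.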

Here is what the lsof output looks like after the file has been deleted (or is moved).

$ sudo lsof | grep -E "COMMAND|/usr/local/bin/dwm"
COMMAND     PID   TID TASKCMD               USER  FD      TYPE             DEVICE  SIZE/OFF       NODE NAME
dwm        2697                         sbakkeby txt       REG               0,27    448376   30882447 /usr/local/bin/dwm (deleted)
dwm        2697                         sbakkeby DEL       REG               0,26             30882447 /usr/local/bin/dwm

gozenka commented 2 weeks ago

I'm just following this out of curiosity, I do not know much about what I am doing. I thought of checking lsof too but didn't know what to do with the output.

Are any of the reports from distros other than Arch Linux?

In case it might offer more clues, here is some information from my system:

So, those might be unrelated? Maybe related to the filesystem? I use ext4. In case it might be related to swap, zram, etc., I have none of those on my system. I also tried suspend / wakeup, no difference.

% sudo lsof | grep -iE "COMMAND|bin/dwm"
COMMAND    PID  TID TASKCMD               USER  FD      TYPE             DEVICE   SIZE/OFF       NODE NAME
dwm       1834                              km txt       REG              254,0      67920    3543170 /usr/local/bin/dwm

% sudo rm -f /usr/local/bin/dwm

% sudo lsof | grep -iE "COMMAND|bin/dwm"
COMMAND    PID  TID TASKCMD               USER  FD      TYPE             DEVICE   SIZE/OFF       NODE NAME
dwm       1834                              km txt       REG              254,0      67920    3543170 /usr/local/bin/dwm (deleted)

Trying with other applications:

pacman nsxiv:

% sudo lsof | grep -iE "COMMAND|bin/nsxiv"
COMMAND    PID  TID TASKCMD               USER  FD      TYPE             DEVICE   SIZE/OFF       NODE NAME
nsxiv     3837                              km txt       REG              254,0      88712    3546241 /usr/bin/nsxiv

% sudo pacman -S nsxiv

% sudo lsof | grep -iE "COMMAND|bin/nsxiv"
COMMAND    PID  TID TASKCMD               USER  FD      TYPE             DEVICE   SIZE/OFF       NODE NAME
nsxiv     3837                              km txt       REG              254,0      88712    3546241 /usr/bin/nsxiv (deleted)

git nsxiv:

Oct 07 19:37:42 zn kernel: nsxiv[3677]: segfault at 5aa6 ip 0000000000005aa6 sp 00007fff7170e3e8 error 14 likely on CPU 3 (core 3, socket 0)
Oct 07 19:37:42 zn kernel: Code: Unable to access opcode bytes at 0x5a7c.

aslstatus:

Oct 07 19:43:56 zn kernel: cpu_percentage[3771]: segfault at 3dac ip 0000000000003dac sp 00007b49c3dffd58 error 14 likely on CPU 2 (core 2, socket 0)
Oct 07 19:43:56 zn kernel: Code: Unable to access opcode bytes at 0x3d82.

[...]

Oct 07 19:45:24 zn kernel: temperature[4276]: segfault at 55e7 ip 00000000000055e7 sp 0000763b359ffd58 error 14
Oct 07 19:45:24 zn kernel: ram_used[4277]: segfault at 5481 ip 0000000000005481 sp 0000763b34fffd58 error 14
Oct 07 19:45:24 zn kernel:  likely on CPU 0 (core 0, socket 0)
Oct 07 19:45:24 zn kernel:  likely on CPU 2 (core 2, socket 0)
Oct 07 19:45:24 zn kernel:
Oct 07 19:45:24 zn kernel: Code: Unable to access opcode bytes at 0x55bd.
Oct 07 19:45:24 zn kernel: Code: Unable to access opcode bytes at 0x5457.