amonakov / primus

Low-overhead client-side GPU offloading
ISC License

Segment fault on Xorg 1.20 #201

Closed by AlynxZhou 6 years ago

AlynxZhou commented 6 years ago

I am using Arch Linux. Since I upgraded xorg-server to 1.20, primusrun always segfaults, while optirun does not. Here is my journal:

May 21 15:46:47 pendragon kernel: bbswitch: enabling discrete graphics
May 21 15:46:47 pendragon kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 240
May 21 15:46:47 pendragon kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  396.24  Thu Apr 26 00:10:09 PDT 2018 (using threaded interrupts)
May 21 15:46:47 pendragon kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  396.24  Wed Apr 25 23:54:18 PDT 2018
May 21 15:46:47 pendragon kernel: nvidia-modeset: Allocated GPU:0 (GPU-2ffa395f-25a1-eaab-d6f5-1a2531e2cda8) @ PCI:0000:01:00.0
May 21 15:46:47 pendragon kernel: nvidia-modeset: Freed GPU:0 (GPU-2ffa395f-25a1-eaab-d6f5-1a2531e2cda8) @ PCI:0000:01:00.0
May 21 15:46:47 pendragon bumblebeed[438]: [   46.065467] [WARN][XORG] (WW) Open ACPI failed (/var/run/acpid.socket) (No such file or directory)
May 21 15:46:47 pendragon bumblebeed[438]: [   46.065482] [WARN][XORG] (WW) Warning, couldn't open module mouse
May 21 15:46:47 pendragon bumblebeed[438]: [   46.065491] [WARN][XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 21 15:46:47 pendragon bumblebeed[438]: [   46.065496] [WARN][XORG] (WW) NVIDIA(0): Option "NoLogo" is not used
May 21 15:46:47 pendragon bumblebeed[438]: [   46.065504] [WARN][XORG] (WW) Warning, couldn't open module mouse
May 21 15:46:47 pendragon bumblebeed[438]: [   46.065509] [ERROR][XORG] (EE) PreInit returned 2 for "<default pointer>"
May 21 15:46:47 pendragon bumblebeed[438]: [   46.065513] [ERROR][XORG] (EE) PreInit returned 2 for "<default keyboard>"
May 21 15:46:47 pendragon kernel: nvidia-modeset: Allocated GPU:0 (GPU-2ffa395f-25a1-eaab-d6f5-1a2531e2cda8) @ PCI:0000:01:00.0
May 21 15:46:47 pendragon kernel: nvidia-modeset: Freed GPU:0 (GPU-2ffa395f-25a1-eaab-d6f5-1a2531e2cda8) @ PCI:0000:01:00.0
May 21 15:46:47 pendragon kernel: glxgears[1432]: segfault at 74 ip 00007f5404112151 sp 00007f5400623b20 error 4 in i965_dri.so[7f5403f3a000+874000]
May 21 15:46:47 pendragon systemd[1]: Started Process Core Dump (PID 1434/UID 0).
May 21 15:46:48 pendragon kernel: nvidia-modeset: Unloading
May 21 15:46:48 pendragon systemd-coredump[1435]: Process 1416 (glxgears) of user 1000 dumped core.

                                                  Stack trace of thread 1432:
                                                  #0  0x00007f5404112151 n/a (i965_dri.so)
                                                  #1  0x00007f5404314f8d n/a (i965_dri.so)
                                                  #2  0x00007f5408c09425 n/a (libGL.so.1)
                                                  #3  0x00007f5407f3c075 start_thread (libpthread.so.0)
                                                  #4  0x00007f540824b53f __clone (libc.so.6)

                                                  Stack trace of thread 1416:
                                                  #0  0x00007f5407f44856 do_futex_wait.constprop.1 (libpthread.so.0)
                                                  #1  0x00007f5407f44958 __new_sem_wait_slow.constprop.0 (libpthread.so.0)
                                                  #2  0x00007f5408c0a4ac glXSwapBuffers (libGL.so.1)
                                                  #3  0x000056472e673a27 n/a (glxgears)
                                                  #4  0x00007f540817606b __libc_start_main (libc.so.6)
                                                  #5  0x000056472e67408a n/a (glxgears)

                                                  Stack trace of thread 1433:
                                                  #0  0x00007f5407f44856 do_futex_wait.constprop.1 (libpthread.so.0)
                                                  #1  0x00007f5407f44958 __new_sem_wait_slow.constprop.0 (libpthread.so.0)
                                                  #2  0x00007f5408c0aa7c n/a (libGL.so.1)
                                                  #3  0x00007f5407f3c075 start_thread (libpthread.so.0)
                                                  #4  0x00007f540824b53f __clone (libc.so.6)
May 21 15:46:48 pendragon kernel: nvidia-nvlink: Unregistered the Nvlink Core, major device number 240
May 21 15:46:48 pendragon kernel: bbswitch: disabling discrete graphics
May 21 15:46:48 pendragon kernel: pci 0000:01:00.0: Refused to change power state, currently in D0

Thanks if anyone can help.
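The kernel line in the journal above names the module where the fault occurred: `i965_dri.so`, Mesa's Intel DRI driver. As a minimal sketch, assuming the kernel keeps the `segfault at ... in <module>[...]` layout shown in that log, the module name can be pulled out of the journal like this. The function name `faulting_module` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print the module
# named in each "segfault at ... in <module>[addr+size]" message.
faulting_module() {
    # Capture the text between the last " in " and the following "[".
    sed -n 's/.*segfault at .* in \([^[]*\)\[.*/\1/p'
}

# Possible usage (kernel messages from the current boot):
#   journalctl -k -b | faulting_module
```

Running the journal line from this report through it should print the driver name rather than anything belonging to primus itself, which fits the later suspicion that Mesa is involved.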

NerosTie commented 6 years ago

Same issue with xorg-server 1.20.0-6 on Arch.

CrafterSvK commented 6 years ago

Same issue on Arch

olekzonder commented 6 years ago

I have fixed this issue by downgrading Mesa to 18.0.4, which works just fine. I suppose we should wait for a newer Mesa release to fix this mess...

NerosTie commented 6 years ago

@turboasm123 But are they aware of this bug?

olekzonder commented 6 years ago

@NerosTie Yes, they are. There is already a bug reported on bugs.archlinux.org:

https://bugs.archlinux.org/task/58933

CrafterSvK commented 6 years ago

What if Mesa is not the problem? Primus is 2 years old. Shouldn't we update it to the latest Mesa specification rather than wait for them?

Remove incomplete GLX_SGIX_swap_barrier stubs from the Xlib libGL
Remove incomplete GLX_SGIX_swap_group stubs from the Xlib libGL

They removed these stubs from libGL, which may cause problems, though that seems unlikely.

juliotux commented 6 years ago

Xorg 1.20 now has the following (from https://www.phoronix.com/scan.php?page=news_item&px=X.Org-Server-1.20-Features):

Server-side GLVND "GLXVND" for allowing different OpenGL drivers to back different X screens. This should help in some multi-GPU setups and other combinations.

Wouldn't this be the time to update primusrun to use this feature?

olekzonder commented 6 years ago

@juliotux I'm sure it is used by nvidia-xrun, but it's not better than primus, since you have to start an X-session and switch every time you want to run anything.

juliotux commented 6 years ago

@turboasm123 nvidia-xrun does not use GLXVND. It just starts a new X session with the nvidia xorg config. As announced, GLXVND runs in the same X session, but on different screens. The challenge would be to bring the rendering of this second screen to the primary one. Another option is to wrap around GLVND, as suggested by other users, but I couldn't find any good documentation about it.

vezaynk commented 6 years ago

Confirming "solution" from linked arch bug tracker. Downgrading mesa fixes the issue.

So... Probably not a primus issue. Mesa is definitely a culprit though.

olekzonder commented 6 years ago

I have also reported the bug on the Mesa bug tracker. Not much response, but still, there is some activity. Here is the link: https://bugs.freedesktop.org/show_bug.cgi?id=106910

vezaynk commented 6 years ago

It's frustrating how slowly this is all moving. It feels like one of those issues which will be left broken for years.

On the bright side, Optimus seems to be working fine.

CrafterSvK commented 6 years ago

Yes, on the bright side. But there are certain applications that won't work because they need LD_PRELOAD or something like that. Optimus won't pass these arguments and the application will crash. Mount & Blade: Warband is one of them. That's why I like primus a lot more.

vezaynk commented 6 years ago

I have received a response via email which (for some reason?) has been removed.

Reposting:

@chriscjs: PRIMUS_UPLOAD=1 primusrun glxspheres64 solves this issue for me. I did not have to downgrade xorg or mesa.

I am confirming that it seems to work yet I have trouble finding any documentation as to what "PRIMUS_UPLOAD" does.

AlynxZhou commented 6 years ago

@knyzorg Seems to work for me...

chriscjs commented 6 years ago

@knyzorg I had removed the post after someone else posted on https://bugs.archlinux.org/task/58933 that it caused performance degradation compared to optirun, although it does not on my system. Thanks for reposting it. The following text is from the primusrun script in /usr/bin which might explain it a bit:

Upload/display method 0: autodetect, 1: textures, 2: PBO/glDrawPixels (needs Mesa-10.1+)
export PRIMUS_UPLOAD=${PRIMUS_UPLOAD:-0}

A Google search on glDrawPixels says it has been removed from OpenGL 3.2 and above (core profile), and maybe that's one reason for this issue.
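For readers unfamiliar with the `${PRIMUS_UPLOAD:-0}` form in the quoted line: it expands to the caller's value when the variable is set and non-empty, and to `0` otherwise. A minimal sketch of that behaviour follows. The function name `upload_mode_name` is hypothetical and not part of primus; the labels are copied from the quoted comment.

```shell
#!/bin/sh
# Hypothetical helper, not part of primus: map a PRIMUS_UPLOAD value to the
# label used in the primusrun comment quoted above.
upload_mode_name() {
    # ${1:-0} falls back to 0 when the argument is unset or empty, the same
    # expansion that appears in the quoted export line.
    case "${1:-0}" in
        0) echo "autodetect" ;;
        1) echo "textures" ;;
        2) echo "PBO/glDrawPixels" ;;
        *) echo "unknown" ;;
    esac
}

# Possible usage: show which mode the current environment selects.
#   upload_mode_name "${PRIMUS_UPLOAD:-}"
```

So running `primusrun` with no `PRIMUS_UPLOAD` in the environment ends up on the autodetect path, while `PRIMUS_UPLOAD=1 primusrun ...` selects the texture path explicitly.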

AlynxZhou commented 6 years ago

@chriscjs I have tested them on my PC: =1 gives the best performance, =2 works but is not as good as =1, and optirun is the worst. =0 gets a segmentation fault.

Here are results:

[alynx@pendragon:~] % vblank_mode=0 PRIMUS_UPLOAD=0 primusrun glxgears
ATTENTION: default value of option vblank_mode overridden by environment.
ATTENTION: default value of option vblank_mode overridden by environment.
zsh: segmentation fault (core dumped)  vblank_mode=0 PRIMUS_UPLOAD=0 primusrun glxgears
[alynx@pendragon:~]! % vblank_mode=0 PRIMUS_UPLOAD=2 primusrun glxgears
ATTENTION: default value of option vblank_mode overridden by environment.
ATTENTION: default value of option vblank_mode overridden by environment.
22144 frames in 5.0 seconds = 4428.668 FPS
18780 frames in 5.0 seconds = 3755.991 FPS
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
      after 32 requests (32 known processed) with 0 events remaining.
X Error of failed request:  BadWindow (invalid Window parameter)
  Major opcode of failed request:  147 ()
  Minor opcode of failed request:  1
  Resource id in failed request:  0x3a00002
  Serial number of failed request:  48813
  Current serial number in output stream:  48814
^[[A^C
[alynx@pendragon:~]! % vblank_mode=0 PRIMUS_UPLOAD=1 primusrun glxgears
ATTENTION: default value of option vblank_mode overridden by environment.
ATTENTION: default value of option vblank_mode overridden by environment.
23965 frames in 5.0 seconds = 4792.905 FPS
26534 frames in 5.0 seconds = 5306.707 FPS
26583 frames in 5.0 seconds = 5316.538 FPS
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
      after 32 requests (32 known processed) with 0 events remaining.
X Error of failed request:  BadWindow (invalid Window parameter)
  Major opcode of failed request:  147 ()
  Minor opcode of failed request:  1
  Resource id in failed request:  0x3a00002
  Serial number of failed request:  83881
  Current serial number in output stream:  83882
primus: warning: dropping a frame to avoid deadlock
^C
[alynx@pendragon:~]! % vblank_mode=0 optirun glxgears
ATTENTION: default value of option vblank_mode overridden by environment.
12713 frames in 5.0 seconds = 2542.409 FPS
13112 frames in 5.0 seconds = 2622.357 FPS
13046 frames in 5.0 seconds = 2609.111 FPS
[VGL] ERROR: in readback--
[VGL]    254: Window has been deleted by window manager
[alynx@pendragon:~]! % 
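To compare runs like the ones above without reading individual samples, the FPS figures can be averaged. This is only a sketch: `average_fps` is a made-up helper name, and it assumes glxgears keeps printing lines of the form `N frames in 5.0 seconds = X FPS`.

```shell
#!/bin/sh
# Hypothetical helper: read glxgears output on stdin and print the mean of
# the "= X FPS" figures, ignoring warnings and X error lines.
average_fps() {
    awk '/frames in .* seconds = .* FPS/ { sum += $(NF - 1); n++ }
         END { if (n) printf "%.1f\n", sum / n; else print "no samples" }'
}

# Possible usage (close the glxgears window to end the run):
#   vblank_mode=0 PRIMUS_UPLOAD=1 primusrun glxgears 2>/dev/null | average_fps
```

glxgears is a very rough benchmark, so the averages are best read as a relative comparison between upload modes on one machine.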

jebrosen commented 6 years ago

I can confirm that both PRIMUS_UPLOAD=1 and PRIMUS_UPLOAD=2 work and are significantly faster than optirun on my machine as well.

The fact that PRIMUS_UPLOAD=0 means "autodetect" but both choices are working, and the following code: https://github.com/amonakov/primus/blob/d1afbf6fce2778c0751eddf19db9882e04f18bfd/libglfork.cpp#L402

suggests that it's the autodetection itself (test_drawpixels_fast) that's at fault. Note that the PRIMUS_UPLOAD=2 method is itself implemented using glDrawPixels, so there is probably something special about test_drawpixels_fast that causes it to fail.

kmahyyg commented 6 years ago

I just found that this project is quite outdated; the last commit was about 3 years ago. I'm the one who filed the bug report on the Arch Linux bug tracker. Any of the methods above will work, but they run slowly.

So I strongly suggest you downgrade your mesa or check the Arch Linux Bug Tracker.

hellbound22 commented 6 years ago

Downgrading to mesa-18.0.4-1 did work for me. But yeah, this repo needs some updates.

damian01w commented 6 years ago

This is a very frustrating bug, because primus is a very popular open-source project and now seems not to work at all with recent Xorg/Mesa updates. Dear @amonakov, are you still involved in the project? Can you help solve this bug?

kmahyyg commented 6 years ago

This bug may be caused by the interaction with the X.org server. I think the Mesa, primus, and Xorg developers may all need to talk.

ribalda commented 6 years ago

I have just sent a pull request that fixes the bug for me.

CrafterSvK commented 6 years ago

Should we make a mirror of this repo and make it official if @amonakov is no longer involved?

vezaynk commented 6 years ago

Arch users can specify which git repo to clone from when using the AUR. However, we can go forward with a long term solution if anybody wants to become the new maintainer of primus and contact the repository maintainers to transfer ownership.

amonakov commented 6 years ago

Reports indicate that this is the result of a regression in Xorg/Mesa that has to do with "DRI3 modifiers", so please try to engage more actively with Mesa/i965 developers about the issue.

Did anyone inform them of the bisection result shown on Arch forum? https://bbs.archlinux.org/viewtopic.php?pid=1789470#p1789470

In the meantime, running with PRIMUS_UPLOAD=2 should serve as a workaround as it skips the autodetection path.
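One way to apply this workaround without typing the variable each time is a small wrapper. This is a sketch, not something shipped with primus: the function names are hypothetical, and it assumes `primusrun` is on `PATH`. It forces mode 2 only when the caller has left `PRIMUS_UPLOAD` unset or at 0 (autodetect).

```shell
#!/bin/sh
# Hypothetical helper: choose the upload mode to pass to primusrun.
workaround_upload_mode() {
    case "${1:-0}" in
        0) echo 2 ;;            # unset, empty or autodetect: skip autodetection
        *) echo "$1" ;;         # respect an explicit choice such as 1
    esac
}

# Hypothetical wrapper: run primusrun with the chosen mode in its environment.
primusrun_workaround() {
    PRIMUS_UPLOAD=$(workaround_upload_mode "${PRIMUS_UPLOAD:-}") primusrun "$@"
}

# Possible usage:
#   primusrun_workaround glxgears
```

Earlier comments in this thread measured `PRIMUS_UPLOAD=1` as faster on one machine, so exporting `PRIMUS_UPLOAD=1` before calling the wrapper may be worth trying as well.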

CrafterSvK commented 6 years ago

All right. primusrun is working without any workaround on Mesa 18.1.6 on Arch Linux.
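For anyone checking whether their installed Mesa falls between the versions reported in this thread, here is a rough sketch. It rests on an assumption that goes beyond what was actually tested here: that everything up to 18.0.4 and everything from 18.1.6 on is fine, and that versions in between are suspect. Only 18.0.4 and 18.1.6 themselves were reported as working. The function names are hypothetical, and `sort -V` requires GNU coreutils.

```shell
#!/bin/sh
# version_le A B: succeeds when A sorts at or before B in version order.
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Hypothetical check based only on reports in this thread (see assumption
# above): 18.0.4 and 18.1.6 were reported working.
mesa_status() {
    if version_le "$1" "18.0.4" || version_le "18.1.6" "$1"; then
        echo "reported-ok"
    else
        echo "suspect"
    fi
}

# Possible usage on Arch, where "pacman -Q mesa" prints e.g. "mesa 18.1.6-1":
#   mesa_status "$(pacman -Q mesa | awk '{ print $2 }')"
```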

damian01w commented 6 years ago

Seems to be working fine now on ArchLinux and Debian testing up-to-date. Thanks!

vezaynk commented 6 years ago

Confirming as fixed.