ControlSystemStudio / phoebus

A framework and set of tools to monitor and operate large scale control systems, such as the ones in the accelerator community.
http://phoebus.org/
Eclipse Public License 1.0

Linux/NVIDIA desktop lockup unless using prism.order=sw #512

Closed: kasemir closed this issue 5 years ago

kasemir commented 5 years ago

Adding to the Linux woes in #353 and #367: we observe the following on RHEL 7.6 computers with NVIDIA graphics cards (nvidia-smi reports driver version 340.107).

On such a computer, running multiple copies via ssh or a ThinLinc remote desktop is fine. Starting multiple instances at the physical desktop causes the second, third or fourth copy to be very slow. That sluggish instance does not even need to be executing a display. Simply trying to open the "File" menu will

When attaching JProfiler, the JVM does not report much CPU use at those times. Instead, the UI thread is blocked in GlassScene.waitForRenderingToComplete(), which basically calls into the GTK graphics library.
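For anyone without JProfiler, roughly the same check can be sketched with a plain thread dump from the JDK's jstack tool. The helper name `count_render_waits` is made up for this example, and it only counts stack frames that mention the method named above; it does not prove the thread is stuck rather than just passing through.

```shell
# Count stack frames in a thread dump (read from stdin) that sit in
# GlassScene.waitForRenderingToComplete.
# count_render_waits is a hypothetical helper name, not part of Phoebus or the JDK.
count_render_waits() {
    # -F: match the method name literally, -c: print the number of matching lines
    grep -c -F 'GlassScene.waitForRenderingToComplete'
}
```

Usage, assuming `$pid` holds the process id of the sluggish instance: `jstack "$pid" | count_render_waits`. Taking a few dumps some seconds apart and seeing a non-zero count each time would be consistent with the blocked UI thread described above.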

Sometimes there are these types of messages in /var/log/messages suggesting a graphics driver problem:

kernel: NVRM: GPU at PCI:0000:01:00: GPU-95a676e2-3d89-3607-cd49-b7ad9d23f9f8
kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0005, Class 00005039, Offset 00000100, Data 00000000
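A quick way to look for such messages is sketched below. The function names are invented for this example, the regular expression is only modelled on the two lines quoted above, and reading /var/log/messages usually requires root.

```shell
# Print NVIDIA Xid reports from a syslog-style file.
# Defaults to /var/log/messages as in this issue; pass another path as $1.
# xid_events and xid_summary are hypothetical helper names.
xid_events() {
    log="${1:-/var/log/messages}"
    grep -E 'NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): [0-9]+' "$log"
}

# Summarize how often each Xid code (the number after the PCI address) occurs.
# Output is in `uniq -c` format: count, then Xid code.
xid_summary() {
    xid_events "$1" \
        | sed -E 's/.*NVRM: Xid \([^)]*\): ([0-9]+),.*/\1/' \
        | sort -n | uniq -c
}
```

Usage: `sudo sh -c '. ./xid.sh; xid_summary'`, where `xid.sh` is whatever file the functions were saved to.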

After killing the problematic instance of phoebus, the X server might be stuck using a CPU core, and the only way to fix it is:

sudo systemctl restart display-manager

Adding this to the JVM options, i.e. disabling the accelerated graphics that the NVIDIA driver should in principle support, seems to avoid the issue:

-Dprism.order=sw
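One way to apply the option only on affected machines is a small launcher wrapper, sketched here. `PHOEBUS_FORCE_SW` is a made-up environment variable for this example, not something Phoebus defines, and the jar name in the usage line is a placeholder.

```shell
# Emit the extra JVM option when software rendering is requested.
# PHOEBUS_FORCE_SW is a hypothetical switch introduced for this sketch.
phoebus_jvm_opts() {
    if [ "${PHOEBUS_FORCE_SW:-0}" = "1" ]; then
        # Select the JavaFX software pipeline instead of the hardware-accelerated one
        printf '%s\n' "-Dprism.order=sw"
    fi
}
```

Usage: `PHOEBUS_FORCE_SW=1; export PHOEBUS_FORCE_SW; java $(phoebus_jvm_opts) -jar product.jar`. Adding `-Dprism.verbose=true` as well should make JavaFX print which pipeline it actually selected, which helps confirm the setting took effect.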

kasemir commented 5 years ago

We no longer see the issue with RHEL 7.6 and updated NVIDIA hardware and drivers, where nvidia-smi reports NVIDIA-SMI 418.56, Driver Version: 418.56, CUDA Version: 10.1 with a Quadro P600.

shroffk commented 5 years ago

Great!! Should this ticket be closed?