ControlSystemStudio / cs-studio

Control System Studio is an Eclipse-based collections of tools to monitor and operate large scale control systems, such as the ones in the accelerator community.
https://controlsystemstudio.org/
Eclipse Public License 1.0
111 stars 96 forks source link

Fatal crash in databrowser on Linux #2660

Open willrogers opened 3 years ago

willrogers commented 3 years ago

Having built CS-Studio against the latest Eclipse version, we have a serious problem that I have not seen before.

Simply opening the databrowser causes CS-Studio to disappear immediately, leaving only the following error:

[xcb] Unknown request in queue while dequeuing
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been called
[xcb] Aborting, sorry about that.
java: xcb_io.c:165: dequeue_pending_request: Assertion `!xcb_xlib_unknown_req_in_deq' failed.

@kasemir I realise that you won't have much interest in fixing this, but do you have any ideas as to where we might look? I expect the problem is somewhere in org.csstudio.swt.rtplot.

kasemir commented 3 years ago

The same happens in phoebus, https://github.com/ControlSystemStudio/phoebus/issues/705

With phoebus, I've occasionally seen it on startup. Googling around led me to believe that it can be caused by calling UI code off the UI thread, but unclear where that would be taking place.

GuillemCastro commented 3 years ago

We had the same error in ITER. Every time we'd open the databrowser CS-Studio would crash.

Debugging I found that these lines aren't executed in a UI thread. We solved it by wrapping them in a Display.syncExec.

kasemir commented 3 years ago

The whole point of the data browser is that the image is created in a background thread, so moving updateImageBuffer() to the UI thread means the UI is slowing down as you plot more and more data. In the phoebus/JavaFX version, there were occasional issues with obtaining an in-memory image buffer, so the code to allocate an in-memory image buffer moved to the UI thread, but the code which then keeps re-using that buffer to draw into the buffer runs in background threads.

willrogers commented 3 years ago

@GuillemCastro lucky you saw this ticket, thanks. That gives us a lot more information.

willrogers commented 3 years ago

We have tried putting updateBufferImage in syncExec and it certainly appears to fix the problems we are seeing. I can't yet assess the performance impact.

I can't find any way to make a smaller change that fixes the problems. The code as it is will fail in a number of ways and even if I change updateBufferImage just to draw a small red square, CS-Studio will often deadlock if I play around for a while.

willrogers commented 3 years ago

See https://github.com/dls-controls/cs-studio/commit/a18f978f9e3fb78a695db37c29991d01e053cd12 for that change.

kasemir commented 3 years ago

Can you tell where it's locking/failing? Is it the initial new Image call? In the JavaFX version, the comparable call to get an in-memory image was problematic and had to happen on the UI thread, so some https://github.com/ControlSystemStudio/phoebus/blob/master/core/ui/src/main/java/org/phoebus/ui/javafx/BufferUtil.java is used to handle that, and the rest of the actual drawing code can then run in the background.

willrogers commented 3 years ago

I made a minimal copy of your Phoebus code and then caught a deadlock in the debugger, you can see quite a lot in this image. It appears to have locked at the display.syncExec(), oddly enough, before it tries to create the image. I didn't set a breakpoint, I waited until it locked up and then suspended the thread to see the stack trace.

I notice that there are two "RTPlotUpdateThrottle" threads, I don't know whether that is correct.

deadlock

kasemir commented 3 years ago

For what it's worth, in the current phoebus incarnation the "RTPlotUpdateThrottle" is different. There's just one global "UpdateThrottle" thread (since there's also just one UI, after all), and an "RTPlot" pool sized to match the CPU core count. The images are also created in a double-buffer scheme, i.e. 2 per plot which are then re-used. So it's all quite different and doesn't help at all to explain what you're seeing in the SWT code.

I can't tell what's blocking here, i.e. who's holding the lock, but I think the approach could be changed. 'sync' exec blocks, so with that you wouldn't even need a CompletableFuture. I would use asyncExec, and then actually utilize the CompletableFuture to await the result:

// On background thread
// TODO Add some check to assert that we _are_ on a background thread
CompletableFuture<BufferUtil> result = new CompletableFuture<>();

// On display thread, at some time in the near future, create the buffer
Display.asyncExec( () -> result.complete(new BufferUtil(...    );

// Back off the display thread, await the result
BufferUtil buffer = result.get(500, TimeUnit.MILLISECONDS);
kasemir commented 3 years ago

The way you use syncExec, you might as well put the result in an AtomicReference:

AtomicReference <BufferUtil> result = new AtomicReference <>();

// On display thread, at some time in the near future, create the buffer
Display.syncExec( () -> result.set(new BufferUtil(...    );

// As we return from the blocking syncExec, result has been set.
// ... or we block for a while, because blocking is what syncExec is meant to do
BufferUtil buffer = result.get();
willrogers commented 3 years ago

Aha yes AtomicReference will be what I want.

kasemir commented 3 years ago

Actually, I'd try asyncExec to decouple the threads.

kasemir commented 3 years ago

The 'syncExec' can block forever (or until the next power outage). With asyncExec, the CompletableFuture.get(..) will time out after 500ms, so you skip one update, but you're not stuck.

willrogers commented 3 years ago

I didn't actually understand that asyncExec() was also set to run on the UI thread, I was thinking that it meant run a task in the background. I guess that's a fairly fundamental misunderstanding. Anyway, now I have moved it to an asyncExec as you suggest and it tells me a bit more - once the UI locks up it is indeed timing out after 500ms every time, presumably because the UI thread is blocked. I assume that Thread [main] is the UI thread.

willrogers commented 3 years ago

It seems that any use of the GC in a background thread is causing the problems.

kasemir commented 3 years ago

Bummer. It's certainly possible that the latest version of SWT limits all access to the UI thread, i.e. can no longer use SWT to prepare images in the background. So you'll have to run updateImageBuffer() in the UI thread. Or use databrowser3, or phoebus, which actually draw in the background with AWT.

willrogers commented 3 years ago

I asked on an Eclipse mailing list but it hasn't got through moderation, and I asked on the Eclipse forums but I haven't had an answer.

I think we are going to have to move more processing to the UI thread, and then revisit this either if we see performance problems or if I find out any more information about SWT.

kasemir commented 3 years ago

Not sure if there is an officially blessed SWT example for preparing images in background threads, then loading them on the SWT thread. There are various examples, like https://www.eclipse.org/forums/index.php/t/168168/ first running into Invalid thread access, then followed by an example that basically does offScreenImage = new Image(display, 400, 300); ... draw ... in a background thread, ending in display.asyncExec(... show offScreenImage ..), which is what we've been doing.

But there is no such example in the official https://www.eclipse.org/swt/snippets/

willrogers commented 3 years ago

I have opened an Eclipse bug here: https://bugs.eclipse.org/bugs/show_bug.cgi?id=568859. Here is a relevant reply from the mailing list.

Hi Will,

In general most of SWT code expects to be executed on UI (main) thread only, but you seem to work on non UI thread? If you can provide a standalone reproducer example, please create bug for that, we should discuss on bug. I guess the new behavior is related to removing locks around native GTK calls. See https://bugs.eclipse.org/bugs/show_bug.cgi?id=546743#c17 https://git.eclipse.org/r/c/platform/eclipse.platform.swt/+/155907 https://git.eclipse.org/r/c/platform/eclipse.platform.swt/+/155989 .

Kind regards, Andrey Loskutov

kasemir commented 3 years ago

Try Snippet141 from https://www.eclipse.org/forums/index.php/t/168168/ :

new Thread("Animation")
{
                        public void run()
                        {
                            /* Create an off-screen image to draw on, and fill it with the shell background. */
                            Image offScreenImage = new Image(display, 400, 300);
                            GC offScreenImageGC = new GC(offScreenImage);
....

combined with the fix suggested further down in that thread to then use getDisplay().asyncExec(.. to place that image created in a background thread in the UI.

willrogers commented 3 years ago

It seems that there has been a bug introduced that causes problems when creating an Image on a background thread. I expect that I will need to keep the workaround until that bug is fixed.