Ferada / cl-cffi-gtk

#cl-cffi-gtk on Freenode. A Lisp binding to GTK+3. SBCL/CCL/ABCL (ECL/CLISP unstable)
http://www.crategus.com/books/cl-cffi-gtk

Thread-safety issues in gobject #54

Open lokedhs opened 3 years ago

lokedhs commented 3 years ago

While working on a GTK backend for McCLIM, I have been hitting regular deadlocks. After investigating the root cause, I believe I understand what is going on. However, a fix is complicated, which is why I'm opening this issue: I'd like to fill in some of the blanks in my understanding before I start working on it.

The deadlock happens because I am creating gobject instances in one thread (in this case the repl thread). This results in *foreign-gobjects-lock* being held. While this lock is being held, it then tries to acquire *gobject-gc-hooks-lock*. However, this second lock is already held by the GTK thread.

Now, while *gobject-gc-hooks-lock* was held by the GTK thread, the finaliser kicked in, and the first thing the finaliser tries to do is to acquire *foreign-gobjects-lock* which is already held by the repl thread, resulting in a deadlock.
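The sequence described above is a lock-order inversion: two threads take the same two locks in opposite orders. Here is a minimal sketch of that pattern in Python (not cl-cffi-gtk code). The names `foreign_gobjects_lock` and `gc_hooks_lock` are stand-ins for `*foreign-gobjects-lock*` and `*gobject-gc-hooks-lock*`, and `run_inversion` is a hypothetical helper. A timeout is used on the second acquire so the demo reports the stall instead of hanging forever.

```python
import threading

# Stand-ins for *foreign-gobjects-lock* and *gobject-gc-hooks-lock*.
foreign_gobjects_lock = threading.Lock()
gc_hooks_lock = threading.Lock()


def run_inversion(timeout=0.2):
    """Return the names of the threads that could not get their second lock."""
    barrier = threading.Barrier(2)
    blocked = []

    def worker(name, first, second):
        with first:
            # Wait until both threads hold their first lock.
            barrier.wait()
            if second.acquire(timeout=timeout):
                second.release()
            else:
                # With a plain blocking acquire this would be a deadlock.
                blocked.append(name)
            # Keep holding `first` until the other thread has finished trying.
            barrier.wait()

    threads = [
        # REPL thread: creates an object, then registers a GC hook.
        threading.Thread(
            target=worker, args=("repl", foreign_gobjects_lock, gc_hooks_lock)
        ),
        # GTK thread: holds the hooks lock when the finaliser kicks in.
        threading.Thread(
            target=worker, args=("gtk", gc_hooks_lock, foreign_gobjects_lock)
        ),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(blocked)


if __name__ == "__main__":
    print(run_inversion())  # ['gtk', 'repl']
```

Both threads time out because each is waiting for the lock the other one holds. Without the timeout, neither would ever make progress.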

The simplest workaround I can think of, which I haven't tried yet, is to merge these two locks into a single one. This should fix the most common cause of this issue. However, it's not a proper solution, since the same deadlock could happen with any lock that is held while the finaliser runs. *gobject-gc-hooks-lock* just happens to be the most common one, since it's used very often.
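A rough sketch of the merged-lock idea, again in Python rather than Lisp. The single `gobject_lock` and all function names are hypothetical. It assumes the merged lock has to be reentrant, because code that already holds it (object creation, or the GC hook runner) calls into other code that takes it again on the same thread.

```python
import threading

# Hypothetical single lock replacing both of the original locks.
gobject_lock = threading.RLock()
log = []


def register_gc_hook(pointer):
    with gobject_lock:  # nested acquire when called from create_gobject
        log.append(("hooked", pointer))


def create_gobject(pointer):
    with gobject_lock:
        log.append(("created", pointer))
        register_gc_hook(pointer)


def finaliser(pointer):
    with gobject_lock:  # nested acquire when called from run_gc_hooks
        log.append(("finalised", pointer))


def run_gc_hooks(pointers):
    with gobject_lock:
        for pointer in pointers:
            finaliser(pointer)


def demo():
    """Run both code paths concurrently; return True if neither thread got stuck."""
    repl = threading.Thread(
        target=lambda: [create_gobject(p) for p in range(100)]
    )
    gtk = threading.Thread(target=run_gc_hooks, args=(range(100, 200),))
    repl.start()
    gtk.start()
    repl.join(timeout=5)
    gtk.join(timeout=5)
    return not repl.is_alive() and not gtk.is_alive()
```

With only one lock there is no ordering to get wrong between these two paths. As noted above, this does not help if the finaliser runs while some other, unrelated lock is held.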

The ideal solution would be to get rid of the lock in the finaliser. This is where I am not able to suggest a solution since I don't fully understand the architecture.
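One possible direction, sketched in Python under stated assumptions and not taken from cl-cffi-gtk: the finaliser takes no lock at all and only pushes the pointer onto a thread-safe queue, and a single known thread (for example the GTK main loop) later drains the queue and does the locked bookkeeping. The names `registry`, `registry_lock`, `pending_release`, `finaliser` and `drain_pending` are all hypothetical. Whether the real finaliser can defer its work like this depends on what it has to do with the object.

```python
import queue
import threading

registry_lock = threading.Lock()       # stand-in for *foreign-gobjects-lock*
registry = {}                          # hypothetical pointer -> wrapper table
pending_release = queue.SimpleQueue()  # hand-off from finaliser to the drainer


def register(pointer, wrapper):
    with registry_lock:
        registry[pointer] = wrapper


def finaliser(pointer):
    # May run at an arbitrary point, even while registry_lock is held.
    # It never takes a lock; it only enqueues the pointer.
    pending_release.put(pointer)


def drain_pending():
    """Call from one known thread; returns the pointers that were released."""
    released = []
    while True:
        try:
            pointer = pending_release.get_nowait()
        except queue.Empty:
            break
        with registry_lock:
            registry.pop(pointer, None)
        released.append(pointer)
    return released
```

The point of the design is that the finaliser can no longer participate in a lock cycle, because it never waits on anything. The cost is that cleanup is delayed until the next drain.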

Another issue I have noted is that a lot of the global variables that control these things are accessed without holding any locks. This can lead to corrupt data (especially on non-Intel architectures, which have much more relaxed cache guarantees).

Have these issues been discussed in the past?

lokedhs commented 3 years ago

I think I should also add some details explaining why I see these issues so often when, for most use-cases, getting a deadlock would be very rare.

In McCLIM, you can have multiple "application threads" (each application frame has its own). Rendering is done from the application thread, which is obviously incompatible with GTK's requirement that any GTK operations are performed on the GTK thread. The way I have solved this is to have one cairo image-surface for each window, and the rendering from the McCLIM application thread only uses cairo to draw into the image. The redraw event in GTK copies the content of the image to the window.

The one operation which causes most of the gobject allocations from the McCLIM application threads is font drawing. I suspect it's all the pango calls that cause these deadlocks to happen with some regularity.

Ferada commented 3 years ago

FWIW yes, there are issues. Have they been discussed? Perhaps; it's not like I'm the first to work on this. I'm going to play around with the locks. In general I'd also rather have fewer of those things around, as I've also run into numerous unexplained lockups that I suspect are threading-related.