JuliaGraphics / Gtk.jl

Julia interface to Gtk windowing toolkit.
Other
286 stars 80 forks source link

cascading failure causing segfault in GtkReactive tests? #562

Closed vtjnash closed 3 years ago

vtjnash commented 3 years ago

https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/by_date/2021-02/11/GtkReactive.1.7.0-DEV-50742c7e42.log

(julia:170): GLib-GObject-CRITICAL **: 06:21:42.833: g_object_ref_sink: assertion 'G_IS_OBJECT (object)' failed

(julia:170): GLib-GObject-CRITICAL **: 06:21:42.843: g_object_setv: assertion 'G_IS_OBJECT (object)' failed

(julia:170): GLib-GObject-CRITICAL **: 06:21:42.872: g_closure_new_object: assertion 'G_IS_OBJECT (object)' failed

signal (11): Segmentation fault
in expression starting at /home/pkgeval/.julia/packages/GtkReactive/gQSXq/test/runtests.jl:22
unsafe_store! at ./pointer.jl:118 [inlined]
_signal_connect at /home/pkgeval/.julia/packages/Gtk/TgEIW/src/GLib/signals.jl:40
signal_connect at /home/pkgeval/.julia/packages/Gtk/TgEIW/src/GLib/signals.jl:27 [inlined]
signal_connect at /home/pkgeval/.julia/packages/Gtk/TgEIW/src/GLib/signals.jl:27
tknopp commented 3 years ago

Is this a) reproducible and b) also happens in 1.6? I also have seen segfaults on 1.6-beta on an internal software of mine. A minimal reproducible example would be good to track this down.

giordano commented 3 years ago

Those are the tests run by PkgEval. Note that tests for nightly have been been failing forever since we switched to Github Actions: https://github.com/JuliaGraphics/Gtk.jl/actions?query=workflow%3ACI

tknopp commented 3 years ago

Hm, yes. But nightly != 1.6 right?

giordano commented 3 years ago

Yes, but I was referring to point 1, tests have been consistently failing for months. Maybe looking at them might help. Julia v1.6 isn't tested at the moment in CI (but you just need to add one line to the matrix)

tknopp commented 3 years ago

Hm, "Error: Could not find a Julia version that matches 1.6". Do you know what to enter there? 1.6-?

tknopp commented 3 years ago

tests passing locally on Julia 1.6rc1

giordano commented 3 years ago

At the moment it's ^1.6.0-0, it'll be 1.6.0, 1.6, or 1 when it'll be released

tknopp commented 3 years ago
radio: Error During Test at /home/runner/work/Gtk.jl/Gtk.jl/test/gui.jl:340
170
  Test threw exception
171
  Expression: get_gtk_property(get_gtk_property(r, :active), :label, AbstractString) == choices[2]
172
  no active elements in GtkRadioGroup

seems unrelated to the segfault though.

tknopp commented 3 years ago

ok, its the same test failure as on nightly.

tknopp commented 3 years ago

I do not get why this is failing on Julia 1.6 (CI) but not not on 1.3 (CI) and also not locally on 1.6.

tknopp commented 3 years ago

Locally I get:

julia> get_gtk_property(get_gtk_property(r,:active),:label,AbstractString)
"choice two"
tknopp commented 3 years ago

@giordano: Can you somehow identify if different Gtk binaries are used for different Julia versions?

giordano commented 3 years ago

also not locally on 1.6.

That's the fun part of running CI, isn't it? 😃

Note that you can SSH into the github actions runners for debugging with the step

    - name: Setup tmate session
      uses: mxschmitt/action-tmate@v3
      with:
        limit-access-to-actor: true

right before the run tests step (you can remove the last two lines if you don't want to use your ssh key to login). See https://github.com/marketplace/actions/debugging-with-tmate

timholy commented 3 years ago

TIL, thanks @giordano!

tknopp commented 3 years ago

Ok, I have some findings and can even reproduce the issue locally. Here is my test program:

# file is called gui2.jl here locally
using Test, Gtk
using Gtk.ShortNames, Gtk.GConstants, Gtk.Graphics

numFailures = 0
for l=1:1000
  choices = ["choice one", "choice two", "choice three", RadioButton("choice four"), Label("choice five")]
  r = RadioButtonGroup(choices,2)
  active = [get_gtk_property(b,:active,Bool) for b in r]
  if sum(active) != 1
    @show active
    global numFailures += 1
  end
end

@show numFailures

So the issue is with the calls where the vector of active elements is obtained. Sometimes (not always) I get the following run:

julia> include("gui2.jl")

(julia:6981): GLib-GObject-CRITICAL **: 20:43:05.590: g_object_ref_sink: assertion 'G_IS_OBJECT (object)' failed

(julia:6981): GLib-GObject-CRITICAL **: 20:43:05.596: g_object_get_property: assertion 'G_IS_OBJECT (object)' failed

(julia:6981): GLib-GObject-CRITICAL **: 20:43:05.596: g_object_ref_sink: assertion 'G_IS_OBJECT (object)' failed

(julia:6981): GLib-GObject-CRITICAL **: 20:43:05.596: g_object_get_property: assertion 'G_IS_OBJECT (object)' failed

(julia:6981): GLib-GObject-CRITICAL **: 20:43:05.596: g_object_ref_sink: assertion 'G_IS_OBJECT (object)' failed

(julia:6981): GLib-GObject-CRITICAL **: 20:43:05.596: g_object_get_property: assertion 'G_IS_OBJECT (object)' failed
active = Bool[0, 0, 0, 0, 0]
numFailures = 1
1

So it seems that something went wrong Gtk internally and then simply false is returned even for the active element.

tknopp commented 3 years ago

I do not get this error with Julia 1.5.

tknopp commented 3 years ago

So these sort of spontaneous errors are not the one I am capable of tracking down since this certainly touches the core of the type conversion in done in the glib module. The interesting question is what has changed from Julia 1.5 to 1.6 that can lead to these non-deterministic sort of errors. Ping @vtjnash who is much better prepared to solve this and ping @KristofferC to make you aware of this regression.

jonathanBieler commented 3 years ago

I suspect it's an issue with custom widgets since RadioButtonGroup is one (see also #557). Maybe there's some issues here ?

https://github.com/JuliaGraphics/Gtk.jl/blob/b61cb23845f5784ca6232b3f9afed40f1f91f065/src/GLib/gtype.jl#L489

timholy commented 3 years ago

@jonathanBieler, the RadioButtonGroup code doesn't use gobject_move_ref: https://github.com/JuliaGraphics/Gtk.jl/blob/d743a8bb429d6b23d92be415d71b4b6f2fda72c5/src/buttons.jl#L64

If I change the rhs to gobject_move_ref(new(layout), layout), I get this:

(julia:85406): GLib-GObject-CRITICAL **: 07:54:00.060: g_object_set_qdata_full: assertion 'G_IS_OBJECT (object)' failed

(julia:85406): GLib-GObject-CRITICAL **: 07:54:00.060: g_object_ref_sink: assertion 'G_IS_OBJECT (object)' failed

(julia:85406): Gtk-CRITICAL **: 07:54:00.075: gtk_container_add: assertion 'GTK_IS_CONTAINER (container)' failed

(julia:85406): Gtk-CRITICAL **: 07:54:00.075: gtk_container_add: assertion 'GTK_IS_CONTAINER (container)' failed

(julia:85406): Gtk-CRITICAL **: 07:54:00.076: gtk_container_add: assertion 'GTK_IS_CONTAINER (container)' failed

(julia:85406): Gtk-CRITICAL **: 07:54:00.103: gtk_container_add: assertion 'GTK_IS_CONTAINER (container)' failed

(julia:85406): Gtk-CRITICAL **: 07:54:00.118: gtk_container_add: assertion 'GTK_IS_CONTAINER (container)' failed

all immediately upon calling RadioButtonGroup(choices,2).

EDIT: the assertion 'G_IS_OBJECT (object)' failed are triggered at https://github.com/JuliaGraphics/Gtk.jl/blob/b61cb23845f5784ca6232b3f9afed40f1f91f065/src/GLib/gtype.jl#L494

and the assertion 'GTK_IS_CONTAINER (container)' failed by each push!: https://github.com/JuliaGraphics/Gtk.jl/blob/d743a8bb429d6b23d92be415d71b4b6f2fda72c5/src/buttons.jl#L74

timholy commented 3 years ago

OK, this diff gets us closer to the design suggested in https://juliagraphics.github.io/Gtk.jl/latest/manual/customWidgets/:

diff --git a/src/buttons.jl b/src/buttons.jl
index 4574ee9..e19b533 100644
--- a/src/buttons.jl
+++ b/src/buttons.jl
@@ -59,9 +59,9 @@ mutable struct GtkRadioButtonGroup <: GtkContainer # NOT a native @gtktype
     # the behavior is specified as undefined if the first
     # element is moved to a new group
     # do not rely on the current behavior, since it may change
-    handle::GtkContainer
+    handle::Ptr{GObject}
     anchor::GtkRadioButton
-    GtkRadioButtonGroup(layout::GtkContainer) = new(layout)
+    GtkRadioButtonGroup(layout::GtkContainer) = gobject_move_ref(new(unsafe_convert(Ptr{GObject}, layout)), layout)
 end
 const GtkRadioButtonGroupLeaf = GtkRadioButtonGroup
 macro GtkRadioButtonGroup(args...)
@@ -88,7 +88,7 @@ function push!(grp::GtkRadioButtonGroup, e::GtkRadioButton)
     else
         grp.anchor = e
     end
-    push!(grp.handle, e)
+    container_add!(grp.handle, e)
     grp
 end
 function push!(grp::GtkRadioButtonGroup, label, active::Union{Bool, Nothing} = nothing)
@@ -100,7 +100,7 @@ function push!(grp::GtkRadioButtonGroup, label, active::Union{Bool, Nothing} = n
     if isa(active, Bool)
         gtk_toggle_button_set_active(e, active::Bool)
     end
-    push!(grp.handle, e)
+    container_add!(grp.handle, e)
     grp
 end
 function start_(grp::GtkRadioButtonGroup)
diff --git a/src/container.jl b/src/container.jl
index a07ee41..b6609ee 100644
--- a/src/container.jl
+++ b/src/container.jl
@@ -1,9 +1,11 @@
+container_add!(w, child)    = ccall((:gtk_container_add, libgtk),    Nothing, (Ptr{GObject}, Ptr{GObject},), w, child)
+container_remove!(w, child) = ccall((:gtk_container_remove, libgtk), Nothing, (Ptr{GObject}, Ptr{GObject},), w, child)
 function push!(w::GtkContainer, child)
-    ccall((:gtk_container_add, libgtk), Nothing, (Ptr{GObject}, Ptr{GObject},), w, child)
+    container_add!(w, child)
     w
 end
 function delete!(w::GtkContainer, child::GtkWidget)
-    ccall((:gtk_container_remove, libgtk), Nothing, (Ptr{GObject}, Ptr{GObject},), w, child)
+    container_remove!(w, child)
     w
 end
 function empty!(w::GtkContainer)

Sometimes it succeeds, but I still get intermittent failures like

(julia:92982): GLib-GObject-CRITICAL **: 08:57:46.379: g_object_ref_sink: assertion 'G_IS_OBJECT (object)' failed

(julia:92982): GLib-GObject-CRITICAL **: 08:57:46.384: g_object_get_property: assertion 'G_IS_OBJECT (object)' failed

that seem to be reduced by adding various sleeps. What's the right way to do this?

vtjnash commented 3 years ago

I think you might need a GC.@preserve w in places like https://github.com/JuliaGraphics/Gtk.jl/blob/d743a8bb429d6b23d92be415d71b4b6f2fda72c5/src/buttons.jl#L115

timholy commented 3 years ago

Ho ho! I hadn't noticed that, but that's a good catch.

I added it everywhere start_ gets used in buttons.jl (there are other uses for other types but I didn't try it there). I still get intermittent errors like the above.

timholy commented 3 years ago

I think you're seriously on to something, though: if I wrap https://github.com/JuliaGraphics/Gtk.jl/blob/d743a8bb429d6b23d92be415d71b4b6f2fda72c5/test/gui.jl#L314-L342

in GC.enable(false); ... ; GC.enable(true) it seems to pass every time.

timholy commented 3 years ago

There really does seem to be something different between 1.5 and 1.6. If you put

GC.gc(true); GC.gc(true); GC.gc(true)

right after the r = RadioButtonGroup(choices,2), then on 1.6 I get failures on alternating runs. (weird) It never seems to fail on 1.5.

As disclosure, I've also commented out most steps that run prior to the "radio" tests, and I've also removed the @testset.

diff --git a/test/gui.jl b/test/gui.jl
index 024af93..e5a9b81 100755
--- a/test/gui.jl
+++ b/test/gui.jl
@@ -2,7 +2,7 @@

 using Gtk.ShortNames, Gtk.GConstants, Gtk.Graphics
 import Gtk.deleteat!, Gtk.libgtk_version, Gtk.GtkToolbarStyle, Gtk.GtkFileChooserAction, Gtk.GtkResponseType
-
+#=
 ## for FileFilter
 # This is just for testing, and be careful of garbage collection while using this
 struct GtkFileFilterInfo
@@ -309,8 +309,8 @@ set_gtk_property!(check,:label,"new label")
 showall(w)
 destroy(w)
 end
-
-@testset "radio" begin
+=#
+# @testset "radio" begin
 choices = ["choice one", "choice two", "choice three", RadioButton("choice four"), Label("choice five")]
 w = Window("Radio")
 f = Gtk.GtkBox(:v); push!(w,f)
@@ -325,7 +325,9 @@ showall(w)
 destroy(w)

 r = RadioButtonGroup(choices,2)
+GC.gc(true); GC.gc(true); GC.gc(true)
 @test length(r) == 5
+GC.enable(false)
 @test sum([get_gtk_property(b,:active,Bool) for b in r]) == 1
 itms = Vector{Any}(undef,length(r))
 for (i,e) in enumerate(r)
@@ -335,12 +337,17 @@ for (i,e) in enumerate(r)
             e[1]
         end
 end
+GC.enable(true)
 @test setdiff(choices, itms) == [choices[4],]
 @test setdiff(itms, choices) == ["choice four",]
+GC.enable(false)
 @test get_gtk_property(get_gtk_property(r,:active),:label,AbstractString) == choices[2]
+GC.enable(true)
 w = Window(r,"RadioGroup")|>showall
 destroy(w)
-end
+# end
+# GC.enable(true)
+error("stop")

 @testset "ToggleButton" begin
 tb = ToggleButton("Off")
timholy commented 3 years ago

Another observation: if I combine the above with disabling GC while the RadioButtonGroup constructor runs, then I get the assertion 'G_IS_OBJECT (object)' failed on every run. So it seems that GC needs to run during construction?

timholy commented 3 years ago

I am now onto other deadlines, but to leave this in a good state, here's a MWE:

using Revise   # curiously, this seems to be important to making this fully reproducible, though CI fails without
using Gtk, Gtk.ShortNames, Test

choices = ["choice one", "choice two", "choice three", RadioButton("choice four"), Label("choice five")]
r = RadioButtonGroup(choices,2)
GC.gc(true); GC.gc(true); GC.gc(true)
@test length(r) == 5
@test sum([get_gtk_property(b,:active,Bool) for b in r]) == 1

This reliably passes on 1.5 and reliably fails on 1.6. If you comment out the GC.gc(true); GC.gc(true); GC.gc(true) line then it seems to pass most or all the time on 1.6. The outcome is repeatable even in the same session, you don't need a restart.

Both are running Gtk 1.1.6.

timholy commented 3 years ago

I've opened an issue (see link above) that identifies https://github.com/JuliaLang/julia/pull/38180 as the probable source of the change. What's really odd is that Gtk doesn't use WeakKeyDict (lots of WeakRefs though), but there were also changes to gc.c.

vtjnash commented 3 years ago

That changed WeakRef so they must be re-set by a finalizer if it wants to keep them active. E.g.

w = WeakRef(x)
finalizer(x) do x; keepalive(x) && w.value = x; end
timholy commented 3 years ago

Sorry, I don't understand. I don't find keepalive in Base, GC, Gtk, or Gtk.GLib. Are there any docs on this?

vtjnash commented 3 years ago

I suppose I should have just said the_rest_of_the_finalizer(x)

timholy commented 3 years ago

OK. It would be ideal if someone who understands memory management in Gtk (@jonathanBieler ?) fixes this here, and @vtjnash adds the requisite NEWS entry (https://github.com/JuliaLang/julia/issues/39811#issuecomment-786599415). If @jonathanBieler can't fix it I can tackle it, but I'd like to wait for @vtjnash's NEWS entry first since the enhanced clarity will save me time (which is in short supply).

jonathanBieler commented 3 years ago

@timholy Sadly I don't understand much about memory management in Gtk.jl or Julia for that matter.

vtjnash commented 3 years ago

I don't see an obvious reason it should behave different. Someone might want to run this under rr, so you watch where the g_object_unref calls happen and see where it should be fixed.

timholy commented 3 years ago

Sorry @jonathanBieler, I had you mixed up with with @bfredl who added much of the glib stuff but who hasn't been around a lot lately.

timholy commented 3 years ago

Unfortunately, every time I've tried to use rr in the last two years I get segfaults that don't happen without it:

$ rr record ./julia-debug --args /tmp/gtk.jl
rr: Saving execution to trace directory `/home/tim/.local/share/rr/julia-debug-0'.
Segmentation fault
tknopp commented 3 years ago

( I know this does not change anything, but a big "thank you" for investigating Tim )