chris-montero / terra

A sane application development framework for the desktop.
116 stars 5 forks source link

Crash on resize of window #1

Closed NikolasDaynard closed 4 weeks ago

NikolasDaynard commented 1 month ago

Okay so I've been going a bit mad over here trying to build terra under Nixos. I made a shell.nix and installed the dependencies, then I tried to run the test in the shell. Then I had to patch terra because it was including "stdcolor", but I belivie it should be terra.tools.color. Anyway after that running it, it just spams this error message every time ( I assume ) a redraw happens

----------------- RUNTIME ERROR -----------------
/home/me/.luarocks/share/lua/5.1/terra/window/xcb.lua:35: attempt to index field 'lgi_cairo_surf' (a nil value)
stack traceback:
    [C]: in function '__index'
    /home/me/.luarocks/share/lua/5.1/terra/window/xcb.lua:35: in function '_window_drawing_context_destroy'
    /home/me/.luarocks/share/lua/5.1/terra/window/xcb.lua:106: in function 'draw'
    /home/me/.luarocks/share/lua/5.1/terra/window/xcb.lua:299: in function 'handle_frame_event'
    ...me/.luarocks/share/lua/5.1/terra/events/common.lua:12: in function <...me/.luarocks/share/lua/5.1/terra/events/common.lua:9>
    [C]: in function 'desktop'
    /home/me/.luarocks/share/lua/5.1/terra/app.lua:24: in function 'desktop'
    main.lua:66: in main chunk
    [C]: at 0x004062d0

So when clearing the window, it finds that window.lgi_cairo_surf is nil and errors. I checked and at creation, (to my knowledge) the assertion passes. Removing the root node stops the error.

System:

100% this could be my x11 being broken but I really don't know enough about the backend of this project to say what the issue is. I would really appreciate if you could take a look at what the issue might be.

20240809_17h56m58s_grim (left is terminal, right is terra app)

NikolasDaynard commented 1 month ago

Okay! update for my own sanity. So it appears that the dynamic resizing of hyprland windows doens't sit right with cairo, and the window height or width are nil. By setting this to 100 and 100 on nil it will actually render initially. However on a redraw? it crashes with the error.

luajit: ../src/cairo-surface.c:962: cairo_surface_destroy: Assertion `CAIRO_REFERENCE_COUNT_HAS_REFERENCE (&surface->ref_count)' failed. Aborted (core dumped)

NikolasDaynard commented 1 month ago

Alright I fixed it hackily by removing the destry in the window draw. It technically works, (though it might be leaking a ton of memory)

Okay I just checked again, the error where I have to nil check the window size only happens once on startup. Something to keep in mind

Alright so looking in the scairo destroy c funtion where the error occurs, it cleans up the ref_count is < 0 causing the assertation failure, crashing

    /* paranoid check that nobody took a reference whilst finishing */
    assert (! CAIRO_REFERENCE_COUNT_HAS_REFERENCE (&surface->ref_count));
chris-montero commented 1 month ago

Hey, thank you very much for the interest in the project.

I am not sure myself what exactly is the error, and right now I'm pressed for time, but will gladly assist you with debugging as I get time, tonight or tomorrow. It'd be great to get this working fast on Wayland, even if it's just through xwayland.

So just so I understand: you got the code to render initially if you set the window width and height to 100, 100? And then, as you resize the window, the program crashes? And yes, removing the destroy function in the window draw function will probably just keep leaking memory.

I don't have time to investigate right now, but looking in the c/src/scairo.c file is probably where I would start too. I also just pushed a patch for your issue with stdcolor.

NikolasDaynard commented 1 month ago

Ah thank you, totally understand not having a lot of time. It's not on resize it fails it's failing inconsistently every 2 or 4 calls to the resize (rarely 4). Hyprland resizes the windows a lot, but it seems to also call the resize (where pixmap isn't the same as window width / height) seemingly randomly, again could just be hyprland nonsense I don't know.

So it's crashing on the resize check but I have no idea why ref_count would be nil going into the destroy function though. I'm guessing that hyprland forcing the window to resize is making it unhappy.

Okay so I just got back from testing, I made a new function to return the number of cairo refs and it was 3 every time I called the resize callback, so my suspicion is that the failure isn't happening in the resize, but then removing the destroy makes it stop crashing. It's very weird.

EDIT: oh it's late I can't read, so somewhere in the destroy function defined in cairo-surface it just sorta loses the ref. It's not even the code erroring it's just the library somehow.

EDIT2:

The subscribing movement of the example seems to speed up the crash a LOT, (reducing draw calls from every frame)

chris-montero commented 1 month ago

Sounds very weird. I'll try to get hyprland and test this myself and come back to this issue. But the problem does seem to happen with the l/window/xcb.lua file, with the draw, _window_drawing_context_destroy and _window_drawing_context_create functions.

Until I try myself, you can do this:

  1. pull the latest commit from the master branch
  2. Then, use this sample code:
    
    #!/usr/bin/env luajit

local t_app = require("terra.app") local t_window = require("terra.window." .. t_app.get_platform())

local tt_color = require("terra.tools.color")

local to_size = require("terra.oak.size") local to_align = require("terra.oak.align")

local toe_element = require("terra.oak.elements.element")

local toeb_root = require("terra.oak.elements.branches.root") local toeb_el = require("terra.oak.elements.branches.el")

local toel_bg = require("terra.oak.elements.leaves.bg") local toel_text = require("terra.oak.elements.leaves.text")

local function init_app(app)

local model = {}
app.model = model

-- create the window
model.main_window = t_window.create(app, 320, 420, 200, 160, {

    -- override the window's default draw function
    draw = function(window)
        if window.should_draw == false then return end
        local drawn = t_window.draw(window)
        if drawn then 
            window.should_draw = false
        end
    end,
    should_draw = true,

    -- subscribe to get mouse click events on window
    subscribe_on_self = {
        ["MouseClickEvent"] = function(self, is_press, button, modifiers, x, y)
            if is_press == false then return end

            -- mark the UI root for redraw. will get redrawn 
            -- in the next frame.
            toe_element.mark_redraw(self.tree)
            -- mark the window for redraw
            self.should_draw = true
        end
    },

    tree = toeb_root.new({ -- the root of the UI tree
        toeb_el.new({ -- the background of the window
            width = to_size.FILL,
            height = to_size.FILL,
            bg = toel_bg.new({
                source = tt_color.rgb(0.17, 0.42, 0.21), -- green
            }),
            toeb_el.new({ -- the red ball
                halign = to_align.CENTER,
                valign = to_align.CENTER,
                width = 60,
                height = 60,
                bg = toel_bg.new({
                    source = tt_color.rgb(0.8, 0.1, 0), -- red
                    border_radius = 30,
                }),
                subscribe_on_root = { 
                    ["AnimationEvent"] = function(self, time)
                        local spin_push = 20
                        self.offset_x = math.sin(time) * spin_push
                        self.offset_y = -math.cos(time) * spin_push
                    end
                },
                -- declaratively subscribe to signals on elements (very powerful feature)
                toel_text.new({ -- the "hello" text
                    family = "Roboto",
                    size = 11,
                    width = 40, -- constrain the textbox so the text wraps
                    halign = to_align.CENTER,
                    valign = to_align.CENTER,
                    text = "hello world",
                    fg = tt_color.rgb(1, 1, 1),
                })
            }),
        }),
    }),
})

t_window.request_raise(model.main_window)

end

t_app.desktop(init_app, t_app.make_default_event_handler(function(app, event_type, ...) end))



This will only redraw the window once when creating the widow, and afterwards only you click on it. The data of the red ball will keep "spinning", the window will just not get drawn until you click on it.
This way you're in control of when the drawing happens.

3. From here, I would try to resize the window, click on it to make it redraw, and print out all values in the `_window_drawing_context_destroy` and `_window_drawing_context_setup` functions and see if anything is broken.
NikolasDaynard commented 1 month ago

Okay this is super weird,

        print("resize")
        _window_drawing_context_destroy(window)
        _window_drawing_context_setup(window, window_geom.width, window_geom.height)t
        print("resize end")

I replaced the drawing resize if with this, and this actually reaches the resize end before crashing with the error in the destroy function. I think the error is on a separate thread, making debugging pretty awful. Do you know where the tree:draw is defined? I can't find it anywhere.

Removing tree:draw() ~gets rid of the crash though~ makes the crash less likely, (threading issue?).

So it 's not crashing on the resize callback, now that I have to on click, but commenting out the destroy in the resize function does stop the crash. So I think that somehow the destroy is breaking it somehow and then randomly on the draw call, it will notice this and panic?

notes

chris-montero commented 1 month ago

The thing that's most unclear to me is why the lgi_cairo_surf value is nil in the very first picture you posted. That should never happen. If it does, there's some error that I don't think is coming from Terra.

Don't try to hackily fix it by removing the _window_drawing_context_destroy function. Instead, I would start with the master branch's latest commit, go in _window_drawing_context_setup, and try to print all values there. Start the program, click on the window to draw it once, then resize, click again and see what it prints. Then, if lgi_cairo_surf is nil, that's the point from which to start investigating.

By the way, are you using luaJIT? I only plan on supporting luajit.

In any case, this error seems very weird. I have to get wayland and I'll try to replicate this issue and respond as soon as possible.

Also, there should not be any threading issue from the side of Terra, as lua is single threaded, and my C code doesn't do any multi-threading.

NikolasDaynard commented 1 month ago

So the whole reason that lgi_cairo_surf was nil was because the height received window:get_geometry() initally was nil. Though, now, (of course) I can't seem to reproduce lgi_cairo_surf being nil because of it. It seems like pulling down the latest changes fixed this (need to check). But all the numbers look totally right in the creation it just seems like at some point in the drawing it just sorta gives up for (seemingly) no reason.

pix: 6291481
cairo_surf_ptr: 18773024
lgi.rec 0x011e7420:cairo.Surface
lgi.rec 0x011e7870:cairo.Context

I am using luajit. Apologies for forgetting to pull the code earlier I was messing up the merge and it slipped me.

chris-montero commented 1 month ago

the height received window:get_geometry() initally was nil.

That should never be nil. I just checked and you're correct. There was a bug with setting window geometry upon creation here. I just pushed a patch for this.

Also, I couldn't get hyprland because I'm on void and I won't manually compile it, but I got sway to work, tested my weather app and tried to stress test the app by resizing it constantly and quickly, and after a few seconds I got an error about how the tried to draw onto an already-finished cairo surface (which should never happen. I explicitly designed the code to never have that problem.), and after a few seconds of continuing to resize the window, it crashed with a segfault.

I tried to open the application again, resize it, and after a while it just segfaulted again.

I tested this code on xfce to see if this constant resizing would crash there as well, and it does, but after more time. I again, get a segfault. Sometimes I get a cryptic "corrupted double-linked list (not small)" error message.

Since this issue happens on both x11 and wayland, it's probably an issue with terra.

I have the coredump file generated from this segfault, but I don't yet know how to use it for debugging. I've never had to debug lua programs that import a ".so" file.

If you know how to progress in debugging this, your help will be greatly appreciated. If not, I'll leave this issue open until I find out myself how to fix it.

xezo360hye commented 1 month ago

Please change the title to something like "Apps crash on resize", as we know that the bug is present on all WMs

chris-montero commented 1 month ago

Ok, I just pushed changes to a new branch: test_fix_for_resize_crash.

This simplifies the drawing code a bit and doesn't create and destroy pixmaps every time a window is resized.

I tested this code on my machine and it seems like the crash doesn't happen anymore. Please test this on your system as well and let me know if the crash still happens.

chris-montero commented 1 month ago

Also, leaving this for future reference: I found out how to use gdb to debug a lua program that imports .so files.

  1. gdb
  2. (gdb) set debug jit 1
  3. (gdb) target exec /usr/bin/luajit
  4. (gdb) run ./main.lua
xezo360hye commented 1 month ago

I tested this code on my machine and it seems like the crash doesn't happen anymore. Please test this on your system as well and let me know if the crash still happens.

Confirm from KDE Plasma. Tried moving around screens, resizing, maximize, fullscreen, all good. Also how about some kind of debug parameter for the library so we can get logs in the future without modifying the code, and only passing an option (or maybe even runtime environment variable)?

chris-montero commented 1 month ago

Ok, just authored a new branch: fix_for_resize_crash_properly

This uses pixmaps to draw, then copies the contents to the window when the drawing is done. When the window size changes, it just creates a new pixmap and sets it to the cairo surface, thereby no longer requiring to destroy everything and re-create it, which means in theory it should also be faster. Using pixmaps is also better because no flickering happens, unlike drawing to the window directly.

I tested this on my computer and no crash happened after heavily stress testing the application by resizing it a lot.

Please test these changes on your computers as well, and if the issue no longer happens, I'll merge the changes and close this issue.

And yes, @xezo360hye you're right. The C code already has some macros around like #ifdef DEBUG // etc, and you can see that commented out in the rockspec file, but I do want some better infrastructure for debugging, especially for issues like these.

However, as long as there's no more serious issues like this one, I don't plan on focusing on the C code for a while. I want to work and solidify the lua api exposed to the user as quickly as possible, and then I'll focus on optimizing and improving the C code.

xezo360hye commented 1 month ago

I tested this on my computer and no crash happened after heavily stress testing the application by resizing it a lot.

Yeah no crashes too but it constantly spams errors like below and doesn't show anything in the window

XCB ERROR:
        error_code: 143
        sequence: 3162
        resource_id: 18874379
        minor_code: 23
        major_code: 139
XCB ERROR:
        error_code: 9
        sequence: 4789
        resource_id: 18874377
        minor_code: 0
        major_code: 62
NikolasDaynard commented 1 month ago

The new code works completely on my computer, but I'm not getting XCB ERROR codes, it's working fine. I was getting those earlier when trying to do a workaround on the crash earlier but even then it was still rendering.

chris-montero commented 1 month ago

@xezo360hye if you look here, uncomment that, and you should get much more human readable XCB errors printed out. Compile it with that and tell me what it says then. Maybe you don't have some library installed or something.

chris-montero commented 1 month ago

@NikolasDaynard sick, if we can fix @xezo360hye 's issue too I'm closing this issue

xezo360hye commented 1 month ago

Ok first of all I'm sorry for the delay I completely forgot about this project's existence. When I run it with debug option it shows this on stdout:

Found depth_iterator with depth 24. Skipping.
Found depth_iterator with depth 1. Skipping.
Found depth_iterator with depth 4. Skipping.
Found depth_iterator with depth 8. Skipping.
Found depth_iterator with depth 15. Skipping.
Found depth_iterator with depth 16. Skipping.
Successfully found depth_iterator with depth 32.
default gc window id: 18874369
override_redirect: 0
NO NEED TO DRAW BECAUSE NOT SHOWING

On stderr though there are many different things. First this error appeared 5 times:

----------------- RUNTIME ERROR -----------------
/home/andrey/.luarocks/share/lua/5.1/terra/window/xcb.lua:64: bad argument #2 to 'create' (number expected, got nil)
stack traceback:
        [C]: at 0x7f6355722f40
        [C]: in function 'create'
        /home/andrey/.luarocks/share/lua/5.1/terra/window/xcb.lua:64: in function '_window_drawing_context_update'
        /home/andrey/.luarocks/share/lua/5.1/terra/window/xcb.lua:85: in function 'draw'
        /home/andrey/.luarocks/share/lua/5.1/terra/window/xcb.lua:267: in function 'handle_visibility_event'
        /home/andrey/.luarocks/share/lua/5.1/terra/events/xcb.lua:111: in function </home/andrey/.luarocks/share/lua/5.1/terra/events/xcb.lua:109>
        [C]: in function 'desktop'
        /home/andrey/.luarocks/share/lua/5.1/terra/app.lua:24: in function 'desktop'
        test.lua:66: in main chunk
        [C]: at 0x00406340

After it appears this one just a single time:

----------------- RUNTIME ERROR -----------------
/home/andrey/.luarocks/share/lua/5.1/terra/element.lua:27: attempt to compare number with nil
stack traceback:
        [C]: in function '__lt'
        /home/andrey/.luarocks/share/lua/5.1/terra/element.lua:27: in function 'contains_point'
        .../.luarocks/share/lua/5.1/terra/oak/elements/internal.lua:179: in function 'element_recursively_get_elements_under_point'
        .../.luarocks/share/lua/5.1/terra/oak/elements/internal.lua:203: in function 'element_get_approved_mouse_hit_elements'
        ...rocks/share/lua/5.1/terra/oak/elements/branches/root.lua:112: in function 'handle_mouse_enter_event'
        /home/andrey/.luarocks/share/lua/5.1/terra/window/xcb.lua:187: in function 'handle_mouse_enter_event'
        /home/andrey/.luarocks/share/lua/5.1/terra/events/xcb.lua:53: in function </home/andrey/.luarocks/share/lua/5.1/terra/events/xcb.lua:51>
        [C]: in function 'desktop'
        /home/andrey/.luarocks/share/lua/5.1/terra/app.lua:24: in function 'desktop'
        test.lua:66: in main chunk
        [C]: at 0x00406340

At the same time it starts sending

X error: request - FreePixmap (major 54, minor 0); error - (4) Pixmap; resource_id - (18874377)

which is then replaced by the following messages repeated (in order):

X error: request - Render-CreatePicture (major 139, minor 4);   error - (9) Drawable;   resource_id - (18874377)
X error: request - Render-FillRectangles (major 139, minor 26);         error - (143) Render-Picture;   resource_id - (18874379)
X error: request - Render-FillRectangles (major 139, minor 26);         error - (143) Render-Picture;   resource_id - (18874379)
X error: request - Render-Trapezoids (major 139, minor 10);     error - (143) Render-Picture;   resource_id - (18874379)
X error: request - Render-CompositeGlyphs8 (major 139, minor 23);       error - (143) Render-Picture;   resource_id - (18874379)
X error: request - Render-CompositeGlyphs8 (major 139, minor 23);       error - (143) Render-Picture;   resource_id - (18874379)
X error: request - CopyArea (major 62, minor 0);        error - (9) Drawable;   resource_id - (18874377)

and they continue until I kill the test program

chris-montero commented 4 weeks ago

@xezo360hye ok, I just merged the changes from the resize crash branch, along with some extra changes I made over the past few days.

Your problem seems like it stems from the fact that for some reason the window does not have its width/height set properly. Please switch to the master branch, pull all the changes, and test again with the example program I provided in the README. I'm closing this issue as your problem seems to be different than the original problem.

If the issue keeps happening, feel free to open up another issue and I'll help you there.