awesomeWM / awesome

awesome window manager
https://awesomewm.org/
GNU General Public License v2.0
6.39k stars 597 forks source link

Something left on the lua stack, lgi async #2832

Open DorianGray opened 5 years ago

DorianGray commented 5 years ago

Output of awesome --version:

awesome v4.3-314-g1116bc47 (Too long) • Compiled against Lua 5.1.4 (running with LuaJIT 2.0.5) • D-Bus support: ✔ • xcb-errors support: ✘ • execinfo support: ✔ • xcb-randr version: 1.6 • LGI version: 0.9.2

How to reproduce the issue: I hate callbacks

local lgi = require 'lgi'
local awful = require 'awful'
local awesome = require 'awesome'
local gears = require 'gears'

local function read_stream(st)
  local unix_stream = lgi.Gio.UnixInputStream.new(st, true)
  local stream = lgi.Gio.DataInputStream.new(unix_stream)

  local output = {}
  while not stream:is_closed() do
    local line, len = stream:async_read_line()
    if line then
      table.insert(output, line)
    else
      stream:async_close()
    end
  end
  return table.concat(output, '\n')
end

local function run(command, shell)
  if not shell then
    shell = false
  elseif shell == true then
    shell = awful.util.shell
  end
  local pid, _, stdin, stdout, stderr = awesome.spawn(
    shell and {shell, '-c', command} or command,
    false,
    false,
    true
  )

  if type(pid) == "string" then
    return pid
  end

  return lgi.Gio.Async.call(gears.protected_call)(read_stream, stdout)
end

run('cat /proc/net/wireless', true)

The above code works fine with no errors, but if I run the last line inside of a gears timer...

gears.timer({
  timeout=10,
  autostart=true,
  callback=function()
    run('cat /proc/net/wireless', true)
  end,
})

Actual result:


2019-07-20 23:06:29 W: awesome: a_glib_poll:451: Something was left on the Lua stack, this is a bug!
-------- Lua stack dump ---------
2: userdata     #16     0x419ac128
1: userdata     #208    0x419b3128
------- Lua stack dump end ------
2019-07-20 23:06:29 W: awesome: a_glib_poll:451: Something was left on the Lua stack, this is a bug!
-------- Lua stack dump ---------
2: userdata     #16     0x41c8d270
1: userdata     #208    0x419b3128
------- Lua stack dump end ------
2019-07-20 23:06:29 W: awesome: a_glib_poll:451: Something was left on the Lua stack, this is a bug!
-------- Lua stack dump ---------
2: userdata     #16     0x402543a8
1: userdata     #208    0x419b3128
------- Lua stack dump end ------
2019-07-20 23:06:29 W: awesome: a_glib_poll:451: Something was left on the Lua stack, this is a bug!
-------- Lua stack dump ---------
2: userdata     #16     0x41a4df80
1: userdata     #208    0x419b3128
------- Lua stack dump end ------
2019-07-20 23:06:29 W: awesome: a_glib_poll:451: Something was left on the Lua stack, this is a bug!
-------- Lua stack dump ---------
2: userdata     #16     0x400ddb78
1: userdata     #208    0x419b3128
------- Lua stack dump end ------
2019-07-20 23:06:29 W: awesome: a_glib_poll:451: Something was left on the Lua stack, this is a bug!
-------- Lua stack dump ---------
2: userdata     #16     0x4026c330
1: userdata     #208    0x419b3128
------- Lua stack dump end ------

Expected result:

I would expect no errors, but my code seems to work fine regardless.

DorianGray commented 5 years ago

PS:: lgi is latest master, as is awesome This also happens in 9.0 and 4.3, respectively

DorianGray commented 5 years ago

after leaving the code running overnight, there doesn't seem to be any memory leaking or any ill effects aside from the log spam.

psychon commented 5 years ago

Offtopic:

local awesome = require 'awesome'

The above only works on Lua 5.1. If I recall correctly, this is because of this code:

static inline void
luaA_registerlib(lua_State *L, const char *libname, const luaL_Reg *l)
{
    assert(libname);
#if LUA_VERSION_NUM >= 502
    lua_newtable(L);
    luaL_setfuncs(L, l, 0);
    lua_pushvalue(L, -1);
    lua_setglobal(L, libname);
#else
    luaL_register(L, libname, l);
#endif
}

luaL_register does more than our Lua 5.2+ alternative to it.


TL;DR: The warning itself is harmless and wrong. There was nothing left on the Lua stack. The check for "was something left on the Lua stack" has a bug.


From a quick look, the issue here is not that something was left on the Lua stack, but that the check for "was something left on the Lua stack" is broken.

I added static buffer_t buffer = BUFFER_INIT; backtrace_get(&buffer); printf("%s\n", buffer.s); next to the code that warns about something being left on the Lua stack. The result is:

/home/psychon/projects/awesome/build/awesome(backtrace_get+0x47) [0x55b4f88de4d8]
/home/psychon/projects/awesome/build/awesome(+0x13f5f) [0x55b4f88c9f5f]
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0(+0x4e136) [0x7fe41f17e136]
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0(g_main_loop_run+0xb2) [0x7fe41f17e4c2]
/usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fe41dc2d8ee]
/usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7fe41dc2d2bf]
/usr/lib/x86_64-linux-gnu/lua/5.3/lgi/corelgilua51.so(+0xa881) [0x7fe41bcf0881]
/usr/lib/x86_64-linux-gnu/liblua5.3.so.0(+0xd585) [0x7fe41e445585]
/usr/lib/x86_64-linux-gnu/liblua5.3.so.0(+0x1a355) [0x7fe41e452355]
/usr/lib/x86_64-linux-gnu/liblua5.3.so.0(+0xd998) [0x7fe41e445998]
/usr/lib/x86_64-linux-gnu/liblua5.3.so.0(+0xd9c1) [0x7fe41e4459c1]
/usr/lib/x86_64-linux-gnu/liblua5.3.so.0(+0xcd82) [0x7fe41e444d82]
/usr/lib/x86_64-linux-gnu/liblua5.3.so.0(+0xdc4b) [0x7fe41e445c4b]
/usr/lib/x86_64-linux-gnu/liblua5.3.so.0(lua_pcallk+0x74) [0x7fe41e441594]
/usr/lib/x86_64-linux-gnu/liblua5.3.so.0(+0x1ef80) [0x7fe41e456f80]
/usr/lib/x86_64-linux-gnu/liblua5.3.so.0(+0xd585) [0x7fe41e445585]
/usr/lib/x86_64-linux-gnu/liblua5.3.so.0(+0x1a355) [0x7fe41e452355]
/usr/lib/x86_64-linux-gnu/liblua5.3.so.0(+0xd998) [0x7fe41e445998]
/usr/lib/x86_64-linux-gnu/liblua5.3.so.0(+0xd9c1) [0x7fe41e4459c1]
/usr/lib/x86_64-linux-gnu/liblua5.3.so.0(+0xcd82) [0x7fe41e444d82]
/usr/lib/x86_64-linux-gnu/liblua5.3.so.0(+0xdc4b) [0x7fe41e445c4b]
/usr/lib/x86_64-linux-gnu/liblua5.3.so.0(lua_pcallk+0x74) [0x7fe41e441594]
/usr/lib/x86_64-linux-gnu/lua/5.3/lgi/corelgilua51.so(+0x9164) [0x7fe41bcef164]
/usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_closure_unix64_inner+0x19b) [0x7fe41dc2d6db]
/usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_closure_unix64+0x46) [0x7fe41dc2da56]
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0(+0x4e863) [0x7fe41f17e863]
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0(g_main_context_dispatch+0x158) [0x7fe41f17ddd8]
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0(+0x4e1c8) [0x7fe41f17e1c8]
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0(g_main_loop_run+0xb2) [0x7fe41f17e4c2]
/home/psychon/projects/awesome/build/awesome(main+0x1582) [0x55b4f88cb981]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fe41e24609b]
/home/psychon/projects/awesome/build/awesome(_start+0x2a) [0x55b4f88c9caa]

At the bottom of the stack, there is main and the stuff that comes before it. Then the main loop runs g_main_loop_run and dispatches some source (I guess: the timer fires). The timer then runs some Lua code (lots of frames in liblua). At the end, there is a call into LGI's C code again and another call to g_main_loop_run(). There aren't many calls to GLib.MainLoop() or GLib.MainLoop.new() in LGI. In fact, I am sure that this is the main loop created by Gio.Async.call().

Now we have two main loops running: The one created by Awesome and the one created by LGI. This is totally fine for GLib. However, the check for "something was left on the Lua stack" no longer makes any sense. I think it is totally fine to write some code that pushed something on the Lua stack, calls some Lua code, and then cleans up the stack. However, if this intermediate Lua code ends up starting a GLib main loop, we get some recursion and the C code in a_glib_poll does not expect such recursion.

The way around this seems to be if (g_main_depth() == 0) { /* old code here */ }. However, I am not quite sure how much of the code should be wrapped like this. Touching this code could easily break someone's config, but is that okay? For example... well, I don't know. I wasn't thinking about recursion when writing this code. There even is an assert in there which could randomly fire and kill your whole WM due to this recursion. So, the warning is just one of the possible symptoms. Additionally, this kind-of breaks the "Last main loop iteration took too long" code.

Anyway, the easiest fix would be something like this:

diff --git a/awesome.c b/awesome.c
index 413b64c90..bf91c69b5 100644
--- a/awesome.c
+++ b/awesome.c
@@ -443,6 +443,9 @@ a_glib_poll(GPollFD *ufds, guint nfsd, gint timeout)
     int saved_errno;
     lua_State *L = globalconf_get_lua_State();

+    if (g_main_depth() > 0)
+        return g_poll(ufds, nfsd, timeout);
+
     /* Do all deferred work now */
     awesome_refresh();

This just gets rid of all the extra magic that I hacked into GLib's main loop when we are in a recursive call. This means that recursive calls will stop all XCB event processing, will no longer emit the refresh signal, will no longer perform delayed work etc. This has the potential of breaking some config out there.

DorianGray commented 5 years ago

lgi is creating a separate event loop for each async call context?

I've been extending my config with more and more async contexts as I go. I have seen 0 crashes since I started, but more and more log spam. Am I just really lucky?

psychon commented 5 years ago

Creating a GMainLoop instance is totally fine. This is how e.g. GTK makes modal dialogs work: The function showing the modal dialog creates a new main loop, runs it, and stops it when it is done.

A main loop is just something that you can call :quit() on to make it exist its :run() method. The "list of event sources" is called a main context in GLib. It is totally fine to have multiple main loops all running the same main context (as long as all of that happens in the same thread - two threads running a main loop for the same main context would be bad).

The problem here is that awesome overrides GLib's poll() method with another function. This other function (a_glib_poll) then does lots of stuff in ways that do not fit with GLib's main loop. Also, this function is not written in a way that works with these recursive main loops.

So, short version: GLib is totally fine with what LGI is doing, awesome's C code needs fixing.

I have seen 0 crashes since I started, but more and more log spam. Am I just really lucky?

Well... nothing in there should "obviously" cause crashes.

The warning from awesome with "something was left on the Lua stack" is followed with lua_settop(L, 0). This can cause crashes, because it removes things from the Lua stack that something else in the call chain (LGI?) still might want to use. I bet that if you compiled Lua with assertions enabled, then this would be caught. Without it, it just happens to work as long as the garbage collector does not clean up (which is very, very unlikely to happen).

Also, there an assert(globalconf.pending_event == NULL) in a_glib_poll. This is basically "the previous event loop iteration finished handling all events that came from the X11 server". With recursive main loops, anything involving "the previous iteration finished" is dubious.

I admit that I am taking the easy way out when I say "don't do what you are doing". This kind of recursion is just something that never crossed my mind before and that can likely break lots of otherwise fine code. I guess I am afraid of ending up with an undebuggable mess involving a recursion depths of 10...

psychon commented 5 years ago

lgi is creating a separate event loop for each async call context?

https://github.com/pavouk/lgi/blob/a7e928e769778d25b73845e5fdf160216cd17070/lgi/override/Gio.lua#L66-L99

(Also compare this with Async.start just before this function; .call is basically doing the same + creating an event loop and an idle event source.)

DorianGray commented 5 years ago

I see what you're saying.

Yeah, the biggest problem with async code is the lack of a straightforward call stack...things get lost in the event loop. That said, I think callbacks are worse :man_shrugging:

DorianGray commented 5 years ago

Just for fun I tested your patch and unfortunately I'm still seeing the exact same log messages. Is it possible that glib is treating the second loop as just more events to handle, and finishing up just the same?

I'll dig deeper later, regardless, but it's almost 2am here now.

psychon commented 5 years ago

I just tested and with the code from the first post of this issue: I definitely do not see any warnings any more. And yes, I do see warnings without the patch with g_main_depth. (By the way: Thanks for writing good, self-contained bug reports!)

DorianGray commented 5 years ago

There seems to be less of the messages but I'm still getting them. My hunch is that it's from creating nested loops. My config is more complex then the case above, but I also have the beginnings of some nice async process, file, and stream management apis. I'm taking a break today but I'll see about making more progress tomorrow.

psychon commented 5 years ago

Another piece of code which does not expect recursive main loops:

https://github.com/awesomeWM/awesome/blob/b475c23fd38dc28048bfa979528fd63f512f32e3/awesome.c#L386-L423

A recursive call to a_xcb_check() could mess up the order in which events are handled. If we want to allow "semi-synchronous" code via require("lgi").GLib.MainLoop():run(), then this code has to be changed somehow. Either the mouse variable becomes static, or we do not process X11 events in recursive calls (which would basically mean "we do not allow Lua to start main loops").

DorianGray commented 5 years ago

What about only processing x events in the main loop started by awesome and ignoring the rest? Based on testing I've done, events in the first loop are still processed while the second loop still has events to process.

psychon commented 5 years ago

Yes, all of this is fixable, but it requires patches and it requires awareness about the possible reentrancy.

I guess I would fix this code by basically making its automatic variables static. That way, the recursive call will operate on the same events as the previous invocation of the code. (Well, I would rather move the variables to globalconf, but the idea with the static variables is equivalent.)