cmsj commented 3 years ago

So, this is not the first time I've decided I'm determined to get to the bottom of why we always have some "weird" crash reports, and it probably won't be the last time, but I'd like to discuss it again anyway.

Tagging @asmagill for his excellent thoughts.

Here are the "weird" crash reports received thus far for 0.9.84:

Here are some evidence points I have so far:

They all start with an OS callback that leads to us calling protectedCallAndTraceback. There are a couple of others that happen during the first load of init.lua, but we'll ignore those for now.
Generally speaking they are all something to do with pointers not being what Lua is expecting (either NULL or garbage), which strongly suggests (imo) that this is some kind of lifecycle issue - ie they're storing a reference to something that's been subsequently deallocated.
We don't get very many of these - in the week since 0.9.84 was released these particular crashes only number around 20, out of thousands of user sessions. Sentry doesn't break down a ton of detail here (in part because we deliberately don't enable active session tracking), but they report well over 99% of sessions are crash free. Still, it annoys me to know that these crashes all hint at some kind of lifecycle bug we haven't found yet.
Looking at the backtraces, they mostly fall into either hs.timer or hs.eventtap callbacks, but there are others, which suggests to me that it's not a specific bug in either of those modules, but rather a systemic flaw.
We have protections against (at least):
- trying to call Lua off the main thread
- lua_State having been replaced since the callback was set up, at least in hs.timer (via the relatively new checkLuaSkinInstance in 0.9.83)
- Lua stack overgrowth via the _lua_stackguard macros
Looking at the breadcrumb logs from the crashes, it's not generally the case that the crash happens imminently after a LuaSkin reload
Because they all start from C, they call [LuaSkin sharedWithState:NULL] which guarantees them the main Lua State, they're not coming in on a co-routine state (this kind of crash report predates co-routines anyway AFAIK, but we don't have historical data to back that up). This would also be covered by the LuaSkin UUID check, at least in hs.timer.

So, it seems relatively safe to say that these aren't happening when Lua is being torn down, and their relative consistency makes me suspect they're not stack/heap corruption, although I can't rule that out. I have some hypotheses:

They are happening at some transition period between a co-routine lua_State and the main lua_State (I have no idea how this works under the hood, maybe it's not possible)
Somewhere we're running the Objective C event loop in one of our modules, causing it to advance and process incoming events while we're in the middle of a Lua->C situation and the Lua state gets confused
We just have one or more memory corruption bugs in some C code somewhere
We have a very subtle Lua API interaction bug
We're hitting genuine bugs in Lua (seems unlikely that they would persist for this long without other much fancier users of Lua, noticing)

My bet is on 2, 3 or 4, but I don't have any evidence to support that yet.

So....... thoughts? :)

cmsj commented 3 years ago

I'm genuinely considering investigating what it would take to ship a single release with all of the malloc/zombies/sanitiser debugging features enabled, just to see if the nature of these reports changes (ideally in a way that gives us more information to go on).

Edit: Turns out our release builds have partial ASAN enabled, Main Thread Checker, and Zombie Checker, enabled already.

cmsj commented 3 years ago

Since we're apparently already paying the (approx 2x) performance cost of the Address Sanitiser in release builds, #2760 enables some of the additional options. I think we should carry these all for a single release that lasts 1-2 weeks and see what, if any, difference it makes to the quantity/quality of the crash reports we get, then revert the change. Any objections?

asmagill commented 3 years ago

No objections; It will take me a little longer to go through the items above and the list of crashes to see if I have any immediate insights, but as to capturing more detail in crashlogs, I'm all for seeing if this helps.

asmagill commented 3 years ago

They are happening at some transition period between a co-routine lua_State and the main lua_State (I have no idea how this works under the hood, maybe it's not possible)

Coroutines that work with the Hammerspoon modules are a pretty new thing... and even so, I doubt more than a handful of our users are using them yet. And even coroutine lua code runs on the main thread, so until a coroutine yields, which is the proper lua way of transitioning out of a coroutine, a queued event callback can't run.

Somewhere we're running the Objective C event loop in one of our modules, causing it to advance and process incoming events while we're in the middle of a Lua->C situation and the Lua state gets confused

Unless they're using my hs._asm.extras module (which I don't hide, but I don't promote, either... it's usually where I try out oddball things to see if they are worth pursuing further in a more official place), I don't think so... unless it's a side effect of another Objective-C operation, but I'd expect that to be clearly noted in the API docs.

Once I got co-routines working, I abandoned my testing with advancing the main event loop -- it introduces too much uncertainty, in my opinion.

I'm in agreement that 3 or 4 seem most likely, though.

I've also noticed in some of my own debugging that occasionally if invalid data is passed into an Objective-C method, it might be as many as 10 levels deeper in the traceback before the actual crash -- for example passing in a NULL value to a method marked non-nullable -- it's not always the called method but something it calls (or that calls, or...) before the crash occurs. It might be worth biting the bullet and addressing all of the warnings generated when building Hammerspoon... I know most are already set to cause an error during build, but not all are.

Would it be worth renaming all of the internal.m files so that even in a minimalist crashlog it's more clear which specific module contains the crashing code? It'd be an annoying change requiring modifications to almost all module files, lua and Objective-C, but maybe its time to bite that bullet as well?

cmsj commented 3 years ago

Couple of random thoughts while doing the dishes:

Lua does a lot of work before it calls a function, we essentially just call _pcall() - are we missing a trick there?
If we wrap debug.traceback we could capture protectedCallAndTraceback tracebacks as breadcrumbs. Not sure that this would be useful, but an extra datapoint might reveal an error pattern preceding these crashes

asmagill commented 3 years ago

lua_pcall is supposed to do that work for us, including guaranteeing a return (rather than a long-jump) on error.

Wrapping debug.traceback would at least make it clear if the crash occurs before the callback invokes lua or after... if it's before (or during, though during would put at least one of the lua_?call functions on the stack) then there would be no breadcrumb. But does that potentially leak private info?

cmsj commented 3 years ago

potentially it does, yes, in the sense that we'd see function names and filenames

cmsj commented 3 years ago

So, I've checked through the codebase and there's only one place we call CFRunLoopRun() - for enumerating mice in hs.mouse and it looks like we're doing it properly (ie with a custom event loop, so it shouldn't be processing any other events).

I did notice that when we do CFRunLoopAddSource() we're not very consistent about which runloop we ask to run on - sometimes it's CFRunLoopGetMain() and sometimes it's CFRunLoopGetCurrent(). In most cases those are probably the same, since GetCurrent() returns the runloop for the current thread, but we might want to review these just to make sure.

cmsj commented 3 years ago

Similarly we have some inconsistencies between adding source with either kCFRunLoopDefaultMode or kCFRunLoopCommonModes

asmagill commented 3 years ago

regarding the modes, I vaguely remember uncovering an issue with the mode used for timers when I was working on a menubar replacement (development on it stalled, because we fixed some errors in ours, but I eventually want to get back to it)... my replacement allowed for changes to the menu while it was open but timer's didn't fire while the menu was showing -- eventtaps did, though, which is why I looked at the modes and noticed the difference. I'll try to dig up the specific issue number when I get back home tonight, but if I recall my finding correctly, there were no issues when I changed the timer modes during my initial tests...

A long winded way of agreeing we should standardized, but I think we want to standardize on what eventtap uses in this case -- I'll confirm tonight.

latenitefilms commented 3 years ago

Food for thought... as CommandPost has a consistent Lua codebase (as in, very few, if any users modify the Lua code), it might be worth having a look at the CommandPost Sentry to see if there's any similar crashes between CommandPost and Hammerspoon?

CommandPost is pretty much always in-sync with the master Hammerspoon branch too, with the note that I generally update pods more regularly in CommandPost.

cmsj commented 3 years ago

@latenitefilms interestingly, you don't seem to have the same sorts of crashes in CommandPost

cmsj commented 3 years ago

I'm going to put out a release now, to get these updated sanitiser/assert things in the wild, and we can see what reports come in over the next week (assuming it doesn't all explode immediately!)

latenitefilms commented 3 years ago

@latenitefilms interestingly, you don't seem to have the same sorts of crashes in CommandPost

Well, that is... odd, given it's the same codebase, and I'm almost using every Hammerspoon extension.

cmsj commented 3 years ago

I take it as a positive sign - it likely means that people are using our API in ways we didn't think of and are triggering weirdness.

The challenge now, is to figure out what it is. If one/several of them would file GH issues, this would go a lot quicker, I suspect!

cmsj commented 3 years ago

So there's an interesting crash report now: https://sentry.io/organizations/hammerspoon/issues/2271630958/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d

It's only happening for one user so far, but the symptoms are super weird. It looks likeeverything goes wrong at MJLua.m:881 which is trying to save references to the completion word table and evalfn, but the check in luaRef for lua_isnil() (which is actually just lua_type() fails with an invalid index. The index we ask for is -1, which suggests that somehow setup.lua is returning nothing.

@asmagill does that look right to you? I'm struggling to see how a malformed init.lua could cause this, but it certainly looks like it's possible somehow

asmagill commented 3 years ago

@cmsj, can't fault your logic... I've come to the same conclusion. The only time I've seen something similar is when making updates to hs._coresetup or setup.lua directly, though usually that results in errors, not a shortened stack.

You could check the stack size before MJLua.m line 881 and throw an alert like we do for other setup issues, if you wanted to guard against this happening again and have a more "friendly" response... I'd also probably verify that they're the appropriate return types as well, just to catch all show-stopper possibilities.

In Skin.m, if you're concerned about the -1 index, you could wrap it with lua_absindex, but I suspect all that would do is shift where the error is triggered.

I also noted in Skin.m line 508 that if you do return early because the item to be stored is nil, it isn't removed from the stack while a successful call does... is this intentional?

cmsj commented 3 years ago

Agreed on checking the size and types of the stack. I'll get going on that.

There are some more crash reports rolling in today, which I'll be diving into later, but I'm also starting to think about how we can encourage the (very few) people who are getting these crashes, to talk to us. I believe Sentry offers some kind of UI for users to send a message with the crash report, so I'll look into that, but if anyone else has suggestions for ways to improve the communication here, I'm all ears :)

latenitefilms commented 3 years ago

Some random thoughts/questions:

Is it worth keeping a counter of how many Lua instances are created (i.e. how many reloads), or help narrow down if things are caused by an initial load of Hammerspoon, versus many reloads later?
I'm assuming this hasn't been implemented due to privacy concerns, but I'm assuming it's possible to include the last Lua command triggered (from either the Console or loaded .lua file) in the crash logs to help with fault-finding?
Is it worth removing setup.lua all together, and triggering that same code in Objective-C land? That way we can know for certain that the setup process is complete, and that _coresetup/init.lua is the first thing loaded. In theory, it would be great if Hammerspoon could still "load" even if coresetup fails for whatever reason.

latenitefilms commented 3 years ago

Also, FYI - I can no longer access Hammerspoon's Sentry account - it just says:

Authentication error: You do not have access to the required GitHub organization.

I assume this is because I'm not a member of Hammerspoon on GitHub.

I probably don't need access, so all good - but just wanted to let you know.

latenitefilms commented 3 years ago

Since we're apparently already paying the (approx 2x) performance cost of the Address Sanitiser in release builds

Ah, interesting... does this mean I can double the performance of Hammerspoon/CommandPost, simply by disabling the Address Sanitiser?

cmsj commented 3 years ago

Is it worth keeping a counter of how many Lua instances are created (i.e. how many reloads), or help narrow down if things are caused by an initial load of Hammerspoon, versus many reloads later?

It couldn't hurt!

I'm assuming this hasn't been implemented due to privacy concerns, but I'm assuming it's possible to include the last Lua command triggered (from either the Console or loaded .lua file) in the crash logs to help with fault-finding?

It's actually non-trivial - we'd essentially need to maintain a version of luaL_traceback that returns the traceback as a string rather than pushing it onto the Lua stack.

Is it worth removing setup.lua all together, and triggering that same code in Objective-C land? That way we can know for certain that the setup process is complete, and that _coresetup/init.lua is the first thing loaded. In theory, it would be great if Hammerspoon could still "load" even if coresetup fails for whatever reason.

Certainly our startup process has gotten very complex and fragile. I can definitely see an argument for separating out coresetup from the user's init.lua.

Also, FYI - I can no longer access Hammerspoon's Sentry account - it just says:

I just asked it to send you a link, but let me know if it still doesn't work. I also added you to the Hammerspoon org on github. I'm pretty sure you're well past the point where that is deserved :)

Ah, interesting... does this mean I can double the performance of Hammerspoon/CommandPost, simply by disabling the Address Sanitiser?

Turns out I was wrong, the Release scheme only has all those things enabled for Test builds. I'm going to experiment with whether I can produce a full release build with all those things enabled though - even if it's just for a week or two, I want to catch more crashers. The current release has shaken out a few more, from having Lua's assertions fully enabled.

asmagill commented 3 years ago

@cmsj, forget my comment above re timer and modes... the issue I was referencing was #1598 and it's already been addressed. And from the discussion there, my vote is standardizing on common modes.

cmsj commented 3 years ago

Ok, so now that 0.9.88 has been out for a few days, with the explicit Sentry events when checkLuaSkinInstance detects a UUID mismatch, I'm starting to see those explicit events: https://sentry.io/share/issue/33c1e2697b6548adb19633195ff181ca/

The log also shows that the watchers' luaSkinUUID is empty, which suggests (although not conclusively) that its __gc() was called, but somehow the OS listener wasn't removed. I now realise that I made a huge mistake by initialising luaSkinUUID to all zeros at initialisation and then setting luaSkinUUID[0] to \0 in __gc(), so I'll rework that so it's never empty (ie at startup I'll set it to one string, and at __gc() I'll set it to another, so we can tell exactly which lifecycle events have/haven't happened).

The reported events thus far have only been for hs.audiodevice and I would note that we're not checking the OSStatus return codes of AudioObjectRemovePropertyListener() so maybe somehow the OS is failing to remove all of them (in which case I guess we would have no choice but to intentionally crash).

The only other thing I have in mind is how we could run some tests that create a lot of watchers, cause a lot of events to happen, and reload the config a bunch of times, to try and provoke these crashes more directly, for debugging purposes.

Thoughts anyone? :)

latenitefilms commented 3 years ago

FWIW, I personally very rarely see CommandPost crash during reloads. The main time I see it crash is after waking my laptop from sleep. It'll be running when I put the Mac to sleep, but when I wake up, it's no longer running. Sentry tells you when most recent sleep event happened, right?

Did you end up already implementing a reload counter in the Sentry logs? If not, that might be interesting to see how many times Hammerspoon is reloaded before a crash occurs.

cmsj commented 3 years ago

I don't have a counter yet, and curiously, a lot of the crashes seem to be happening long after a reload (as much as a couple of hours later, in one instance).

cmsj commented 3 years ago

Good people of the Internet, I believe I have finally root-caused at least one variant of these crashes.

I was looking through the 8 hs.audiodevice related instances of checkLuaSkinInstance failing and I noticed that in one of them, while some time had passed between starting HS and the crash, at no point had Lua been reloaded.

That ruled out the idea that these objects were leftovers from a previous Lua instance. However, as I was looking around the hs.audiodevice code, I looked at audiodevice_callback() and started thinking about what our use of dispatch_async() actually means is going to happen.

The code will run on the next iteration of the Objective C runloop, but that won't be until Lua has finished doing whatever it was already doing, and that led me to the realisation that if Lua is working on stuff that happens to trigger one of these C callbacks, it's also possible that while Lua is then still doing other things, it might decide to garbage collect the object.

This leaves you in a situation where Lua has discarded an object, but Objective C is going to do something with that object on its next runloop iteration. Since Lua doesn't seem to defensively wipe the memory used by a userdata object, it's still likely to be present enough for the callback to be able to do something with the pointer, but it quickly goes off the rails.

I'm not 100% sure if this hs.audiodevice instance is identical to the ones that had been plaguing hs.timer and hs.eventtap, but I can well believe it is, and I guess we'll find out as 0.9.88 rolls out more widely and the Sentry data starts coming in.

For reference, the reproducer is:

foo = hs.audiodevice.defaultOutputDevice()
foo:watcherCallback(function(uid, event, selector, scope, element) print(uid, " :: ", event) end)
foo:watcherStart()
foo:setVolume(80)
foo = nil
collectgarbage()
collectgarbage()

(it only works if your audio volume is something other than 80% to start with)

The :setVolume() triggers the C portion of the watcher callback, which schedules part of itself to run on the main thread via dispatch_async(), but before that happens, foo (ie the hs.audiodevice object) is forcibly garbage collected.

So, now the hard part comes:

Do we switch to dispatch_sync() or do we stick with async and rely on checkLuaInstance to protect us? @asmagill this is definitely one I'd appreciate your help with!

latenitefilms commented 3 years ago

Awesome detective work!

Will leave @asmagill to discuss with you best plan of attack, as this is all very above my pay grade.

Curious... What's the disadvantage of using dispatch_sync()? Will it potentially slow things down or degrade performance? As Lua is single threaded, all the callbacks will be waiting on Lua to finish doing whatever it's doing anyway, right?

cmsj commented 3 years ago

@latenitefilms it's coming up on half-midnight here, I am not sufficiently brained to reason about that right now, but I believe the original thought behind using dispatch_async() was to let the C portion of the callback complete ASAP and do the Lua portion of the callback later. For things like hs.eventtap this actually matters because the OS will start dropping callbacks that take too long to execute, and there's potentially no upper limit on how long the Lua portion might run for.

latenitefilms commented 3 years ago

That make complete sense, in which case, relying on checkLuaInstance seems like a good solution?

cmsj commented 3 years ago

If my reasoning there is correct, then yes, by accident I wrote a debugging feature that ended up indirectly doing what we needed. If that is the decision we come to, I'll rework it so it doesn't log things to Sentry, because we will have agreed that this isn't really a bug, but an expected side-effect of userdata lifecycles, and then we'll need to roll it out to every C->Lua entrypoint.

It'll need a new name too, maybe something like checkGarbageCanary.

asmagill commented 3 years ago

Don't have the code in front of me at the moment, and I'm about to head out the door, but my first question is are we checking that the callback ref hasn't been cleared by __gc in the async block? That has been sufficient in most of the cases we've run into this runloop queue vs timing issue before.

In general async will allow the system to be more responsive, but sync is required if we require feedback from the Lua callback function... Switching everything to sync may make things less responsive in some cases.

asmagill commented 3 years ago

Ok, took a moment to look at the watcher code and it is checking, but it still creates the skin instance and issues the _lua_stackguard_entry(skin.L) before checking the callback ref value... do we know specifically which line in the audio callback is triggering the issue? Would checking the callback ref value before doing the stackguard check make a difference?

cmsj commented 3 years ago

that's a really good point actually, and explains why hs.audiodevice wasn't showing up in the crash reports before, because it does check the ref. hs.timer doesn't.

hs.eventtap does though, so I'm digging back through Sentry to try and find some of the crashes there to see where it was going wrong.

cmsj commented 3 years ago

hs.hotkey does check for LUA_NOREF and somehow was still crashing, so I'm now a little more confused than I was. I'm going to quickly push out 0.9.89 which adds checkLuaSkinInstance to hs.hotkey and I guess we'll need to wait a week or so until we have a good chunk of the userbase upgraded.

Edit: On second thought, I won't do that, I've changed enough stuff today that I should wait and test some more before releasing again.

latenitefilms commented 3 years ago

I'll merge all your changes into CommandPost today, and test it out on my machine, and let you know if anything weird pops up.

cmsj commented 3 years ago

@latenitefilms #2859 and #2860 will be relevant here - the former just renames all the luaSkinUUID stuff to lsGCCanary, but the latter adds canary checks to most of the modules that we/CP are seeing the crashes in currently.

cmsj commented 3 years ago

Ok, so now that we have data from 0.9.89 it's clear that we didn't fix everything here. There are a few crash reports that started out in hs.hotkey and looking more closely at the code, I understand why and have a rough idea how to fix it. There's also a couple from people running things in the Console, that I don't understand yet, and one from hs.chooser which makes some sense because I didn't add GCCanary support there.

Finally, there are the lingering crashes from hs.webview.toolbar that I suspect are related to garbage collection, so we should think about a smart way to avoid crashing in those scenarios.

latenitefilms commented 3 years ago

Yeah, apologies for all the CommandPost crash logs. I haven't recompiled yet, so my local CommandPost is crashing every time I reload, because I'm messing around with hs.serial.

cmsj commented 3 years ago

2867 is an attempt at fixing hs.hotkey, but I haven't had a chance to run it through its paces yet, but I did merge some hs.hotkey tests at least.

cmsj commented 3 years ago

I've not yet been able to manually reproduce the hs.hotkey crash, in part because .new() and .bind() are in Lua and keep a reference to the underlying userdata object, so it's not clear to me how we can get into a situation where the userdata object has been lost to garbage collection.

latenitefilms commented 3 years ago

Random idea... have you looked into the use of hs.shutdownCallback - I wonder if that could explain some of the weird things?

cmsj commented 2 years ago

Since @asmagill indicated he might look at the Sentry logs soon, I thought we could resurrect the discussion here...

There are a few relatively recent crashes which seem to come from Lua's API checks failing. I'm not sure if we've regressed or if users are doing a new thing.

But on the whole the crashes are coming from C callbacks jumping into Lua and then somehow a pointer is stale/wrong and things explode. hotkey and eventtap seem to be most common, but I suspect that's because those are the most used callbacks.

I thought the LSGCCanary checks would fix these, and while it has avoided some crashes, it hasn't had anything close to the impact I was hoping, so I'm currently out of ideas.

Edit: BTW I would recommend going to the Releases view in Sentry and picking 0.9.95 - we still have 30-40% of our users on previous versions, so looking at the raw Issues section will show your their crashes too, many of which we've fixed.

Hammerspoon / hammerspoon

Root causing our "weird" crash logs #2759

2867 is an attempt at fixing hs.hotkey, but I haven't had a chance to run it through its paces yet, but I did merge some hs.hotkey tests at least.