HaxeFoundation / hxcpp

Runtime files for c++ backend for haxe
Other
295 stars 184 forks source link

[4.1.5] Sudden glut of EXC_BAD_ACCESS (KERN_INVALID_ADDRESS) after shipping app update #1006

Closed jskuse closed 2 years ago

jskuse commented 2 years ago

Hi @hughsando! We recently shipped an update to our HXCPP 4.1.5 (and NME!) based game and have seen our previously extremely low crash rate suddenly balloon on iOS, in a way that's leaving me a bit flumouxed - see report below. The specific EXC_BAD_ACCESS seems to occur specifically when calling map.keys() at a specific point, however the code makes that call hasn't changed in this update. But thinking about it, there is a difference in that it is occurring in a higher memory use context than previously.

Given the KERN_INVALID_ADDRESS 0x0000000000000008 my immediate gut reaction is maybe some nebulous GC "stuff" that's way outside my sphere of expertise.

We'd held off updating to 4.2.1 because of https://github.com/HaxeFoundation/hxcpp/issues/957 and I've been super reluctant to update beyond 4.1.5 since it has been incredibly stable for us up until now. It doesn't look like there's been an official HXCPP release since 4.2.1, which makes me think maybe we should be giving the GIT version a go? But obviously, with stability being paramount, I'm loathe to to try that without a sense that it's likely to fix it.

Unfortunately, our QA hasn't been able to repro the issue, so it's going to be very difficult to verify outside of shipping an update and hoping.

I hate to be this vague, but do you have any insight here?

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Subtype: KERN_INVALID_ADDRESS at 0x0000000000000008
Exception Codes: 0x0000000000000001, 0x0000000000000008
VM Region Info: 0x8 is not in any region.  Bytes before following region: 4310515704
      REGION TYPE                 START - END      [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      UNUSED SPACE AT START
--->  
      __TEXT                   100ed4000-100ed8000 [   16K] r-x/r-x SM=COW  ...ading For Fun
Exception Note:  EXC_CORPSE_NOTIFY
Termination Reason: SIGNAL 11 Segmentation fault: 11
Terminating Process: exc handler [10511]

Triggered by Thread:  0

Thread 0 name:
Thread 0 Crashed:
0   [game name]                 0x00000001016259d0 __string_hash_keys(Dynamic&) + 16
1   [game name]                 0x00000001010dbf54 haxe::ds::StringMap_obj::keys() + 64
hughsando commented 2 years ago

The memory address 0x0008 looks like some integer has been written over the String map variable. The two main reasons for this are:

  1. Threading
  2. GC IOs + NME generally does not use threading - unless you are doing something yourself (eg, special http loader callback) it is probably not threading. But you could see if something is trying to write to your map while it is being read from. Also probably not an issue if you initialize it once and the use it.

So this probably leaves GC. There was some issues with IOS when app start code expanded and this messed some some GC stack settings. Depending how old your system is, there was this: https://github.com/haxenme/nme/blob/master/templates/ios/UIStageView.mm#L2009

That is, if you look at your crash call stack, at the first haxe/nme call, is it in a "OnEvent" callback (or similar) that starts with: gc_set_top_of_stack ?

Does the crash happen on startup, or after some frames have been renderer

jskuse commented 2 years ago

Thanks for the speedy response Hugh, you're a star!

That's a great bit of insight.

Firstly, we do use threading (custom audio and http systems, image loading when creating the world to stop it stalling), but they've both been live for a long time / heavy use without issue and have been carefully set up to be safe. So while not entirely impossible, it's not been my first port of call so far.

I can see that the crash is happening deep into gameplay; the stack is the same with every log allowing me to pinpoint quite accurately where in the game it is happening, which seems to suggest that it's crashing at exactly the same place every time (but so far not in a replicable way!).

However it does happen immediately after loading a large resource. The map in question contains a load of haxe.io.Bytes indexed by string, which have been parsed from a custom binary archive format and the crash occurs immediately after this parse action, when it tries to iterate over the entries.

The game itself is semi-open world so the steps the player would have gone through (and thus the state of the memory) would be quite different from player to player. When we've experienced GC related issues in the past there was some variance. Would the fact that it's happening the same place every time (albeit after a big load operation) make it less likely to be GC related?

I can confirm that our UIStageView OnEvent function is identical to the current version you linked. And re. the the lower part of the stack, it looks like this:

31   [game name]                0x00000001017ed6bc external_handler_native(nme::Event&, void*) + 164
32  [game name]                 0x00000001017e8ec0 nme::Stage::HandleEvent(nme::Event&) + 988
33  [game name]                 0x000000010169f314 NMEStage::OnEvent(nme::Event&) + 344
34  [game name]                 0x000000010169f1a8 -[NMEAnimationController mainLoop:] + 176
35  QuartzCore                      0x00000001a21ee840 CA::Display::DisplayLink::dispatch_items(unsigned long long, unsigned long long, unsigned long long) + 748 (CADisplay.mm:4175)
36  QuartzCore                      0x00000001a21f5878 display_timer_callback(__CFMachPort*, void*, long, void*) + 372 (CADisplayTimer.cpp:219)
37  CoreFoundation                  0x000000019e4a6bf0 __CFMachPortPerform + 176 (CFMachPort.c:549)
38  CoreFoundation                  0x000000019e4ea258 __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE1_PERFORM_FUNCTION__ + 60 (CFRunLoop.c:1996)
39  CoreFoundation                  0x000000019e4eda64 __CFRunLoopDoSource1 + 604 (CFRunLoop.c:2136)
40  CoreFoundation                  0x000000019e4a7664 __CFRunLoopRun + 2372 (CFRunLoop.c:3172)
41  CoreFoundation                  0x000000019e4babc8 CFRunLoopRunSpecific + 600 (CFRunLoop.c:3268)
42  GraphicsServices                0x00000001ba626374 GSEventRunModal + 164 (GSEvent.c:2200)
43  UIKitCore                       0x00000001a0e2eb58 -[UIApplication _run] + 1100 (UIApplication.m:3511)
44  UIKitCore                       0x00000001a0bb0090 UIApplicationMain + 364 (UIApplication.m:5064)
45  [game name]                 0x00000001016a2768 nme::CreateMainFrame(void (*)(nme::Frame*), int, int, unsigned int, char const*, nme::Surface*) + 200
46  [game name]                 0x0000000100f796c0 nme::app::Application_obj::createWindow(Dynamic, Dynamic) + 1840
47  [game name]                 0x00000001014e3808 nme::Lib_obj::create(Dynamic, int, int, hx::Null<double>, hx::Null<int>, hx::Null<int>, String, hx::ObjectPtr<nme::display::BitmapData_obj>, Dynamic) + 768
48  [game name]                 0x00000001015607ac ApplicationMain_obj::main() + 392
49  [game name]                 0x00000001015d4a0c __hxcpp_lib_main + 44
50  [game name]                 0x000000010164ec3c hxRunLibrary + 20
51  [game name]                 0x0000000100edbc40 main + 32 (Main.mm:22)
52  dyld                            0x0000000101e59da4 start + 520 (dyldMain.cpp:879)
hughsando commented 2 years ago

Yes, that call stack looks ok as far as top-of-stack goes. Is that binary format parser written in haxe, or does it use native components? Do you use any of the cpp.Native routines to read directly into the data? If you are using pure haxe, it would make a GC crash less likely.

One thing you may do to try to get a reliable crash is sprinkle explicit gc.collect calls ( cpp.vm.Gc.run(true) ) calls into the suspect (large loading) routine. Start with immediately after, then add some to the beginning and the middle.

You can also try setting the define HXCPP_GC_DEBUG_LEVEL in Immix.cpp This may help show any errors. But also, it has been known to prevent errors showing up.

There is also the defines: HXCPP_CHECK_POINTER and HXCPP_GC_CHECK_POINTER. The second ones does extra checks to try identify bad objects as early as possible, which can help pinpoint what when wrong first, rather than some knock-on effect. They may also show problems earlier in QA rather than production. You can add these in the haxedef in the nme config file.

jskuse commented 2 years ago

Hi again @hughsando!

Firstly thanks again for your help on this, it makes it so much easier when running into these things to get great responses like these.

To answer your direct question, no the unpacking stuff is all just haxe.io.Bytes shenanigans, so doesn't really look suspect. But then nothing else does either.

However, there is good news! Since there was anything obviously wrong and we just could not reproduce, I took the decision to experimentally refactor that code to remove the intermediate step of loading all the bytes into a map before consuming them and replaced with an inline version of the unpacking code that simply did what needed to be done with each virtual "file" immediately as it was unpacked, thus removing the problematic map.keys() call.

I figured whatever gremlin we had would probably just appear somewhere else, but felt it was worth a go even if just for diagnostic reasons. However, to my surprise this has completely eradicated the crashes. So I'll close this issue for the time being.

What I will try and do when things settle down is roll back to the previous code and rebuild with your suggested flags and a liberal sprinkling of GC collects and see if we can find out any more.

Thank you again for your help, plus the for HXCPP and NME; you're a big part of making hundreds of thousands of kids happy every month!