Missing Ptrs / Bounty US 1000$ to fix the Garbage Collector

sibest commented 3 years ago

Hi,

I've already worked on the GC before. Basically, some pointers are missing probably in MarkConservative or somewhere else, and this is leading my game (http://playruneverse.com) to random crashes.

I offer a 1000$ Bounty to help me fix the GC. mark_crash_2

waneck commented 3 years ago

Hi! This looks familiar. Can you try merging https://github.com/proletariatgames/hxcpp/pull/10 into your project and seeing if that fixes it?

sibest commented 3 years ago

No luck : (

hughsando commented 3 years ago

There are a number of reasons why this could be happening - it is either:

a bug in the GC system
the compiler optimizing something in an unexpected way
a missed reference due to an oversight in the haxe generated code
an issue with native integrations
- threads
- storing haxe objects.
something new

And the diagnostic tests:

The crash you see here is in the multi-thread marking. The problem is (very probably) not there, but earlier when either a live object overwrites some random memory, or more likely a reference got missed somewhere and this memory is allocated to a different object. To debug this, uncomment the line: #define HXCPP_GC_DEBUG_LEVEL 1 in hx/gc/Immix.cpp This will run a bit slower, but not so bad that you can't leave it on while testing. This will generate a much more informative callstack in the case of a crash.
Does the problem happen in debug mode? If not, it might be an optmization issue.
Are you using generational garbage collection? If so, adding the haxe defines -D HXCPP_CHECK_POINTER -D HXCPP_GC_CHECK_POINTER could help identify the problem early. This may slow things down though.
Do you run many threads or native integrations? One thing to watch out for is callbacks from native code, such as audio sub-systems, or async responses.
We will see

Is it easy in general is it to get it to crash? If the bug happens early, defining HX_GC_FIXED_BLOCKS in Immix.cpp can help with getting consistent crashes at the same memory location, which can be handy for debugging.

The first thing to try is HXCPP_GC_DEBUG_LEVEL and this might point right to the problematic object, and go from there.

sibest commented 3 years ago

Hi @hughsando

Yes it happens in debug mode too
It's happening both in non-GC and GC almost
It's crashing very early, so not using threads or callbacks

Even with HX_GC_FIXED_BLOCKS it's crashing randomly (maybe in the stack there are previous values even after the program run?)

Anyway using recursive marking (inPtr->Mark(inCtx);) the program does not crash. But i've checked the code of MarkContext and MarkChunk and seems correct. Maybe it's the marking order?

Do you have telegram? I will pay your time. Thank you much

sibest commented 3 years ago

No, it's crashing with recursive marking too. Just later. <-- NO Sorry, it was crashing because I had FIXED BLOCK enabled and i limited it to 100mb only. With recursive calls it does not crash.

sibest commented 3 years ago

I've one question:

In MarkConservative: You manage sgCheckInternalOffset in GetEnclosingAllocType, but this way the sgCheckInternalOffset could be outside the boundary of the vptr block and miss the allocation.

Shouldn't be more appropriate to do something like this: for(int *ptr = inBottom ; ptr<inTop; ptr++) { for(int dx=0;dx<=sgCheckInternalOffset;dx+=4) { void *sptr = *(void **)ptr; void *vptr = (char*)sptr - dx; in MarkConservative?

hughsando commented 3 years ago

I think these are equivalent because if it is truly a pointer to the middle of an object, then the start and the middle should both be in the same vptr block since real objects so not span blocks. Does this make sense?

It is possible there is a bug in the MarkChunk code, but this has been reasonably well battle tested. Perhaps in some unusual case like a massive linked-list or similar, or maybe a low-memory situation (which it does not look like here).

The ability to reproduce bugs consistently can be influenced by things such as an interactive startup, where a mouse-event can introduce some randomness. It looks like the first bug is coming from an Xml object - sometimes these things can be reproduced more reliably by, for example, putting the xml parsing in a loop and running until it crashes. If this is the case, then you might be able to isolate a small amount of code the reproduces the bug. If you send this to me, I should be able to get to the bottom of it.

But to summarize:

it happens in debug mode. This rules out most optimization problems and probably sgCheckInternalOffset too, which usually happens when windows optimizer keeps a pointer to the data pointer in an array, rather than the beginning of the array.
It happens in non-generational mode? This rules out quite a few possibilities
debug_level 1 fixes things so could be a MarkChunk issue. There are a lot of checks for MAX_MARK_THREADS > 1 in Immix.cpp. If these are changed to >0 then the Chunk code will get run with a single thread. If this reintroduces the problem, then there could be an issue with the MarkChunk logic, it not, then it could be a race condition in the MarkChunk code. But this could also be a side-effect of something else, like memory overwrite.
Threading and native code is pretty much ruled out

My best guess now is a bug in the xml objects internal implementations , which include maps and arrays and dynamic arrays and pretty much all the internal types. Next would be a MarkChunk race condition.

If you can get a program to crash in an xml parsing loop, that would be super helpful.

waneck commented 3 years ago

Can you get it to crash on Linux? We've found that it's very useful to use https://rr-project.org/ to debug GC crashes, since you can trace back what happened to that pointer before

sibest commented 3 years ago

Unfortunately my project doesn't compile on linux. Is it possible that Map<ObjectKey,Something> ObjectKey are not marked ? Maybe only Values are marked ?

sibest commented 3 years ago

Ok no, I've tested the Hash and it's working properly, plus i've tried forcing the mark of weak pointers too and it's crashing the same way. I've tried to change the MarkChunk thing with a simple list and it's crashing the same way, so there is no fail in MarkChunk. About Xml, sometime it crashes there, sometime a little later, sometime before the Xml.

sibest commented 3 years ago

It's crashing also in Mac and iPhone.

sibest commented 3 years ago

I think I got it. Random stack data could fool GetAllocTypeChecked: it must check that the pointer is not inside an hole. Moreover allocStart flags must be set to 0 in reclaim when the rowMarker is empty, otherwise GetAllocType can fail with the nursery check. Tomorrow I will issue a pull.

sibest commented 3 years ago

https://github.com/HaxeFoundation/hxcpp/pull/946

sibest commented 3 years ago

Even if this pull reduces the problem by a lot, the only true solution is to update allocStart everytime (full collect). It's possible that with an outdated allocStart flag, an invalid pointer can be found on the stack pointing to a valid allocated space inside the blocks with a fake valid header leading to a crash. Very very rare, but can happen. Other solutions are appreciated. Maybe the zero thread can update the allocStart flag while not collecting? Or, we can scan the nursery outside holes too and skip the allocStart flag.

hughsando commented 3 years ago

Checking again:

this only happens for generational
it also happens in debug mode

In debug mode sgCheckInternalOffset can be set to 0 IIRC. You will also note that this is the case for non MSVC so if the bug happens in mac/ios the internal offset is probably not the cause.

Also, register capture is not normally an issue in debug mode, and again if it happens on Arm64, it is also unlikely to be the issue.

If it is just an generational issue, then there is the HX_GC_VERIFY define that performs a second pass to check for missed object, although this might not help if it is a conservative marking issues.

The general principle is that in nursery holes, allocStart should be zero, and the only way to find an object start is to scan the size links. When a nursery object is marked, it should set the allocStart and then normal logic applies. Maybe this setting should be delayed until the conservative marking is over so it does not affect other objects on the same row, or every object on that row should be promoted, or use an explicit check for inHole like you suggest. I also note that there are some thread primitives in this code and a bug would fit with the symptom of not breaking in a single-threaded recursive mark situation.

sibest commented 3 years ago

Actually it's simple to understand:

0                             Row 0                             127
+---------------------------------------------------------------+
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 
+---------------------------------------------------------------+
                        32 allocStart flags

In the first time two allocations are made in Row 0:

0                             Row 0                             127
+---------------------------------------------------------------+
|1| | | | | | | | | |1| | | | | | | | | | | | | | | | | | | | | | 
+---------------------------------------------------------------+
| Allocation 1 = 40 bytes
| Allocation 2 = 88 bytes - full row
+---------------------------------------------------------------+

2x allocStart flag

Then a NON-full collect goes on ad frees the entire row (but not the allocStart flags)
A new allocation takes place:

0                             Row 0                             127
+---------------------------------------------------------------+
|1| | | | | | | | | |1| | | | | | | | | | | | | | | | | | | | | | 
+---------------------------------------------------------------+
| Allocation 1 = 96 bytes 
+---------------------------------------------------------------+

As you can see the second allocStart flag is still set to 1

MarkConservative then kicks in, in the stack a valid pointer that is pointing to the old second allocation is found AND for for a coincidence the 96 bytes allocation (which is for example a small ByteArray) contains data at the place of the second old allocation WHICH is considered a valid header <-- For some reason the markId is considered valid. But it's not a real allocation, so it tries to MarkAllocObject and then CRASH.

VisitBlock is also flawed as it only relies on allocStart flag to check if an allocation is starting there, BUT with previously outdated set allocStart the same thing goes on.

hughsando commented 3 years ago

All reclaims (full or not) should ZERO_MEM all the allocStarts for the hole, so it should not be possible to allocate an object that covers an existing allocStart mark. https://github.com/HaxeFoundation/hxcpp/blob/master/src/hx/gc/Immix.cpp#L1081

Because it is expensive to update the allocStarts, their meaning is "There was an object allocated here within the last N non-generational collects" rather then "there is certainly an object here".

The starts are only for conservative object detection. The row marks are for free row and hole detection.

Event with an allocStart, an object could be stale (the header is good, but the contents are bad) if it is not from the current allocation cycle. So an additional check is needed for the allocation cycle, time==gByteMarkID. But only a few bits are used for storing this time, so the allocStarts are updated before the time can "clock over". This just removes stale object marks, but does not free the row. At the time that this is run we know rowMarked[r] is true - ie, so at least something on the row is alive.

The nursery objects are different because they do not get a allocStart mark until we are certain we are keeping them. Instead we must follow the linked-list of sizes to see if it lands on an object pointer.

So I'm thinking either marking one conservative object messes with the size linked-list somehow and prevents other objects from being retained, or the HxAtomicExchangeIf is not working as I intended or there is something wrong at a higher level.

sibest commented 3 years ago

You are right, it's something else. It's the size linked-list which is broken. Maybe a write overflow somewhere. I will see..

hughsando commented 3 years ago

Yes, I think I see the problem - it is with allocStart as you suggest, when a previous row has an object at the end, and the next row is the start of a hole. The fix is pretty simple because we know the trailing object is invalid.

sibest commented 3 years ago

I see but did you change only the code when sgCheckInternalOffset is active? Because it crashes on mac too

hughsando commented 3 years ago

Yes indeed. May need to look for a second bug because this is specific to offsets that can span rows. I have added a define HX_GC_ZERO_EARLY which can help with some reproducibility issues. But back to the diagnostics - does it happen in debug mode, is it only generational? If debug fixes it, it could be that mac now needs sgCheckInternalOffset too.

sibest commented 3 years ago

Unfortunately it's crashing in debug mode too on Mac

sibest commented 3 years ago

Step 1: Write objects into HOLE 0 Inside the nursery Step 2: Collect. HOLE0 is now free again Step 3: Collect, MarkConservative scan for the nursery and find some valid header inside Hole0 of older objects (the hole was not zeroed). Can this give a problem ?

hughsando commented 3 years ago

So debug mode mostly rules out sgCheckInternalOffset and register capture. Next would be to put on HX_GC_ZERO_EARLY and HXCPP_GC_DEBUG_LEVEL 1. I will see if I could reproduce something on mac.

sibest commented 3 years ago

Maybe the allocStart flag should be set for nursery object too, and the header checked to verify if it is nursery or marked object?

hughsando commented 3 years ago

It should not be the problem because the MarkConservative happens after the writes are finished and before they are zeroed. The blocks have a flag to say whether they are zeroed - could check this at runtime. They should be zeroed before they are used unless there is a bug. The HX_GC_ZERO_EARLY should make sure this is true, so if this fixes things, there may be a logic bug.

sibest commented 3 years ago

It's fixed in my test build now, i tell you what changes I made in a second.

hughsando commented 3 years ago

Once it gets to a point where I am out out of ideas, the next step is to add code to verify the assumptions. For example, in some kind of debug mode, create a separate "nurseryAllocStart" array and set this for all objects. Then in the collection code, loop over everything and verify that these flags exactly match the GetAllocType logic - or at least for every conservative object. This should pinpoint the problem if it is there.

sibest commented 3 years ago

About the allocStart for nursery object i was thinking at something like this: Step 0: Hole 0 is used to write an object that refers to another object in Hole 1. Step 1: Hole 0 and Hole 1 are reclaimed - but the nursery data is still there. Step 2: Hole 1 is used and zeroed by another block of some kind. Step 3: Some pointer found on stack goes on Hole 0 and revives the old object, and marks it, it's true to say that that object never stopped living, but the othe objects refered to him did. So crash.

I think that a nurseryAllocStart flag is necessary at this point, what do you think ?

hughsando commented 3 years ago

This really should not happen because after Step 1, the data might be there for a little while, but by the time Step 3 happens one of these things should be true:

The row with the data on is kept alive by another object, in which case it will not be in an hole range so should not show up as a nursery object.
The row was not kept alive, so it should be zero and/or resused with valid objects
The row was reused and the stack accidentally points to new object - this is ok, the valid object will stay alive longer.

Point 1 seems the most likely to be wrong. It would need to have a valid allocStart flag and a valid time code but have been missed on the stack at least once so its contents have been collected.

hughsando commented 3 years ago

Now I think about it, that might be what is going on: generational pass 0 finds object (say array) on the stack and marks it generational pass 1 does not find it, so the array contents are trashed, but the header remains because the row is kept alive. generational pass 2 finds the valid header again and tries to use it and it crashes. This would not be an allocStart thing exactly. I may need to remember all conservative marks until the time code is increased at the next non-generational cycle. Not sure how to test this without hacking some special test code to see if I can break it this way.

hughsando commented 3 years ago

Thinking about how your fix may work The extra reclaim effectively clears the allocStart where there are no row marks. I wonder if this is all that needed. If so, I would probably look at how it might be that this occurs. The other thing is that it is on the transition from generational to non-generational, which might need some more thought.

sibest commented 3 years ago

Yes, somehow it works.

sibest commented 3 years ago

It's completely fixed now! I will upload my PR, i just changed the order of reclaim/MarkAll in Collect as mentioned before, plus I added an "used" flag to holes. So that GetEnclosingNurseryType doesn't scan "non-used" holes. This way, my game doesn't crash anymore, neither in stress-test!

hughsando commented 3 years ago

So I am reluctant to add the additional reclaim if it is not required. Do you do explicit collections?

Your used-hole code may be equivalent to putting this at the top of GetEnclosingNurseryType. This would be a simple fix if it works:

      if (!mZeroed)
         return allocNone;

sibest commented 3 years ago

It works.

sibest commented 3 years ago

Actually, with

   if (!mZeroed)
         return allocNone;

is enough. The double reclaimBlocks is no longer needed. I struggled so much in this GC code and the solution are just two lines, was about to drop my project.

sibest commented 3 years ago

While on Mac and Windows is working perfectly, on Android is still crashing. I've tried also enabling the sgCheckInternalOffset but no change. Probably a register capture problem ?

Android (64 bit) crashing either with setjmp and debug.
iOS (64 bit) working fine.

hughsando commented 3 years ago

Well, that is good news that the mZeroed works - it is a simple little fix.

While these are clearly bugs with the generational code, I'm not sure why you have had them and others not. Maybe they have not run it enough? Is there something special you can think of that you are doing that would make it more likely to crash - I ask because I would like to create some similar tests.

So on android it looks like it not missing stack variables because of debug/setjmp. The compilers are quite similar for arm64, so that probably rules a few things out too. My first suspicions for android are to do with threading - because generally the apps are multi-threaded (render/ui) and they have a different set of atomics. Have you tried non-generational? This could point in certain directions. The other thing would be to try a terminal style command-line exe run over adb and/or MAX_GC_THREADS=1 to rule out thread issues.

sibest commented 3 years ago

Good news! I've implemented ARM64 register capturing and now it's working perfectly on Android too!

void CaptureARM64(RegisterCaptureBuffer &outBuffer)
{

   void *regX0;
   void *regX1;
   void *regX2;
   void *regX3;
   void *regX4;
   void *regX5;
   void *regX6;
   void *regX7;
   void *regX8;
   void *regX9;
   void *regX10;
   void *regX11;
   void *regX12;
   void *regX13;
   void *regX14;
   void *regX15;
   void *regX16;
   void *regX17;
   void *regX18;
   void *regX19;
   void *regX20;
   void *regX21;
   void *regX22;
   void *regX23;
   void *regX24;
   void *regX25;
   void *regX26;
   void *regX27;
   void *regX28;
   void *regX29;
asm ("mov %0, x0\n\t" : "=r" (regX0) );
asm ("mov %0, x1\n\t" : "=r" (regX1) );
asm ("mov %0, x2\n\t" : "=r" (regX2) );
asm ("mov %0, x3\n\t" : "=r" (regX3) );
asm ("mov %0, x4\n\t" : "=r" (regX4) );
asm ("mov %0, x5\n\t" : "=r" (regX5) );
asm ("mov %0, x6\n\t" : "=r" (regX6) );
asm ("mov %0, x7\n\t" : "=r" (regX7) );
asm ("mov %0, x8\n\t" : "=r" (regX8) );
asm ("mov %0, x9\n\t" : "=r" (regX9) );
asm ("mov %0, x10\n\t" : "=r" (regX10) );
asm ("mov %0, x11\n\t" : "=r" (regX11) );
asm ("mov %0, x12\n\t" : "=r" (regX12) );
asm ("mov %0, x13\n\t" : "=r" (regX13) );
asm ("mov %0, x14\n\t" : "=r" (regX14) );
asm ("mov %0, x15\n\t" : "=r" (regX15) );
asm ("mov %0, x16\n\t" : "=r" (regX16) );
asm ("mov %0, x17\n\t" : "=r" (regX17) );
asm ("mov %0, x18\n\t" : "=r" (regX18) );
asm ("mov %0, x19\n\t" : "=r" (regX19) );
asm ("mov %0, x20\n\t" : "=r" (regX20) );
asm ("mov %0, x21\n\t" : "=r" (regX21) );
asm ("mov %0, x22\n\t" : "=r" (regX22) );
asm ("mov %0, x23\n\t" : "=r" (regX23) );
asm ("mov %0, x24\n\t" : "=r" (regX24) );
asm ("mov %0, x25\n\t" : "=r" (regX25) );
asm ("mov %0, x26\n\t" : "=r" (regX26) );
asm ("mov %0, x27\n\t" : "=r" (regX27) );
asm ("mov %0, x28\n\t" : "=r" (regX28) );
asm ("mov %0, x29\n\t" : "=r" (regX29) );
  outBuffer.x0 = regX0;
  outBuffer.x1 = regX1;
  outBuffer.x2 = regX2;
  outBuffer.x3 = regX3;
  outBuffer.x4 = regX4;
  outBuffer.x5 = regX5;
  outBuffer.x6 = regX6;
  outBuffer.x7 = regX7;
  outBuffer.x8 = regX8;
  outBuffer.x9 = regX9;
  outBuffer.x10 = regX10;
  outBuffer.x11 = regX11;
  outBuffer.x12 = regX12;
  outBuffer.x13 = regX13;
  outBuffer.x14 = regX14;
  outBuffer.x15 = regX15;
  outBuffer.x16 = regX16;
  outBuffer.x17 = regX17;
  outBuffer.x18 = regX18;
  outBuffer.x19 = regX19;
  outBuffer.x20 = regX20;
  outBuffer.x21 = regX21;
  outBuffer.x22 = regX22;
  outBuffer.x23 = regX23;
  outBuffer.x24 = regX24;
  outBuffer.x25 = regX25;
  outBuffer.x26 = regX26;
  outBuffer.x27 = regX27;
  outBuffer.x28 = regX28;
  outBuffer.x29 = regX29;

sibest commented 3 years ago

GcRegCapture.zip Arm64 + Armv7 register capture. Win+Mac+iOS+Android - GC stresstest passed!

hughsando commented 3 years ago

Fantastic that is all working.

I see the Arm64 designation Callee-saved registers (X19-X29), so these definitely need to be dealt with. I'm a bit surprised setjmp did not capture all of these. But this looks very nice. Do you think you could run the stress test again only saving x19 to x29 ? Might not be much difference in speed, but could potentially get run billions of times so snipping a few cycles is worthwhile.

sibest commented 3 years ago

Ok I'm going to test. Do you think that for Win 32 the /Oy option (omit framepointer) can lead to ptr miss ?

hughsando commented 3 years ago

I have made a tweak to the setjmp code, and would appreciate knowing whether this fixes the original issues. There is also a reduced "reg move" solution in there which may do a similar job - I'm hoping this is slightly better than the full "reg move" solution.

rainyt commented 1 year ago

Hi, All, I still have a crash in IOS ARM64

I try to use Git hxcpp, crash.
I try to use -O1, crash too.

What might I have missed?

HaxeFoundation / hxcpp

Missing Ptrs / Bounty US 1000$ to fix the Garbage Collector #945