rimim opened this issue 4 years ago
I couldn't figure out the best area label to add to this issue. Please help me learn by adding exactly one area label.
/cc @Maoni0
Tagging subscribers to this area: @Maoni0. Notify danmosemsft if you want to be subscribed.
> I'm using the GC outside of CoreCLR,
you mean you are using the GC dll with a different VM from coreclr?
I can't make a whole lot of sense of your debug output - a lot of this looks like it was added/modified by yourself. some observations - you mentioned "when an object gets pinned at or very near the beginning of a (gen 2) GC heap segment." but in the debug output the pin is very close to the beginning of the ephemeral segment. and I dunno which segment the plug that you are trying to call `allocate_in_condemned` on is (since d3126ec4 doesn't appear to be on any of the segments in your segment list).

regardless, the problem is you are allocating into space that you have not gone through, i.e., when `allocate_in_condemned` was called, it hadn't discovered the pin 0xd1ce60f8 on the ephemeral seg yet. so it allocated into space that overlaps with the pin, which obviously can't work. the question is why you are allocating into a space that you have not gone through, because that would mean the compaction is actually taking up more space, which should not happen. I would suggest looking from that perspective.
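To make that invariant concrete, here is a toy single-segment sketch (the names are illustrative, not gc.cpp's): compaction only re-packs plugs it has already scanned, so the allocation pointer must trail the scan cursor; overtaking it means placing data over pins that have not been discovered yet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Toy model of the plan phase's two cursors on one segment.
struct PlanState {
    uint8_t* scan_cursor;    // how far the plan phase has looked (x)
    uint8_t* alloc_pointer;  // where compaction will place the next plug
};

uint8_t* plan_allocate(PlanState& s, size_t size)
{
    uint8_t* result = s.alloc_pointer;
    s.alloc_pointer += size;
    // Compaction can only shrink the heap: allocating past what has been
    // scanned would mean it grew, and might overlap an undiscovered pin.
    assert(s.alloc_pointer <= s.scan_cursor &&
           "plan allocated into space it has not gone through");
    return result;
}

int main()
{
    uint8_t segment[4096];
    PlanState s { segment + 1024, segment };  // 1 KB scanned so far
    plan_allocate(s, 512);                    // OK: stays behind the cursor
    puts("first allocation trails the scan cursor: OK");
    plan_allocate(s, 1024);                   // 512 + 1024 > 1024: asserts
    return 0;
}
```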
> > I'm using the GC outside of CoreCLR,
>
> you mean you are using the GC dll with a different VM from coreclr?

Correct.

> I can't make a whole lot of sense of your debug output - a lot of this looks like it was added/modified by yourself. some observations - you mentioned "when an object gets pinned at or very near the beginning of a (gen 2) GC heap segment." but in the debug output the pin is very close to the beginning of the ephemeral segment. and I dunno which segment the plug that you are trying to call `allocate_in_condemned` on is (since d3126ec4 doesn't appear to be on any of the segments in your segment list).
I apologize for the extra debug info. Mostly it was just extra lines to see which conditional paths were taken. When I said that the object was at the beginning of a heap segment, I meant that it had just started looking at a new heap segment from the `heap_segment_next_rw` list. Shortly before the assertion you see this snippet:

```
end of seg: new tree, sequence# 0
 From d1ce6000 to d1e095c0
```

which corresponds to this conditional block in `plan_phase`:
```cpp
if (x >= end)
{
    // The scan cursor x has reached the end of the current segment.
    assert (x == end);
    assert (heap_segment_allocated (seg1) == end);
    // Trim the segment's allocated end down to the end of the last plug.
    heap_segment_allocated (seg1) = plug_end;
    current_brick = update_brick_table (tree, current_brick, x, plug_end);
    dprintf (3, ("end of seg: new tree, sequence# 0"));
    sequence_number = 0;
    tree = 0;
    if (heap_segment_next_rw (seg1))
    {
        // Advance to the next segment and restart scanning at its base.
        seg1 = heap_segment_next_rw (seg1);
        end = heap_segment_allocated (seg1);
        plug_end = x = heap_segment_mem (seg1);
        current_brick = brick_of (x);
        dprintf (3, (" From %Ix to %Ix", (size_t)x, (size_t)end));
        continue;
    }
    else
    {
        break;
    }
}
```
> regardless, the problem is you are allocating into space that you have not gone through, i.e., when `allocate_in_condemned` was called, it hadn't discovered the pin 0xd1ce60f8 on the ephemeral seg yet. so it allocated into space that overlaps with the pin, which obviously can't work. the question is why you are allocating into a space that you have not gone through, because that would mean the compaction is actually taking up more space, which should not happen. I would suggest looking from that perspective.
Thank you for your pointers. I'll continue to investigate.
Here is where things seem to go wrong during the gen2 `plan_phase()`. When scanning very near the end of a full `heap_segment`, it runs into a pinned object followed by marked objects. `plan_phase()` needs to create a plug in the next `heap_segment` for the 12-byte pad. For the problem to happen, `pinned_plug_que_empty_p()` must be `true`.

`plug_start` points to the old `heap_segment` (d3cff000-d40feffc). `allocate_in_condemned_generations()` advances to the next `heap_segment` (d33c0000-d37bfef8) and updates the `generation_allocation_pointer`. When it returns, `(plug_start - new_address)` will be problematic, as `plug_start` is in a different `heap_segment`.
--- Scanning this segment:

```
====HEAP_SEGMENT : d3cff000-d40feffc
.....
d40fef98:20 [OBJECT] MARKED PINNED
d40fefac:20 [OBJECT] MARKED
d40fefc0:20 [OBJECT] MARKED
d40fefd4:20 [OBJECT] MARKED
d40fefe8:20 [OBJECT] MARKED
====END HEAP_SEGMENT
```

--- Advancing to this segment:

```
====HEAP_SEGMENT : d33c0000-d37bfef8
d33c1000:12 [FREE]
d33c100c:20 [OBJECT] MARKED PINNED
d33c1020:20 [OBJECT] MARKED
d33c1034:12 [FREE]
d33c1040:20 [OBJECT] MARKED PINNED
d33c1054:20 [OBJECT] MARKED
......
```

With `plug_start = d40fefc0`, calling `allocate_in_condemned()` gets a `new_address`:

```
(c)[d40fefc0->d40feffc, NA: [d33c100c(13885364), d33c1048[: 3c(1)
|BAD| <-- (plug_start - new_address)
```
At this point the `generation_allocation_pointer()` points to d33c1048, but then `plan_phase()` returns to the outer while loop:

```cpp
while ((x < end) && marked (x))
```

The outer loop will start scanning from d33c1000 to d37bfef8, and when it tries to call `set_pinned_info()` for d33c100c it will assert, because the plug is not in the range of the allocation segment (which was advanced above).
```cpp
assert (!((plug < generation_allocation_pointer (gen)) &&
          (plug >= heap_segment_mem (generation_allocation_segment (gen)))));
```
Without the assertion this will often cause heap corruption.
N.B.: I'm running in 32-bit without concurrency.
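To restate the failing invariant as standalone code, here is a minimal sketch using the addresses from the dump above (the struct and helper names are simplified stand-ins for gc.cpp's heap_segment/generation state, not the real layout):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Simplified stand-ins for the GC's segment and generation state.
struct heap_segment_model { uint8_t* mem; uint8_t* allocated; };
struct generation_model   { heap_segment_model* allocation_segment;
                            uint8_t* allocation_pointer; };

// Mirrors the assert: a pin must not fall inside the region
// [segment start, allocation pointer) that compaction has already
// handed out, or live data will be overwritten.
bool plug_overlaps_allocated_region(const generation_model& gen, uint8_t* plug)
{
    return plug < gen.allocation_pointer &&
           plug >= gen.allocation_segment->mem;
}

int main()
{
    // Addresses taken from the debug output above (32-bit process).
    heap_segment_model next_seg { (uint8_t*)0xd33c0000, (uint8_t*)0xd37bfef8 };
    generation_model   gen2     { &next_seg, (uint8_t*)0xd33c1048 };

    // allocate_in_condemned_generations() advanced the allocation pointer
    // past the not-yet-discovered pin at d33c100c...
    uint8_t* undiscovered_pin = (uint8_t*)0xd33c100c;

    // ...so when the plan phase later scans this segment and finds the pin,
    // the invariant is already violated.
    assert(plug_overlaps_allocated_region(gen2, undiscovered_pin));
    printf("pin %p is below the allocation pointer %p: invariant violated\n",
           (void*)undiscovered_pin, (void*)gen2.allocation_pointer);
}
```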
it's perfectly legal for the `new_address` to be on a different segment. `(plug_start - new_address)` can be a large value - it's just there for informational purposes. the question is still: why would it be allocated on a segment that it has not gone through, and therefore where the pinned plug has not been discovered? when we are attempting to allocate at the end of a segment we make sure we don't pad at the tail.
For the issue to manifest you need at least two gen2 `heap_segment`s. The first `heap_segment` must be at capacity and cannot be grown. That `heap_segment` contains a pinned object followed by at least two marked objects and no free space. The last marked object must sit at the VERY END of the GC `heap_segment`. The following `heap_segment` needs a marked object (pinning isn't important) followed by some free space.

Then during a full GC the pinned plug will be extended to include the first marked object, creating an extended plug. A normal plug is made of the second marked object. This creates a "saved_post_plug" before the second marked object - normal operation.
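As background on that "saved_post_plug" bookkeeping, here is a rough self-contained sketch of the idea (the struct layout and function names below are illustrative only; the real bookkeeping lives in the mark entries of the GC's pin queue): extending a pinned plug over its neighbor clobbers the neighbor's header area, so those bytes are saved first and restored after compaction.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Simplified model of per-pin bookkeeping; not the collector's real layout.
struct pinned_plug_entry {
    uint8_t* last;                 // end of the (extended) pinned plug
    uint8_t  saved_post_plug[8];   // boundary bytes that will be clobbered
    bool     has_post_plug_info;
};

// Extending a pinned plug over the object that follows it means the next
// plug's header area gets overwritten by relocation bookkeeping, so the
// collector first saves those bytes...
void save_post_plug_info(pinned_plug_entry& e, uint8_t* boundary) {
    e.last = boundary;
    std::memcpy(e.saved_post_plug, boundary, sizeof e.saved_post_plug);
    e.has_post_plug_info = true;
}

// ...and copies them back once compaction is done.
void restore_post_plug_info(pinned_plug_entry& e) {
    if (e.has_post_plug_info)
        std::memcpy(e.last, e.saved_post_plug, sizeof e.saved_post_plug);
}

int main() {
    uint8_t heap[16] = {1, 2, 3, 4, 5, 6, 7, 8};
    pinned_plug_entry e{};
    save_post_plug_info(e, heap);
    std::memset(heap, 0xCC, 8);    // simulate the clobber during planning
    restore_post_plug_info(e);
    printf("restored first byte: %d\n", heap[0]);  // prints 1
}
```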
The issue comes with the second plug, which gets logically allocated into the next `heap_segment`, since there is no gap to hold the plug in the current heap segment. This is essentially the example on p. 639 of the 'Pro .NET Memory Management' book, but the last/second plug is right up against the end of the `heap_segment`. If there is no room at the beginning of the next GC `heap_segment` (assume free/marked/free objects at the beginning of the next `heap_segment`, but this second plug can't fit into the first free space), this second plug gets logically allocated in a `heap_segment` that has not been scanned yet. If assertions are enabled, this condition gets asserted during the GC plan phase because heap corruption is likely to occur. Without assertions, a segfault occurs later within the GC plan (or compact) phase when scanning that next heap segment, as we now have an overlapping allocated area within the next `heap_segment`.
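Condensing those preconditions into code, here is a small self-contained sketch that checks a toy two-segment layout (all types and names here are mine, not the collector's):

```cpp
#include <cstdio>
#include <vector>

enum class Kind { Free, Marked, Pinned };

struct Segment {
    std::vector<Kind> objs;   // objects in address order
    bool at_capacity;         // the segment cannot be grown
    bool trailing_free;       // free space after the last object
};

bool layout_can_trigger_bug(const Segment& a, const Segment& b)
{
    size_t n = a.objs.size();
    if (!a.at_capacity || a.trailing_free || n < 3) return false;
    // Tail of the first segment: pinned, then >= 2 marked objects flush
    // against the very end of the segment.
    bool tail_ok = a.objs[n - 3] == Kind::Pinned &&
                   a.objs[n - 2] == Kind::Marked &&
                   a.objs[n - 1] == Kind::Marked;
    // Next segment: some live object followed by free space that the
    // trailing plug may not fit into.
    bool next_ok = false;
    for (size_t i = 0; i + 1 < b.objs.size(); ++i)
        if (b.objs[i] != Kind::Free && b.objs[i + 1] == Kind::Free)
            next_ok = true;
    return tail_ok && next_ok;
}

int main()
{
    Segment first { { Kind::Marked, Kind::Pinned, Kind::Marked, Kind::Marked },
                    /*at_capacity=*/true, /*trailing_free=*/false };
    Segment next  { { Kind::Free, Kind::Pinned, Kind::Marked, Kind::Free },
                    false, true };
    printf("layout can trigger the bug: %s\n",
           layout_can_trigger_bug(first, next) ? "yes" : "no");
}
```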
A simple hack to work around the issue (not a fix, because it is incomplete and horrible) is to pin any object following a pinned object on the last page of a gen2 `heap_segment` that has a following `heap_segment` and no trailing free space. With that hack in place I am no longer able to crash the GC or trigger an assertion using my stress test. The hack is incomplete because the pin could happen before the last page and be followed by marked objects running to the end of the `heap_segment`.
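For concreteness, here is a toy sketch of that workaround's logic (the segment model and names below are mine; hooking this into gc.cpp's real segment walk is left out):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct ToyObject  { uintptr_t addr; size_t size; bool pinned; };
struct ToySegment {
    std::vector<ToyObject> objs;  // objects in address order
    uintptr_t allocated;          // end of allocated space
    uintptr_t reserved;           // end of the segment
    bool has_next;                // a following gen2 segment exists
};

constexpr uintptr_t PAGE_SIZE = 4096;

// Workaround sketch: on a full segment with a successor, force-pin every
// object on the last page that follows a pinned object, so the trailing
// plug can never spill into the not-yet-scanned next segment.
void pin_tail_followers(ToySegment& seg)
{
    if (!seg.has_next || seg.allocated != seg.reserved)
        return;  // only full segments with a successor are at risk
    uintptr_t last_page = (seg.allocated - 1) & ~(PAGE_SIZE - 1);
    bool after_pin = false;
    for (auto& o : seg.objs) {
        if (o.addr >= last_page && after_pin)
            o.pinned = true;  // the hack: force-pin the follower
        after_pin = o.pinned;
    }
}

int main()
{
    // Layout mirroring the failing dump: a pinned object near the segment
    // end, with marked objects flush against `allocated`.
    ToySegment seg { { { 0xd40fef98, 20, true  },
                       { 0xd40fefac, 20, false },
                       { 0xd40fefc0, 20, false } },
                     /*allocated=*/0xd40fefd4, /*reserved=*/0xd40fefd4,
                     /*has_next=*/true };
    pin_tail_followers(seg);
    for (const auto& o : seg.objs)
        printf("%lx pinned=%d\n", (unsigned long)o.addr, (int)o.pinned);
}
```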
I'm going to see if I can figure out the correct solution, but wanted to share my findings with your team. So far I've been unsuccessful in reproducing the problem using C#. It's a pretty particular boundary condition, so that doesn't mean it doesn't exist; perhaps it's just very rare.
And yes, not pinning anything also solves the issue.
Description
I'm investigating a very sporadic heap corruption that seems to happen when an object gets pinned at or very near the beginning of a (gen 2) GC heap segment. This is an observation based on looking at the heap after the crash. I'm able to reproduce a similar crash more reliably by using a stress test that pins and unpins a million objects randomly in 10 threads and one thread that periodically requests a full GC every 30 seconds. I run 60 or more instances of the test concurrently and can then trigger a failure in 2-10 minutes.
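As a shape for such a harness (not the author's actual test; the host hooks below are stubs with hypothetical names that a real embedder would wire to its handle table and to the collector's GarbageCollect entry point):

```cpp
#include <atomic>
#include <chrono>
#include <cstdlib>
#include <random>
#include <thread>
#include <vector>

// Stub host hooks so the sketch compiles standalone; a real embedder
// would route these to pinned GC handles and a forced gen2 collection.
static void* AllocateObject(size_t size) { return std::malloc(size); }
static void* PinObject(void* obj)        { return obj; }  // stub handle
static void  UnpinObject(void*)          {}
static void  ForceFullGC()               {}

static std::atomic<bool> g_stop{false};

// Each worker randomly pins and unpins objects as fast as it can.
static void pin_unpin_worker(unsigned seed)
{
    std::mt19937 rng(seed);
    std::vector<void*> handles;
    while (!g_stop.load(std::memory_order_relaxed)) {
        if (!handles.empty() && (rng() & 1)) {
            UnpinObject(handles.back());
            handles.pop_back();
        } else if (handles.size() < 1'000'000) {
            handles.push_back(PinObject(AllocateObject(20)));
        }
    }
    for (void* h : handles) UnpinObject(h);
}

int main()
{
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < 10; ++i)     // 10 pin/unpin threads
        workers.emplace_back(pin_unpin_worker, i);

    for (int i = 0; i < 4; ++i) {         // full GC every 30 seconds
        std::this_thread::sleep_for(std::chrono::seconds(30));
        ForceFullGC();
    }
    g_stop = true;
    for (auto& t : workers) t.join();
}
```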
Turning on assertions I seem to consistently see the following assertion trigger:
Here is an excerpt from the `plan_phase` debug output (I've attached the full log). N.B.: I've added a couple of printouts.

If the test runs without assertions it will later crash in `process_ephemeral_boundaries()`, where `nseg` becomes `NULL` and it crashes calling `heap_segment_mem(nseg)`.
I'll continue to investigate, but was curious if you folks could manage some pointers (ha! gc-jokes, is there anything funnier?).
Full log file

I've attached the tracing information from `plan_phase()`. File: log_47.txt.gz

Configuration
Linux x86
Regression?
This issue happens with any version of the GC.
Heap segments
Other information
I understand that pinned objects are bad and should be avoided. I'm using the GC outside of CoreCLR, but will attempt to reproduce the same issue using CLR.