dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.25k stars 4.73k forks source link

Pinned object problems #37205

Open rimim opened 4 years ago

rimim commented 4 years ago

Description

I'm investigating a very sporadic heap corruption that seems to happen when an object gets pinned at or very near the beginning of a (gen 2) GC heap segment. This is an observation based on looking at the heap after the crash. I'm able to reproduce a similar crash more reliably by using a stress test that pins and unpins a million objects randomly in 10 threads and one thread that periodically requests a full GC every 30 seconds. I run 60 or more instances of the test concurrently and can then trigger a failure in 2-10 minutes.

Turning on assertions I seem to consistently see the following assertion trigger:

static void WKS::gc_heap::set_allocator_next_pin(WKS::generation*): Assertion !((plug < generation_allocation_pointer (gen)) && (plug >= heap_segment_mem (generation_allocation_segment (gen))))

Here is an excerpt from the plan_phase debug output (I've attached the full log). N.B.: I've added a couple of print outs.

    ....
changed limit to commit: d1e17000
SANP: gen2, ptr; d1ce6000, limit: d1e17000

aic: ptr: d1ce6140, limit: d1e17000, sr: d1ce6000
  allocate_in_condemned 56b73760, 308, 2, 0 0, d2d27000  -------> d3126ec4
  new_address
(c)[d3126ec4->d3126ff8, NA: [d1ce600c(21237432), d1ce6140[: 134(1)
d3126ec4 was padded
end of seg: new tree, sequence# 0
 From d1ce6000 to d1e095c0
+d1ce60f8+
pinned_plug_p
pinned we need to extend to the next marked object
d1ce6130[
d1ce6130[(38)
adding 56 to gen2 surv
set_pinned_info(0xd1ce60f8,56,0x56b73760)
SANP: gen2, ptr; d1ce6140, limit: d1e17000
<br>

======================================
BAD PIN:
   plug                          : 0xd1ce60f8
   gen                           : 0x56b73760 (consign)
   generation_allocation_pointer : 0xd1ce6140
   generation_allocation_segment : 0xd1ce5000
                heap_segment_mem : 0xd1ce6000

static void WKS::gc_heap::set_allocator_next_pin(WKS::generation*): Assertion !((plug < generation_allocation_pointer (gen)) && (plug >= heap_segment_mem (generation_allocation_segment (gen))))

If the test runs without assertions it will later crash in process_ephemeral_boundaries() where nseg becomes NULL and crashes calling heap_segment_mem(nseg).

process_ephemeral_boundaries
...
nseg = heap_segment_next_rw (nseg);
generation_allocation_segment (consing_gen) = nseg;
generation_allocation_pointer (consing_gen) =
    heap_segment_mem (nseg);

I'll continue to investigate, but was curious if you folks could manage some pointers (ha! gc-jokes is there anything funnier).

Full log file

I've attached the tracing information from the plan_phase() File: log_47.txt.gz

Configuration

Linux x86

Regression?

This issue happens with any version of the GC.

Heap segments
Ephemeral heap segment = 0xd1ceb000
generation 0, heap segment 0xD1CEB000, starts at 0xD1CEC03C
generation 1, heap segment 0xD1CEB000, starts at 0xD1CEC000
generation 2, heap segment 0xF6C80000, starts at 0xF6C81000
       segment            begin          committed          reserved         allocated       size(allocated-begin)
    0xd1ceb000       0xd1cec000         0xd1dd5000        0xd20eb000        0xd1dcd2f4       0xe12f4(922356)
    0xf6c80000       0xf6c81000         0xf7080000        0xf7080000        0xf707fff4       0x3feff4(4190196)
    0xd3dff000       0xd3e00000         0xd41ff000        0xd41ff000        0xd41fefe8       0x3fefe8(4190184)
    0xd2bec000       0xd2bed000         0xd2fec000        0xd2fec000        0xd2febffc       0x3feffc(4190204)
    0xd38e3000       0xd38e4000         0xd3ce3000        0xd3ce3000        0xd3ce2e60       0x3fee60(4189792)
    0xd34e3000       0xd34e4000         0xd35c5000        0xd38e3000        0xd35c0a68       0xdca68(903784)
Large object heap starts at 0xf7081000
       segment            begin          committed          reserved         allocated       size(allocated-begin)
    0xf7080000       0xf7081000         0xf7452000        0xf7480000        0xf7451930       0x3d0930(4000048)
    0xf5a00000       0xf5a01000         0xf5af6000        0xf5e00000        0xf5af5260       0xf4260(1000032)

Other information

I understand that pinned objects are bad and should be avoided. I'm using the GC outside of CoreCLR, but will attempt to reproduce the same issue using CLR.

Dotnet-GitSync-Bot commented 4 years ago

I couldn't figure out the best area label to add to this issue. Please help me learn by adding exactly one area label.

benaadams commented 4 years ago

/cc @Maoni0

ghost commented 4 years ago

Tagging subscribers to this area: @Maoni0 Notify danmosemsft if you want to be subscribed.

Maoni0 commented 4 years ago

I'm using the GC outside of CoreCLR,

you mean you are using the GC dll with a different VM from coreclr?

I can't make a whole of sense from your debug output - a lot of this looked like added/modified by yourself. some observations - you mentioned "when an object gets pinned at or very near the beginning of a (gen 2) GC heap segment." but in the debug output the pin is very close to the beginning of the ephemeral segment. and I dunno which segment the plug that you are trying to call allocate_in_condemned is on (since d3126ec4 doesn't appear to be on any of the segment in your segment list).

regardless, the problem is you are allocating into the space that you have not gone through, ie, when allocate_in_condemned was called, it hasn't discovered the pin 0xd1ce60f8 on the ephemeral seg yet. so it allocated into the space that overlaps with the pin which obviously can't work. the question is why you are allocating into a space that you have not gone through because that would mean the compaction is actually taking up more space which should not happen. I would suggest to look from that perspective.

rimim commented 4 years ago

I'm using the GC outside of CoreCLR,

you mean you are using the GC dll with a different VM from coreclr? Correct.

I can't make a whole of sense from your debug output - a lot of this looked like added/modified by yourself. some observations - you mentioned "when an object gets pinned at or very near the beginning of a (gen 2) GC heap segment." but in the debug output the pin is very close to the beginning of the ephemeral segment. and I dunno which segment the plug that you are trying to call allocate_in_condemned is on (since d3126ec4 doesn't appear to be on any of the segment in your segment list).

I apologize for extra debug info. Mostly it was just extra lines to see which conditional paths were taken. When I said that the object was at the beginning of a heap segment I mean it just started looking at a new heap segment from the heap_segment_next_rw list. Shortly before the assertion you see this snippit:

end of seg: new tree, sequence# 0
 From d1ce6000 to d1e095c0

Which corresponds to this conditional block in plan_phase:

    if (x >= end)
    {
        assert (x == end);
        assert (heap_segment_allocated (seg1) == end);
        heap_segment_allocated (seg1) = plug_end;

        current_brick = update_brick_table (tree, current_brick, x, plug_end);
        dprintf (3, ("end of seg: new tree, sequence# 0"));
        sequence_number = 0;
        tree = 0;

        if (heap_segment_next_rw (seg1))
        {
            seg1 = heap_segment_next_rw (seg1);
            end = heap_segment_allocated (seg1);
            plug_end = x = heap_segment_mem (seg1);
            current_brick = brick_of (x);
            dprintf(3,( " From %Ix to %Ix", (size_t)x, (size_t)end));
            continue;
        }
        else
        {
            break;
        }
    }

regardless, the problem is you are allocating into the space that you have not gone through, ie, when allocate_in_condemned was called, it hasn't discovered the pin 0xd1ce60f8 on the ephemeral seg yet. so it allocated into the space that overlaps with the pin which obviously can't work. the question is why you are allocating into a space that you have not gone through because that would mean the compaction is actually taking up more space which should not happen. I would suggest to look from that perspective.

Thank you for your pointers. I'll continue to investigate.

rimim commented 4 years ago

Here is where things seem to go wrong during the gen2 plan_phase(). When scanning very near the end of full heap_segment it runs into a pinned object followed by marked objects. The plan_phase() needs to create a plug in the next heap_segment for the 12 byte pad.

For the problem to happen the pinned_plug_que_empty_p() must be true.

plug_start points to the old heap_segment (d3cff000-d40feffc).

allocate_in_condemned_generations() advances to the next heap_segment (d33c0000-d37bfef8) and updates the generation_allocation_pointer.

When it returns (plug_start - new_address) will be problematic as plug_start is in a different heap_segment.

Example heap_segment layout
--- Scanning this segment
====HEAP_SEGMENT : d3cff000-d40feffc
.....
d40fef98:20 [OBJECT] MARKED PINNED
d40fefac:20 [OBJECT] MARKED
d40fefc0:20 [OBJECT] MARKED
d40fefd4:20 [OBJECT] MARKED
d40fefe8:20 [OBJECT] MARKED
====END HEAP_SEGMENT

--- Advancing to this segment
====HEAP_SEGMENT : d33c0000-d37bfef8
d33c1000:12 [FREE]
d33c100c:20 [OBJECT] MARKED PINNED
d33c1020:20 [OBJECT] MARKED
d33c1034:12 [FREE]
d33c1040:20 [OBJECT] MARKED PINNED
d33c1054:20 [OBJECT] MARKED
......

plug_start = d40fefc0

calling allocate_in_condemned() gets a new_address

(c)[d40fefc0->d40feffc, NA: [d33c100c(13885364), d33c1048[: 3c(1)
                                       |BAD| <-- (plug_start - new_address)

At this point the generation_allocation_pointer() points to d33c1048, but when plan_phase() returns to the outer while loop:

while ((x < end) && marked (x))

The outer loop will start scanning from d33c1000 to d37bfef8 and when it tries to call set_pinned_info() for d33c100c it will assert because the plug is not in the range of the allocation segment (which was advanced above).

assert (!((plug < generation_allocation_pointer (gen)) &&
          (plug >= heap_segment_mem (generation_allocation_segment (gen)))));

Without the assertion it will often cause a heap corruption.

N.B.: I'm running in 32-bit without concurrency.

Maoni0 commented 4 years ago

it's perfectly legal for the new_address to be on a different segment. (plug_start - new_address) can be a large value - it's just there for informational purposes. the question is still, why would it be allocated on a segment that it has not gone through,therefore the pinned plug has not been discovered. when we are attempting to allocate at the end of a segment we make sure we don't pad at tail.

rimim commented 4 years ago

For the issue to manifest you need at least two gen2 heap_segments. The first heap_segment must be at capacity and cannot be grown. The heap_segment contains a pinned object followed by at least 2 (or more) marked objects and no free space. The last marked object must exist at the VERY END of the GC heap_segment. The following heap_segment needs a marked object (pinning isn't important) followed by some free space.

Then during a full GC the pinned plug will be extended to the first marked object, creating an extended plug. A normal plug is made of the second marked object. This creates a "saved_post_plug" before the second marked object - normal operation.

The issue comes with the second plug that gets logically allocated into the next heap_segment, since there is no gap to hold the plug in the current heap segment. This is essentially the example within p639 of the 'Pro .NET Memory Management' book, but the last/second plug is right up against the end of the heap_segment. If there is no room at the beginning of the next GC heap_segment (assume free/marked/free objects at the beginning of the next heap_segment but this second plug can't fit into the first free space), this second plug gets logically allocated in a heap_segment that has not been scanned yet. If assertions are enabled, then this condition gets asserted during the GC plan phase because heap corruption is likely to occur. A segfault occurs later within the GC plan (or compact) phase when scanning that next heap segment as we have an overlapping allocated area within the next heap_segment.

A simple hack to work around the issue (not a fix because it is incomplete and horrible) is to pin any object following a pinned object on the last page of gen2 heap_segment that has a following heap_segment and no trailing free space. With that hack in place I am no longer able to crash the GC nor trigger an assertion using my stress test. The hack is incomplete because the pin could happen before the last page and be followed by marked objects to the end of the heap_segment.

I'm going to see if I can figure out the correct solution, but wanted to share my findings with your team. So far I've been unsuccessful in reproducing the problem using C#. It's a pretty particular boundary condition, so it doesn't mean that it doesn't exist. Perhaps it's just very rare.

And yes not pinning anything also solves the issue.