Open discostu105 opened 3 years ago
Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.
| Author: | discostu105 |
|---|---|
| Assignees: | - |
| Labels: | `area-GC-coreclr`, `untriaged` |
| Milestone: | - |
Hi @discostu105,
Thanks for reaching out to start the discussion.
The GC profiling available over ICorProfiler has historically had a very high performance impact, and we have done some work over the last couple of releases to improve it. I'm happy to work with you to identify ways we can iterate even more.
Just to make sure you're aware of the recent work I'll mention it here:
We first added the concept of "lightweight GC Profiling" that enables just GC start and end and updates of generational bounds: https://github.com/dotnet/coreclr/pull/22866
Then we added "medium weight GC profiling" providing some APIs to be able to track objects more efficiently: https://github.com/dotnet/coreclr/pull/24156
And then as you mention in 5.0 we added the ability to get EventPipe events over ICorProfiler.
For the specific issues you point out, I have a couple follow up questions.
We can obtain the size in the GarbageCollectionStarted profiler callback with the ICorProfilerInfo::GetObjectSize method if we track the ObjectId. However, enabling this profiler callback increases the overhead significantly.
Have you tried getting GC start events with lightweight GC profiling enabled? Hopefully that should be very little overhead.
There also is the option of using only the EventPipe events and doing GC profiling the same way the dotnet team's tools (e.g. PerfView) already do GC profiling. I am not an expert in them, but can help wade through the details if you want to go that route.
Overhead still higher than in java
Can you give specific numbers for what the overhead difference is when collecting the same type of data?
Hi @davmason,
thank you for the link to the lightweight GC profiling. It is indeed possible to obtain the size of array objects in the GC-started callback with the COR_PRF_HIGH_BASIC_GC event mask enabled. As you mentioned, this has very low overhead. Enabling it in addition to the mentioned event pipe events adds no significant overhead.
However, array sizes can only be obtained at collection time, not at allocation time. In situations where few garbage collections occur, we must wait for the next collection before we can obtain the sizes of allocated arrays. This makes it more complex to report the amount of memory allocated in a certain timeframe, since the sizes of arrays allocated during that timeframe may still be unknown when it ends if no GC ran. So while it is not ideal, this already helps us get array sizes most of the time, especially during periods of high GC activity, which are the interesting situations anyway.
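As a rough illustration of this deferral, here is a simplified C++ sketch (not real ICorProfilerInfo code; the class and callback names are assumptions of this sketch) that parks sampled ObjectIds until a size lookup becomes usable at GC start:

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>

// Sketch of the deferral described above: ObjectIds whose size is unknown at
// AllocationTick time are parked until the next GC, when a size lookup
// (e.g. ICorProfilerInfo::GetObjectSize, modeled here as a callback) becomes
// usable. All names here are illustrative assumptions.
class PendingArraySizes {
public:
    void OnSampledArrayAllocation(uint64_t objectId) {
        pending_.emplace(objectId, 0);  // 0 = size not yet known
    }

    // Called at GC start; resolves every parked object and returns how many
    // sizes became known, so a reporting timeframe can account for them late.
    size_t OnGarbageCollectionStarted(
            const std::function<uint64_t(uint64_t)>& getObjectSize) {
        size_t resolved = 0;
        for (auto& entry : pending_) {
            if (entry.second == 0) {
                entry.second = getObjectSize(entry.first);
                ++resolved;
            }
        }
        return resolved;
    }

    // Sum of all sizes known so far (unresolved entries contribute nothing).
    uint64_t TotalKnownBytes() const {
        uint64_t total = 0;
        for (const auto& entry : pending_) total += entry.second;
        return total;
    }

private:
    std::unordered_map<uint64_t, uint64_t> pending_;  // ObjectId -> size
};
```

This also makes the limitation visible: until `OnGarbageCollectionStarted` runs, `TotalKnownBytes` under-reports the timeframe.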
In an ideal world though, it would be preferable if we could obtain the object size of the last allocated object during the EventPipeEventDelivered callback for the AllocationTick event (e.g., by including it in the event data). This is currently not possible for array objects, as mentioned in https://github.com/dotnet/runtime/issues/43345.
Concerning overhead, there are two main differences between Java and .NET that can influence it:
We can configure the sampling rate for the Java SampledObjectAlloc callback during runtime, while the sampling rate for the .NET AllocationTick event is fixed at ~100KB. With the configurable sampling rate, it's possible to dynamically adapt overhead to an acceptable level.
In Java it is possible to tag specific objects and obtain an ObjectFree callback when the object is freed by the garbage collector. For .NET, we need the GCBulkSurvivedObjectRanges and GCBulkMovedObjectRanges event pipe events to determine if an object survived the GC run. These events add high overhead if there are many garbage collections; in extreme cases, e.g. when allocating a lot of large arrays, this can add ~15% overhead.
Our Java solution generally adds very low overhead (lower than 1%). Compared to that, just enabling the necessary event pipe events described in the original post has a performance overhead of ~1-20%, depending on the number of allocations, amount of allocated memory and number of GC runs.
@d-schneider I'd definitely be interested to see how we can make this better for you. Regarding your questions:
With the configurable sampling rate, it’s possible to dynamically adapt overhead to an acceptable level.
this can definitely be a config instead of hard coded 100k.
In an ideal world though, it would be preferable if we could obtain the object size of the last allocated object during the EventPipeEventDelivered callback for the AllocationTick event
if you meant could we give you the size of the object that happened to trigger the AllocTick event, that's totally doable - GC has this info when an allocation triggered a GC.
In Java it is possible to tag specific objects and obtain an ObjectFree callback when this object is freed by the garbage collector.
can you please tell me a bit about your usage of this callback? I presume you are doing this all in native code, just like with .NET. do you normally register for this callback with say a few user specified objects? I can see how the overhead would totally go up if there were many objects that registered for this callback.
In an ideal world though, it would be preferable if we could obtain the object size of the last allocated object during the EventPipeEventDelivered callback for the AllocationTick event
if you meant could we give you the size of the object that happened to trigger the AllocTick event, that's totally doable - GC has this info when an allocation triggered a GC.
From a native ICorProfiler implementation point of view, the issue is that the event is fired before the MethodTable/array size is set on the object, so you can't call GetArrayObjectInfo on it from the callback to the AllocationTick event (since EventPipe events are synchronous for ICorProfiler). Either including the size in the event or moving the event so it is fired after the object is published should work for this case.
Thanks for the replies!
This can definitely be a config instead of hard coded 100k.
It would be great for us if the sampling rate for the AllocationTick event is configurable at runtime. As mentioned, this would help us to dynamically reduce overhead or increase accuracy in situations with few allocations.
if you meant could we give you the size of the object that happened to trigger the AllocTick event, that's totally doable - GC has this info when an allocation triggered a GC.
Yes, I meant the size of the object that triggered the AllocationTick event.
Either including the size in the event or moving the event so it is fired after the object is published should work for this case.
Both solutions would be ok for us.
can you please tell me a bit about your usage of this callback? I presume you are doing this all in native code, just like with .NET. do you normally register for this callback with say a few user specified objects? I can see how the overhead would totally go up if there were many objects that registered for this callback.
Our Java implementation is also done in native code. We register each object that triggered a SampledObjectAlloc callback for the ObjectFree callback. This way, the number of ObjectFree callbacks sent is at most as high as the number of SampledObjectAlloc callbacks.
Since we can adjust the sampling rate for the SampledObjectAlloc callback during runtime, it is possible to target a certain number of sampled allocations per timeframe (e.g., 1000 sampled allocations per minute). To reduce overhead, we only track objects for one timeframe and the object tag, which determines if an ObjectFree callback should be sent, is cleared for all remaining survivors after that timeframe.
@d-schneider
Both solutions would be ok for us.
ahh, my question was if you just wanted the size, or if you needed it to be an object that's already constructed. if it's just the size that's trivial 'cause GC already knows the size. but if you need this to be a constructed object (eg, you can call some method on this object), that would require the event to be moved as @davmason mentioned - the place where it's fired now is in GC before the methodtable is filled in.
regarding the ObjectFree callback, you could implement this via GC handles. you can allocate a weak GC handle to hold onto objects of interest and during the GC done callback check if they are nulled by the GC, if so you know they are dead. obviously this requires you to be able to allocate a GC handle in your code. so if you currently already have some way to do that (ie, you already have managed code running and can pass a delegate back to native code to reverse pinvoke to create a GC handle to hold onto these objects), that's great; if not, it'd be some work to get this managed code infra running first. it's possible to make the profiling API provide this plumbing for you (but that'd be work on the diagnostics team :)).
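A minimal model of that post-GC sweep, assuming a weak handle is simply a slot the collector nulls when the referent dies (a toy sketch, not real GC-handle code):

```cpp
#include <cstddef>
#include <vector>

// Toy model of the weak-handle idea: each slot points at a tracked object;
// the GC (not modeled here) nulls the slot when the object dies. After a
// GC-finished notification, the profiler sweeps its slots and treats nulled
// ones as freed objects.
struct WeakSlot { const void* target; };

// Returns how many tracked objects were found dead; dead slots are dropped
// so they are not re-reported on the next sweep.
size_t SweepDeadHandles(std::vector<WeakSlot>& handles) {
    size_t dead = 0;
    for (auto it = handles.begin(); it != handles.end();) {
        if (it->target == nullptr) {   // GC cleared the weak reference
            ++dead;
            it = handles.erase(it);    // object freed: stop tracking it
        } else {
            ++it;
        }
    }
    return dead;
}
```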
@Maoni0 are the first two items (reporting size in the alloc event and configurable alloc tick frequency) things the GC team would take on?
I'm happy to work with @d-schneider on how to best achieve the object tracking from the profiler.
@davmason yes, I don't feel like I have a confirmation whether this event would need to be moved though (it'd be really great to avoid it 'cause it means the code has to move from the GC side to the VM side).
ahh, my question was if you just wanted the size, or if you needed it to be an object that's already constructed.
@Maoni0 Just the size is sufficient for our use case.
Concerning the AllocationTick sampling rate: The Java SampledObjectAlloc callback also uses a random variation for the sampling frequency, as described in JEP-331. If possible, a similar feature for the AllocationTick event would also be interesting for us.
Description for this from https://openjdk.java.net/jeps/331: "Note that the sampling interval is not precise. Each time a sample occurs, the number of bytes before the next sample will be chosen will be pseudo-random with the given average interval. This is to avoid sampling bias; for example, if the same allocations happen every 512KB, a 512KB sampling interval will always sample the same allocations. Therefore, though the sampling interval will not always be the selected interval, after a large number of samples, it will tend towards it."
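For concreteness, the JEP-331 behavior of drawing each gap from a distribution with the configured mean can be sketched like this (an assumption of how a runtime-configurable AllocationTick threshold could adopt the same idea; not actual runtime code):

```cpp
#include <cstdint>
#include <random>

// Randomized allocation sampler in the spirit of JEP-331: the number of
// bytes until the next sample is drawn from an exponential distribution
// whose mean is the configured interval, so the average sampling rate
// matches the setting while individual gaps vary (avoiding periodic bias).
class AllocationSampler {
public:
    explicit AllocationSampler(uint64_t mean_interval_bytes)
        : mean_(static_cast<double>(mean_interval_bytes)),
          rng_(std::random_device{}()) {
        Rearm();
    }

    // Called for every allocation; returns true when this allocation should
    // be sampled (the randomized byte threshold was crossed).
    bool OnAllocation(uint64_t size_bytes) {
        bytes_until_sample_ -= static_cast<int64_t>(size_bytes);
        if (bytes_until_sample_ <= 0) {
            Rearm();
            return true;
        }
        return false;
    }

    // The mean interval can be changed while the process runs, which is the
    // runtime-configurable behavior requested in this thread.
    void SetMeanInterval(uint64_t mean_interval_bytes) {
        mean_ = static_cast<double>(mean_interval_bytes);
        Rearm();
    }

private:
    void Rearm() {
        std::exponential_distribution<double> d(1.0 / mean_);
        bytes_until_sample_ = static_cast<int64_t>(d(rng_)) + 1;
    }

    double mean_;
    std::mt19937_64 rng_;
    int64_t bytes_until_sample_ = 0;
};
```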
thanks for the info @d-schneider. have you observed that the random interval is needed often? in theory it sounds like a useful thing but in practice it should be completely rare that "same allocations happen every 512KB" - even if that happened, since we are almost always in a multi-threaded environment this means to the GC it won't see the same allocation every 512kb (ie, one thread could be doing the same alloc every 512kb but since it shares the same heap with another thread, GC won't see that alloc every 512kb on that heap).
@Maoni0 We don't really have data on how significant the bias would be without the random interval, as this is a built-in JVM feature that cannot be disabled.
@discostu105 then I would vote to not include this in our system 'cause I simply don't see it having a practical usage.
@discostu105 or @d-schneider, do you want to talk about the object tracking portion of your request?
Maoni has a great idea to use weak references, if you are already doing IL rewriting then it would be not that much work. I'm also happy to discuss adding a new API to ICorProfiler*, but then it would only be available in .net 6 or 7 and newer, depending on when it lands.
@davmason @Maoni0
Maoni has a great idea to use weak references, if you are already doing IL rewriting then it would be not that much work.
As I understand it, we would have to do a reverse pinvoke in the AllocationTick and GarbageCollectionFinished callbacks to allocate the GC handles and to check if they were nulled by the GC respectively. However, when trying this I ran into problems when calling the delegate:
In the case of the GarbageCollectionFinished callback, both the managed thread of my sample app and the GC-finished callback thread where I call the delegate hang indefinitely.
In the EventPipeEventDelivered callback for the AllocationTick event the process crashes with a "Fatal error. Invalid Program: attempted to call a UnmanagedCallersOnly method from managed code.". I am not sure why this error occurs, but it could be because we get the AllocationTick callback on the managed thread where the allocation occurred.
Is there anything special to consider for the reverse pinvoke in those cases that I might have missed?
The reverse pinvoke does work in a native worker thread, but then there could be race conditions e.g., a GC run between the allocation and when the worker thread creates a GC handle. We would have to wait in the AllocationTick callback for the worker thread to finish creating a GC handle, but this is not an optimal solution.
I haven't tried this yet, but another question is if there could be any problems when creating the GC handle for the object that triggered the AllocationTick event, considering that we currently can't get the size of the object in that callback?
I'm also happy to discuss adding a new API to ICorProfiler*, but then it would only be available in .net 6 or 7 and newer, depending on when it lands.
We think the native ICorProfiler API would be the better approach, as it would be simpler to consume. Preferably similar to Java, e.g., we can register an object for a callback when it's freed by the GC. Getting this added in a future .NET release would be great! We are happy to answer any questions regarding a possible ICorProfiler API.
As I understand it, we would have to do a reverse pinvoke in the AllocationTick and GarbageCollectionFinished callbacks to allocate the GC handles and to check if they were nulled by the GC respectively. However, when trying this I ran into problems when calling the delegate:
- In the case of the GarbageCollectionFinished callback the managed thread of my sample app and the thread of the GC finished callback where I call the delegate hang indefinitely.
Yeah, that makes sense. The GC is still considered in progress during the GarbageCollectionFinished callback, so managed code won't be able to run until you return from it and let the GC complete.
- In the EventPipeEventDelivered callback for the AllocationTick event the process crashes with a "Fatal error. Invalid Program: attempted to call a UnmanagedCallersOnly method from managed code.". I am not sure why this error occurs, but it could be because we get the AllocationTick callback on the managed thread where the allocation occurred.
This makes sense, the AllocationTick event is going to be fired in the middle of the allocation, which would be in managed code. So even though your profiler is native code, there is managed code on the stack so it triggers that error.
Is there anything special to consider for the reverse pinvoke in those cases that I might have missed?
The reverse pinvoke does work in a native worker thread, but then there could be race conditions e.g., a GC run between the allocation and when the worker thread creates a GC handle. We would have to wait in the AllocationTick callback for the worker thread to finish creating a GC handle, but this is not an optimal solution.
I hadn't thought through exactly how you would have to accomplish this, but you're right that there are a lot of potential race conditions and deadlocks. I think the only way you could accomplish it right now is how you describe it, you would have to spin up a separate thread that has no managed code on it, pass the object to the thread and then block in the AllocationTick event callback until the other thread is done allocating a handle to it.
If you go that route, you would have to be very careful not to do any allocations, and not to call any methods that allocate. Since you would be blocking inside an allocation, it would prevent a GC from running, and any allocation can trigger a GC (that would lead to a deadlock).
I haven't tried this yet, but another question is if there could be any problems when creating the GC handle for the object that triggered the AllocationTick event, considering that we currently can't get the size of the object in that callback?
I don't think there will be any issues with that.
I'm also happy to discuss adding a new API to ICorProfiler*, but then it would only be available in .net 6 or 7 and newer, depending on when it lands.
We think the native ICorProfiler API would be the better approach, as it would be simpler to consume. Preferably similar to Java, e.g., we can register an object for a callback when it's freed by the GC. Getting this added in a future .NET release would be great! We are happy to answer any questions regarding a possible ICorProfiler API.
After thinking about this for a while, I think it would make sense to add a general purpose GC handle API to ICorProfiler - profilers could allocate weak handles to track object lifetime like you want to do, but could also allocate strong handles to keep objects alive. It wouldn't give you a callback, but it would be more general purpose and provide benefit to more scenarios.
typedef enum
{
    COR_PRF_HANDLE_TYPE_STRONG,
    COR_PRF_HANDLE_TYPE_WEAK,
    COR_PRF_HANDLE_TYPE_PINNED
} COR_PRF_OBJECT_HANDLE_TYPE;

HRESULT AllocateHandle(
    [in] ObjectID objectID,
    [in] COR_PRF_OBJECT_HANDLE_TYPE handleType,
    [out] ObjectHandle *pObjectHandle);

HRESULT FreeHandle([in] ObjectHandle objectHandle);

HRESULT GetObjectFromHandle(
    [in] ObjectHandle handle,
    [out] ObjectID *pObjectID);
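To make the intended contract concrete, here is a toy in-process model of these three methods (the real implementation would live in the runtime; HRESULTs are modeled as plain ints and the handle-type parameter is omitted for brevity, both assumptions of this sketch):

```cpp
#include <cstdint>
#include <unordered_set>

// Mock of the proposed handle API's semantics: a handle is just a slot
// holding the (possibly GC-updated) ObjectID. 0 = success, -1 = failure.
using ObjectID = uintptr_t;
using ObjectHandle = ObjectID*;

static std::unordered_set<ObjectHandle> g_handles;  // outstanding handles

int AllocateHandle(ObjectID objectID, ObjectHandle* pObjectHandle) {
    if (objectID == 0 || pObjectHandle == nullptr) return -1;
    ObjectHandle h = new ObjectID(objectID);  // slot the "GC" could update
    g_handles.insert(h);
    *pObjectHandle = h;
    return 0;
}

int FreeHandle(ObjectHandle objectHandle) {
    if (g_handles.erase(objectHandle) == 0) return -1;  // unknown handle
    delete objectHandle;
    return 0;
}

int GetObjectFromHandle(ObjectHandle handle, ObjectID* pObjectID) {
    if (handle == nullptr || pObjectID == nullptr) return -1;
    *pObjectID = *handle;  // for a weak handle, 0 would mean the target died
    return 0;
}
```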
@davmason Thanks for the API proposal. The described API would be great for our use case.
How high is the expected performance impact of calling the GetObjectFromHandle method multiple times per GC run? We would call it for each of our tracked objects after each GC run until it is freed or the tracked objects are cleared (happens every minute). This could be problematic in scenarios with many GC runs.
Would a variant of the GetObjectFromHandle method that lets us get multiple objects at once be better from a performance perspective for this use-case?
GetObjectFromHandle would be very cheap. The implementation is going to be:
HRESULT GetObjectFromHandle(
    [in] ObjectHandle handle,
    [out] ObjectID *pObjectID)
{
    if (handle == NULL || pObjectID == NULL) return E_INVALIDARG;
    *pObjectID = (ObjectID)*(void**)handle;
    return S_OK;
}
Unless you are at the point where you are trying to micro-optimize at instruction level, the cost is negligible.
GetObjectFromHandle would be very cheap.
Thanks for the info about the GetObjectFromHandle implementation. Then the proposed API would be great for our use case.
Thanks for the confirmation @d-schneider.
I don't think I said this explicitly so far: there is about a month or two left to get features in for 6.0, and we are already completely booked on the diagnostics team. This feature would be scheduled for 7.0 at the earliest as it stands.
That being said, we always welcome PRs from the community and this is probably one of the easier ones to implement. If you or anyone on your team is feeling up for it I would be more than happy to guide you through the process of implementing it.
@d-schneider while reviewing my instrumentation change #55888, @noahfalk brought up something that I hadn't thought of and wanted to check with you. in my PR I made the alloc tick threshold configurable via a runtime config (which can also be set as an env var), but he pointed out that it may not produce desirable effect for you because a profiler wouldn't have the freedom to do this config on the user's behalf and you probably meant a profiling API for you to set this threshold instead? could you please confirm which is your preference?
I presume you still would like the object size as part of the alloc tick regardless, right? which the new version of the event provides.
@Maoni0 Thanks for implementing this! We would prefer a profiling API for this configuration. It is also important for us that we can adjust the allocation tick threshold with this API while the application is running.
Yes, we would still like the object size as part of the allocation tick event. The new AllocationTick_V4 event looks great in this regard!
@d-schneider thanks for confirming! that's the same as what @noahfalk told me. I've pulled out the runtime config and kept the new AllocationTick_V4 event in my PR. for adjusting the threshold with profiling API, the diagnostics team will handle that (@davmason @noahfalk). it shouldn't be hard to add it and allow it to change the threshold while the process is running.
Just a heads up that over in https://github.com/dotnet/runtime/pull/98167 I am starting to look into low overhead randomized heap sampling again.
We have a use-case for low-overhead heap profiling (production ready, continuously on) that is currently hard to achieve in .NET. I would like to start a conversation about what would be needed in .NET to achieve this and whether there is a way forward to get such support into future .NET versions.
Use-Case
We have a .NET profiler (for APM and CPU-profiling use-cases) and want to extend it with memory/allocation-profiling capabilities. We want to be able to tell our users what code leads to expensive allocations, in a production environment, continuously on, with low overhead.
What we would like to capture:
It shall have the following properties:
Here is an example of how such data could be visualized: https://www.dynatrace.com/support/help/how-to-use-dynatrace/transactions-and-services/analysis/memory-profiling/
Status Quo
We have researched multiple approaches, but none of them fully satisfied our requirements.
One approach is to use the ObjectAllocated, MovedReferences, SurvivedReferences and GarbageCollectionFinished profiler callbacks. However, this is not viable for production scenarios, since the performance overhead for just enabling these callbacks is extremely high (more than 100%).
Since .NET 5 we can also use the EventPipeEventDelivered profiler callback. There are the AllocationTick_V3, GCBulkMovedObjectRanges and GCBulkSurvivedObjectRanges event pipe events that provide similar data to the profiler callbacks mentioned above. The measured overhead for this was significantly lower (between ~1% and ~20%, depending on the number of allocations and GC runs; the 20% overhead was measured with a sample that allocates large arrays in a loop, while for more realistic applications the overhead is closer to ~2%).
Problems of the event pipe approach:
- The AllocationTick_V3 sampling rate is fixed at ~100KB (problematic for applications that allocate very low/high amounts of memory).
Array Size
Array size is critical for our use-case, as arrays can make up a significant portion of overall allocations. As mentioned in https://github.com/dotnet/runtime/issues/43345, it is not possible to obtain the size of allocated array objects in the callback of the AllocationTick_V3 event.
We can obtain the size in the GarbageCollectionStarted profiler callback with the ICorProfilerInfo::GetObjectSize method if we track the ObjectId. However, enabling this profiler callback increases the overhead significantly.
The GCStart event pipe event would have less overhead; however, it is not possible to reliably obtain the object size in that callback, since the ICorProfilerInfo::GetObjectSize method sometimes fails with a read access violation at:
coreclr.dll!Object::GetSize() Line 44
coreclr.dll!ProfToEEInterfaceImpl::GetObjectSize(unsigned __int64 objectId, unsigned long * pcSize) Line 1586
Comparable solutions
Since JDK 11, there are callbacks that provide the necessary information with minimal overhead; they match our use-case really well.
It is possible to monitor allocated objects with the SampledObjectAlloc callback (https://docs.oracle.com/en/java/javase/11/docs/specs/jvmti.html#SampledObjectAlloc). The sampling rate for this callback can be configured with the SetHeapSamplingInterval method.
Additionally, there is the ObjectFree callback that is sent when a tagged object is freed by the garbage collector (https://docs.oracle.com/en/java/javase/11/docs/specs/jvmti.html#ObjectFree).
A detailed description of this can be found at https://openjdk.java.net/jeps/331
Summary
Currently it looks like our use-case cannot be fulfilled in .NET. With this ticket, we're hoping to start a discussion about whether such a capability makes sense in a future .NET version. If this isn't the right place/format for such a discussion, please let us know :).
@discostu105 @d-schneider