Implicit GC.KeepAlive in classes with finalizers

EgorBo commented 1 month ago

Normally, the fact that GC is precise has no visible impact on user's code in terms of correctness, however, there are certain scenarios where developer might be expected to keep an object alive by rooting it or using GC.KeepAlive and may hit rarely reproducible bugs otherwise. It typically happens with finalizers, for example:

public class MyClass
{
    public IntPtr Handle;

    public void DoWork()
    {
        // E.g. some pinvoke
        Native.DoSomeWork(Handle);
    }

    ~MyClass()
    {
        Native.DestroyHandle(Handle); // can be invoked before DoSomeWork finishes!
    }
}

// Execution example:
new MyClass().DoWork();

Here it's not obvious that DestroyHandle(Handle) can be legally invoked while DoSomeWork(Handle) is still being executed (or even not started yet!), it happens because of the precise GC (FullOpts/Tier1) that may mark "this" object as dead after Handle field is accessed - DoWork method doesn't even have to be inlined. The only way to prevent this behavior is to add GC.KeepAlive(this); after DoSomeWork(Handle);

Numerous people were not aware about this, IMO, not obvious behavior and suspected their code is subject to potential bugs (typically, "use after free"-like). Afair, @rickbrew had to insert a lot of such GC.KeepAlive(this); in Paint.NET application, also, @kekekeks suspects they might have potential bugs in Avalonia due to this behavior.

Here is the minimal repro: https://gist.github.com/EgorBo/f30e159c634f40b82b4edc78cf373466

Proposed solution

We can artificially extend life of this pointer in all instance methods of classes with finalizers by, effectively, emitting GC.KeepAlive(this) before all returns in instance methods to significantly reduce number of potential issues in users code at a small cost of making some objects live longer than needed. It's not supposed to be a silver bullet as there are still cases where GC.KeepAlive(obj) (or rooting an object) may still be required. It might be implemented on Roslyn side (so other runtimes with precise GC won't step on this too) or in JIT's liveness analysis.

cc @jkotas

jkotas commented 1 month ago

I would say that this is a job for interop generators to worry about. This problem only exists in manually written low-level unsafe interop. Manually written 100% correct low-level unsafe interop is hard. I do not think that this makes it significantly simpler.

at a small cost of making some objects live longer than needed

It is not the only cost. It also increases register pressure a bit, prevents tail call optimizations, ... .

jkotas commented 1 month ago

The only way to prevent this behavior is to add GC.KeepAlive(this); after DoSomeWork(Handle);

Note that this is not 100% reliable solution. You can still get into trouble because resurrection. It is why we have switched BCL from HandleRef that is an explicit scheme to automatically inserts KeepAlives to SafeHandles that are 100% robust.

jkotas commented 1 month ago

It is why we have switched BCL from HandleRef that is an explicit scheme to automatically inserts KeepAlives to SafeHandles that are 100% robust.

If we were to build this feature, we would not be able to switch BCL interop from SafeHandles to the implicit KeepAlives. It would not pass security review. It begs the question why to build this interop feature when it is not good enough for BCL?

rickbrew commented 1 month ago

Objects being finalized while methods are still executing on them is completely baffling behavior when you discover it, and it just feels like a bug (in .NET/C# itself). It's not a bug in the implementation, so I'd argue it's a bug in the design (IMO of course).

Sure it (mostly) only affects low-level interop code, but a lot of code bases need that sort of thing and it's not the exclusive domain of interop experts. Even for "interop experts" it's very easy to forget that you need to use GC.KeepAlive(this), it's easy to miss during code reviews, it won't reveal itself as a bug except via random low-firing crashes, and it's exceptionally difficult to debug. I'm not sure how it could be measured, but I bet there are a lot of low-firing random/mysterious/spooky crashes all over the ecosystem that are caused by this.

In Paint.NET I solved this structurally by using an "acquire self handle" pattern in my custom COM wrappers. I'm also using a custom source generator so I don't have to write each one of these "self handles" myself. This does introduce an implicit try/finally due to using, so it hurts performance a smidge. I'm able to toss in an IsDisposed check as well, adding an extra layer of runtime state validation, so it wasn't an entirely wasted effort even if GC.KeepAlive(this) were to become automatic.

If GC.KeepAlive(this) were implicit/automatic, then I suspect that my code size would shrink a little, and performance would improve ... also a little. Maybe not enough to measure.

[ComObjectRefImpl<ID2D1Mesh>]
internal unsafe partial class D2D1Mesh
    : D2D1Resource,
      IDeviceMesh
{
    public D2D1Mesh(ID2D1Mesh* pObject)
        : base((ID2D1Resource*)pObject)
    {
        // The pointer is stashed by the base-most `ComObjectRef` class and is only accessible via the "self handle"
        // This makes it difficult to accidentally not use the "self handle"
    }

    public ITessellationSink Open()
    {
        using var self = AcquireSelfHandle(); // <-- acquire

        using ComPtr<ID2D1TessellationSink> spSink = default;
        HRESULT hr = self.pD2D1Mesh->Open(spSink.GetAddressOf()); // <-- use
        hr.ThrowOnError();

        return ComObjectRef.CreateRef<ITessellationSink>(spSink).NotNull();
    }

    // vvvvvvv Generated code (some of it anyway) vvvvvvv
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    protected new SelfHandle AcquireSelfHandle()
    {
        return new SelfHandle(this);
    }

    protected new readonly ref struct SelfHandle
        //: IDisposable
    {
        private readonly D2D1Mesh self;

        // vvvvvvv This is where you get access to the native pointer
        public TerraFX.Interop.DirectX.ID2D1Mesh* pD2D1Mesh
        {
            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            get => (TerraFX.Interop.DirectX.ID2D1Mesh*)this.self.DangerousGetNativeObjectNoAddRef();
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public SelfHandle(D2D1Mesh self)
        {
            self.VerifyNotDisposed();
            this.self = self;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public void Dispose()
        {
            GC.KeepAlive(this.self); // I don't actually know if this accomplishes anything
        }
    }
}

AaronRobinsonMSFT commented 1 month ago

This is something every interop developer eventually hits. I'm not sure how we can fix this and instead I think the correct mitigation is to document in our https://learn.microsoft.com/dotnet/standard/native-interop/best-practices, but not attempt to hide from the user.

cshung commented 1 month ago

It appears that the best practice doc says nothing about finalizers, which is good because using finalizer is not really a good practice, it complicates things, a lot. Non-deterministic releases of resources , unnecessarily elongating the life time of objects, non determinism due to threading, etc. It just makes thing so much harder than doing the right thing upfront.

I wonder where does the idea of using finalizer to clean up comes from, we should probably stop teaching that. Maybe we should talk about preferring deterministic resource release through the IDisposablepattern over finalizer somewhere?

AaronRobinsonMSFT commented 1 month ago

It appears that the best practice doc says nothing about finalizers

Maybe we should talk about preferring deterministic resource release through the IDisposablepattern over finalizer somewhere?

Correct. Whenever the interop team is involved, finalizers are specifically discouraged for non-BCL related implementations. Instead, the IDisposable pattern is recommended.

This can be observed in the latest large interop API, ComWrappers. There are two modes that are intentionally distinct. The first is the non-deterministic finalizer approach that enables the most common scenario where users rely on the GC to handle clean-up. The second is the deterministic approach, using CreateObjectFlags.UniqueInstance, that allows usage of IDisposable.

Adding this perspective to the aformentioned best practices doc makes a lot of sense here.

hez2010 commented 1 month ago

The most important thing is that: you won't be able to reproduce it in Debug configuration as the lifetime will be extended to the end of method implicitly. This makes it really hard to diagnostic an issue caused by missing GC.KeepAlive as it's difficult to reproduce and locate a bug like this from a bug report in production, and you won't even be aware of such bugs during development. IMO we should at least make the behavior of Debug the same with Release: the lifetime end after the last use to this.

cshung commented 1 month ago

I think this is why people start using finalizer as the resource clean up mechanism in the first place.

The first is the non-deterministic finalizer approach that enables the most common scenario where users rely on the GC to handle clean-up.

If I knew nothing about what to do, I would skim through the document, read the first thing, you told me that is the most common way of doing thing, I will do just that.

Do we know that using finalizer to do resource cleanup is really common?
Does the complexity of the problem really calls for it?
Do they know the consequences of the non-determinism?

Change the wordings and I think it might do some good:

Make it the second approach
Call that a rare situation, where deterministic release of resource is either infeasible or too complicated, and
Call out that is an advanced approach, many gotchas (like this issue), only for advanced users.

Should I think of CreateObjectFlags.UniqueInstance similar to unique_ptr? Do we have a shared_ptr variant of that? I wonder if we can make releasing through finalizer the truly last resort.

Fundamentally, managing life time of native object is a pain for managed developer. It is really just type 1 pain (i.e. figure out the logic so that we can deterministically release it) or type 2 pain (non-determinism and race condition). Type 2 pain feels way more painful than type 1, IMO.

AaronRobinsonMSFT commented 1 month ago

If I knew nothing about what to do, I would skim through the document, read the first thing, you told me that is the most common way of doing thing, I will do just that.

Yes, that is the expectation. Letting the GC handle object lifetime should always be the preference. It costs more to implement, but is more performant and less prone to use-after-free scenarios and corruption. In effect, the entire point of garbage collectors.

Specifically for COM, there are a few niche cases where not relying on the GC is needed. The most common involve tooling for tracking memory usage of native COM components and when COM components hold onto resources, think of file locks. Both rely on predictable lifetimes so options are needed. All other cases should let the GC manage the lifetime of the interop projection, hence the default case is what we expect users to use. Manual management via IDisposable in COM source generators or Marshal.[Final]ReleaseComObject() for built-in COM are mechaisms for allowing users to handle the less common cases.

Encouraging users, by default, to try and manually handle lifetimes with the GC is not appropriate for the .NET ecosystem. That is why we've spent multiple developer years constructing abstractions to avoid that. When users need bespoke solutions they need to reference best-practices guides and realize there is going to be pain in developing those solutions. See the effort being made for Swift interop, again we are creating abstractions to avoid the lifetime complexity in the common cases, but are providing APIs when manual management is needed.

Should I think of CreateObjectFlags.UniqueInstance similar to unique_ptr?

Sort of. It is slightly different though because RAII semantics are difficult to avoid in C++, but not calling IDisposable is easy to miss.

Do we have a shared_ptr variant of that? I wonder if we can make releasing through finalizer the truly last resort.

That is the GC and COM AddRef / Release semantics in general. The finalizer is and should remain the best practice for COM, which is why the runtime team takes great pain to audit and maintain it - so users can rely upon it. This is also true for Objective-C interop and Swift will likely have some reliance on it.

lsoft commented 1 month ago

As a reserve way, can we write a Roslyn analyzer to track such use cases? The generator will not provide a 100% safety, but something is better than nothing.

public class MyClass
{
    public IntPtr Handle;

    public void DoWork()
    {
        // E.g. some pinvoke
        Native.DoSomeWork(Handle);
    }

    ~MyClass()
    {
        Native.DestroyHandle(Handle); // can be invoked before DoSomeWork finishes!
    }
}

hez2010 commented 1 month ago

IMO the way should be to align the behavior of the lifetime between debug and release so that developers can find the issue earlier in the debug build during development, instead of all going well during development but the issue suddenly appears after deploying the app to the production environment.

tfenise commented 1 month ago

No. In debug mode, lifetime of local variables should be extended to the whole method, so that when a breakpoint at the end of a method is encountered, this and other reference type local variables can still be inspected by the debugger.

CyrusNajmabadi commented 1 month ago

As a reserve way, can we write a Roslyn analyzer to track such use cases? The generator will not provide a 100% safety, but something is better than nothing.
public class MyClass
{
    public IntPtr Handle;

    public void DoWork()
    {
        // E.g. some pinvoke
        Native.DoSomeWork(Handle);
    }

    ~MyClass()
    {
        Native.DestroyHandle(Handle); // can be invoked before DoSomeWork finishes!
    }
}

Yup. You definitely could write an analyzer for this.

jakobbotsch commented 1 month ago

We can artificially extend life of this pointer in all instance methods of classes with finalizers by, effectively, emitting GC.KeepAlive(this) before all returns in instance methods to significantly reduce number of potential issues in users code at a small cost of making some objects live longer than needed. It's not supposed to be a silver bullet as there are still cases where GC.KeepAlive(obj) (or rooting an object) may still be required. It might be implemented on Roslyn side (so other runtimes with precise GC won't step on this too) or in JIT's liveness analysis.

It seems the JIT already knows how to do this due to generics or synchronized methods requiring it, so it probably wouldn't be too hard to check the asmdiffs impact of the change. Look for lvaKeepAliveAndReportThis.

EgorBo commented 1 month ago

The fix is indeed trivial on jit side, thanks to @jakobbotsch's suggestion

Jit-diffs: https://github.com/MihuBot/runtime-utils/issues/463, dominated by System.Data (also keep in mind that this is PMI mode where jit-utils instanciate all generic methods with a set of predefined types)

IMO the way should be to align the behavior of the lifetime between debug and release so that developers can find the issue earlier in the debug build during development, instead of all going well during development but the issue suddenly appears after deploying the app to the production environment.

I don't think it's possible, even in Release this specific case may reproduce only once a week since GC has to happen in an exact moment to reproduce it.

VSadov commented 1 month ago

My concerns with this feature would be:

once introduced, the guarantee can never be taken out for compat reasons.
the fact that this my be collected while instance method is executing can be observed in other ways, without finalizers. Any feature that can observe lifetime can see this. For example: assign this to a weak reference and later check if the reference is alive in an instance method. Sometimes you may see that the instance is gone.
If there will be a guarantee that weakref to this never goes dead in instance methods, why only for finalizable objects?

EgorBo commented 1 month ago

For example: assign this to a weak reference and later check if the reference is alive in an instance method.

But it's already pretty random, isn't? It depends on a codegen Tier, e.g. in Tier0 it will always be extended so I doubt anyone depends on this behavior

AaronRobinsonMSFT commented 1 month ago

From the interop perspective, I think this is best suited to be fixed with documentation. I would push back on this work if an interop scenario is solely being used as justification for it.

hez2010 commented 1 month ago

once introduced, the guarantee can never be taken out for compat reasons.

How about introducing a runtime knob for this? So that such behavior can be enabled manually in order to diagnostic issues caused by missing GC.KeepAlive, which can also be a quick workaround before fixing the real issue in the code.

VSadov commented 1 month ago

Premortem finalization is a pretty fragile feature and always has been. Writing a reliable finalizer could be hard for many reasons - it may be invoked earlier than expected, may never be invoked, may be invoked multiple times, bunch of objects that become F-reachable in one GC cycle can access or even resurrect each other, while finalizers run on a single thread, they run in random order so synchronization is tricky, etc...

Java has chosen an interesting way to deal with this - by deprecating finalizers in v9

EgorBo commented 1 month ago

Java has chosen an interesting way to deal with this - by deprecating finalizers in v9

Well, we didn't do the same🙂

The main motivation why I created this issue is that nobody among the people (mostly, seniors) I demonstrated the snippet knew the answer. Of course, it's possible to hit similar issues with e.g. multi-threads (although, there at least people normally do some synchronization) or mess up in the finalizer itself. Since the fix, presumably, is not too difficult (we just need to insert GT_KEEPALIVE node before all rets during IL import in JIT) we can at least save some nerves for some engineers. I am not against writing a good doc instead, I am surprised there were none. Although, the question is - is it discoverable by people who are not aware of this behavior and never hit any issues in Debug?

teo-tsirpanis commented 1 month ago

How about introducing a runtime knob for this? So that such behavior can be enabled manually in order to diagnostic issues caused by missing GC.KeepAlive

IIUC I think that to diagnose these issues you would need more aggressive garbage collection like GC stress, not less aggressive.

cshung commented 1 month ago

Java has chosen an interesting way to deal with this - by deprecating finalizers in v9

I tried to at least recommend moving away from finalizer above, that doesn't go very well.

EgorBo commented 1 month ago

Java has chosen an interesting way to deal with this - by deprecating finalizers in v9

I tried to at least recommend moving away from finalizer above, that doesn't go very well.

It has to be some official statement with a "Deprecated" warning/error issued by Roslyn on use, otherwise it's a normal/popular feature developers use all the time, not only for interop, but also for various things like closing connections/files etc.

In Java they also introduced Cleners API as a replacement and an addition to their "using" alternative, if I read it correctly

AaronRobinsonMSFT commented 1 month ago

I demonstrated the snippet knew the answer.

I think we should really come to terms with when this manifests. It is very uncommon that users writing entirely in managed code are impacted by this behavior. Vlad mentioned one instance involving a weakref, but I don't know how common that truely is.

The real case where this is common, indicated by this very issue, is interop. Jan mentioned it here, and I will say this again - manual interop is hard. Interop is by definition trying to reconcile two disparate, often incredibly complicated, systems. It isn't easy and causes no end to grief without deep knowledge of the two systems, even then it is very hard. Introducing minor "fixes" into the runtime to partially mitigate this friction is going to pessimize the rest of the system for a narrow case that will benefit few users. The vast majority of users rely on .NET built/owned interop solution. There are industrious users that go through the pain to create their own solutions and that is reasonable, but the pain is not unexpected nor something I think we should attempt to fix outside of improved documentation on best practices and recommended patterns.

jkotas commented 1 month ago

Java has chosen an interesting way to deal with this - by deprecating finalizers in v9

Java has phantom references that are a better replacement of finalizers. It allowed them to deprecate finalizers.

In order to deprecate finalizers, we would need to have a replacement story like phantom references.

EgorBo commented 1 month ago

As I said, I am fine with having a doc, I filed this issue to hear thoughs from both runtime folks and the community

manual interop is hard It isn't easy and causes no end to grief without deep knowledge of the two systems, even then it is very hard.

To be honest, I don't think it's a good strategy to rely on all Pinvoke users to be seniors with 10 years of experience and knowing all the little details about the GC and if you're not qualified you can't pinvoke into this Foo() api. Or close your Sql connection object in a finalizer.

PS: I wouldn't file this issue if finalizers were deprecated, but they're not

jkotas commented 1 month ago

Since the fix, presumably, is not too difficult (we just need to insert GT_KEEPALIVE node before all rets during IL import in JIT) we can at least save some nerves for some engineers

If you were to make this a dependable feature, you would also need to fix things throughout the tools. For example:

Code refactoring assumes that it is fine to transform instance method into a static method. It would be a breaking transformation with the proposed feature.
Roslyn is free to generate static methods for unspeakable methods as it sees a fit. Should it stop doing that for instance method that may have finalizers?

I don't think it's a good strategy to rely on all Pinvoke users to be seniors with 10 years of experience and knowing all the little details about the GC

Our interop documentation tells you to use safe handles instead of finalizers: https://learn.microsoft.com/en-us/dotnet/fundamentals/runtime-libraries/system-runtime-interopservices-safehandle . If you do not have the experience required to write manual interop, use safe handles.

AaronRobinsonMSFT commented 1 month ago

To be honest, I don't think it's a good strategy to rely on all Pinvoke users to be seniors with 10 years of experience and knowing all the little details about the GC and if you're not qualified you can't pinvoke into this Foo() api. Or close your Sql connection object in a finalizer.

I don't think this is a fair characterization of the recommendation. Calling a P/Invoke doesn't requiring knowing all this, but if solutions are architected with state living beyond the call, then yes, one can get themselves into a bad spot.

Finalizers are notoriously hard, see Eric Lippert's series from 9 years ago. I can't even count the number of times where we've intentionally suggested that people not write their own finalizer because it is so difficult and prone to errors. The suggestion is always use IDisposable to ensure "you get it right".

In this case, if IDisposable was used this issue wouldn't have occurred.

cshung commented 1 month ago

I am confused, this sounds contradictory to me.

I thought the stance was pro finalizer, let's GC deal with the lifetime:

Encouraging users, by default, to try and manually handle lifetimes with the GC is not appropriate for the .NET ecosystem.

But then, we tell people to not do that many times?

I can't even count the number of times where we've intentionally suggested that people not write their own finalizer

What are we really recommending here, am I missing something?

AaronRobinsonMSFT commented 1 month ago

@cshung See https://github.com/dotnet/runtime/issues/103522#issuecomment-2171011647

We're conflating two different ideas here. For any .NET provided solution, the most appropriate implementation will be taken. For COM, Objective-C, safe handles, that is finalizers. We provide these solutions and they use finalizers. Those solutions are designed to be transparent to the user and non-deterministic - treating all interop related objects like any other .NET objects and letting the GC clean them up. There are times though when we/they need more control so using IDisposable is appropriate as are APIs like Marshal.FinalReleaseComObject().

For official recommendations on interop not in-box, both approaches are valid and should be discussed. We will always recommend built-in solutions first, but if users have a desire to write bespoke solutions we will recommend using IDisposable, but also discuss using the finalizer. The former is clear and very robust, but the latter can provide a more streamlined experience and could also be appropriate. The finalizer approach also comes with significant costs that need to be consider. This guidance on official recommendations around interop and unmanaged resources is what we should add to the interop best practice guide.

cshung commented 1 month ago

We're conflating two different ideas here.

Thanks for the clarity, I think I've got it now. That aligns with what I thought it should be.

As long as:

Users are recommended to use IDisposable over finalizer whenever they roll their own interop, and
Users are warned on the complexity and gotchas if they want to go with the finalizer route for best caller experience.

then I think I am good with that.

We should recommend inbox solution first, and that's sounds great too. The fact that our inbox solution uses finalizer is really just an implementation details that is irrelevant for them, we don't need to mention too much about that if at all. That is what triggered my confusion in the first place (i.e. My reasoning goes as follow: MS inbox solution use finalizer, then that must be a great way to do things, that's just not accurate)

EgorBo commented 2 weeks ago

@MichalPetryka found a related case in a popular library libgit2sharp

It uses a base class with just void* ptr (native resource handle) and has a finalizer that calls virtual Free() that ends up calling native Free for e.g. ObjectHandle overload 🙂

AaronRobinsonMSFT commented 2 weeks ago

@MichalPetryka found a related case in a popular library libgit2sharp

It uses a base class with just void* ptr (native resource handle) and has a finalizer that calls virtual Free() that ends up calling native Free for e.g. ObjectHandle overload 🙂

Looking at the base type, Libgit2Object, this is a perfect candidate for deriving from SafeHandle as recommended.

AaronRobinsonMSFT commented 2 weeks ago

From the interop perspective, I think this can be closed. The documentation for interop best-practices has been updated with notes on SafeHandle and finalizers. We also discuss this very topic in Implement a Dispose method.

dotnet / runtime

Implicit GC.KeepAlive in classes with finalizers #103522

Proposed solution