Open AaronRobinsonMSFT opened 1 month ago
Tagging subscribers to this area: @dotnet/interop-contrib See info in area-owners.md if you want to be subscribed.
/cc @jonpryor @BrzVlad @steveisok @vitek-karas @simonrozsival @ivanpovazan @jkotas @jonathanpeppers
In addition to the computation and passing of the SCCs and CCRs, that are handled by this API, there are a few implementation details currently present on mono.
After the original rounds of discussions around design, the conclusion was that we will have the fetching of a WeakReference target wait for bridge processing and the finishing step of the bridge processing would clear the weak refs for bridge objects with dead java peers. It seems that the only way to achieve this would to fully iterate over the weakrefs and null them, every single time a ReferenceTrackingHandle is freed. I don't think we can do this when we get back via ReleaseMarkCrossReferenceResources
since we have no information about which objects were collected and which not.
Given this might be somewhat expensive, it raises the question on how relevant this nulling of weakrefs really is, considering that, even with the current design, resurrection could still happen and we can end up, in theory, with a C# object that had its java peer collected. I couldn't find a detailed reason for its existence, it seems to have been added in https://github.com/mono/mono/commit/572bd8b01721c4cf423ce1d66f96a195c54f983a
but the reference to what exact kind of bug it was fixing is lost. It would surely be cleaner if the WeakReference would only have knowledge of the C# side of things. The WeakReference would then be nulled at the following collection. Calling the Target method could bring the object back alive, but is it really a problem ? I would assume the object would have its own state indicating whether the java peer is collected or not anyway, doesn't seem like it needs to rely on the weak refs for this.
So as potential changes for this api:
ReleaseMarkCrossReferenceResources
to also receive the set of to be freed crossref handles so we can iterate over the weakrefs only once ?WaitForBridgeProcessing
api here ? And not have the WeakReference.Target block. The WaitForBridgeProcessing
barrier is necessary for reintroducing references. We must prevent any mutations on the C# side that would invalidate the compressed graph that was sent to the Java side until it is fully processed.
For a more visual explanation, please watch https://youtu.be/h57uvjMQMgU?t=534 which describes the same problem albeit with a slightly more eleborate (and less blocking) solution.
@AaronRobinsonMSFT: what should the markCrossReferences
callback do? This should be documented.
I had thought that it would be"mirror" MonoGCBridgeCallbacks::cross_references
callback, along with:
ComponentCrossReference
mirroring MonoGCBridgeXRef
: https://github.com/dotnet/runtime/blob/9bcd3e75a2984537470cb0b6f9bf043f48cc8307/src/native/public/mono/metadata/details/sgen-bridge-types.h#L32-L35StronglyConnectedComponent
mirroring MonoGCBridgeSCC
: https://github.com/dotnet/runtime/blob/9bcd3e75a2984537470cb0b6f9bf043f48cc8307/src/native/public/mono/metadata/details/sgen-bridge-types.h#L26-L30However, one field is missing in this world view: MonoGCBridgeSCC::is_alive
: https://github.com/dotnet/runtime/blob/9bcd3e75a2984537470cb0b6f9bf043f48cc8307/src/native/public/mono/metadata/details/sgen-bridge-types.h#L27
Thus, how does markCrossReferences
tell the GC "this object is dead/still alive"? Do we need to add a StronglyConnectedComponent.IsAlive
field?
partial struct StronglyConnectedComponent {
public int IsAlive; // 0 if dead, non-0 if alive
}
Naming: StronglyConnectedComponent.Context
should probably be StronglyConnectedComponent.Contexts
(plural), is it contains StronglyConnectedComponent.Count
Context values (plural).
@BrzVlad
If we still do the weakref clearing, would it make sense for ReleaseMarkCrossReferenceResources to also receive the set of to be freed crossref handles so we can iterate over the weakrefs only once ?
That seems reasonable to me. @jkotas and I also previously talked about another callback, but replacing the "Release" with a broader "Finished" that serves both for memory clean-up and passing along the freed handles works too.
If we would no longer do weakref clearing, would we need an explicit WaitForBridgeProcessing api here ? And not have the WeakReference.Target block.
An explicit WaitForBridgeProcessing API shouldn't be needed as we can handle this as an implementation detail where the aforementioned "Finished" API must be called to track the bridge is now done.
@jonpryor
This should be documented.
Yep, it will be.
Thus, how does markCrossReferences tell the GC "this object is dead/still alive"?
The markCrossReferences
callback's job is to do the actual freeing of the associated GCHandle
s in the JNI value map. See the proof-of-concept's map clean-up here and corresponding bridge implementation here.
@AaronRobinsonMSFT wrote:
The
markCrossReferences
callback's job is to do the actual freeing of the associatedGCHandle
s in the JNI value map. See the proof-of-concept's map clean-up here and corresponding bridge implementation here.
which doesn't actually quite make sense to me: the bridge impl has a JVMTriggerGC()
invocation, i.e. it contains JNI calls, after which it can check to see if the the Java GC collected the JNI weak ref.
https://github.com/dotnet/runtime/pull/114184 , meanwhile, has no Java System.gc()
invocations at all.
I thus don't fully understand how it works.
https://github.com/dotnet/runtime/pull/114184 , meanwhile, has no Java System.gc() invocations at all.
Right. Any calls to JVM APIs are the domain of the GC Bridge (2nd stage) and not needed in the dotnet/runtime repo. Only the SCC building (1st stage) is handled by runtime and that is passed to the 2nd stage which will do the JVM interaction.
@BrzVlad
If we still do the weakref clearing, would it make sense for ReleaseMarkCrossReferenceResources to also receive the set of to be freed crossref handles so we can iterate over the weakrefs only once ?
That seems reasonable to me. @jkotas and I also previously talked about another callback, but replacing the "Release" with a broader "Finished" that serves both for memory clean-up and passing along the freed handles works too.
It sounds like the workflow you're invisioning goes something like this:
1) GC builds (1st stage) the SCCs/XRefs and collects interesting WeakReference
instances.
2) Passes the SCCs/XRefs to the GC Bridge (2nd stage), which reifies the graph in the JVM and triggers the JVM GC.
3) The JVM GC completes and the 2nd stage now knows the dead objects.
4) The 2nd stage passes the keys to the dead objects to they can be looked up in the JNI value map in the Android for .NET runtime.
5) The Android for .NET runtime uses the keys to remove the GCHandle
s to the dead objects from some hashmap (perhaps, Dictionary<nint, GCHandle>
?).
6) The removed GCHandle
instances are collected and passed, along with the MarkCrossReferences*
, to some "Finished" API that can then free the GCHandle
and null out the interesting WeakReference
instances collected in (1).
Steps (2) through (6) would all occur under what is now considered the "bridge processing" lock.
An explicit WaitForBridgeProcessing API shouldn't be needed as we can handle this as an implementation detail where the aforementioned "Finished" API must be called to track the bridge is now done.
I'm not sure that's race free. We need to prevent any modification to the compressed SCC graph. That's currently ensured by calling WaitForBridgeProcessing
whenever a reference is reintroduced from Java VM into the .NET VM. markCrossReferences
is only called after SCC graph is constructed but is there any guarantee that while it is being constructed that no other thread can create a new reference that would modify the graph? (the prototype implementations don't call markCrossReferences
synchronously in the STW stage) There's a window where the Android side of the bridge doesn't know that a GC is in progress and it could potentially result in an inconsistency.
That's currently ensured by calling WaitForBridgeProcessing whenever a reference is reintroduced from Java VM into the .NET VM.
These are independent operations in this new model. GC bridge still needs to lock around their dictionary just as they currently do. The lock within is an implementation detail that avoids us having to expose an explicit begin/end callback mechanism. We internally call markCrossReferences
(begin) and we expect callers to call "Finish". They need a lock around their own collection while they are processing (2nd stage). It should naturally be broader than the internal runtime logic, but that is the domain of the interop layer and responsible for doing it correctly.
I'm not concerned about a race condition between markCrossReferences
and finishing the cross GC operations. I am concerned about a race condition between the start of GC where some of the bridge objects referenced by their respective GCHandle
appear dead (ie. not referenced by .NET objects) and the call to markCrossReferences
. What happens if another thread revives one of these references before markCrossReferences
callback is invoked?
Timeline:
markCrossReferences
, reference Y -> X is missing from the graph, X may accidentally get collected while Y is aliveWhat happens if another thread revives one of these references before markCrossReferences callback is invoked?
This is triggered by the GC during a gen2. All threads are suspended.
This is triggered by the GC during a gen2. All threads are suspended.
So that design is intentionally different from MonoVM which did NOT call the cross references callback in stop-the-world phase, correct? Was there any thought about possible interactions with background GCs? Just trying to make sure that this is specified, not necessarily questioning the intent...
It essentially implies that the callback cannot be implemented in managed code and that Android/Java.Interop is responsible for saving the information and offloading it to background processing (unless you block the .NET GC until the Java GC finishes along with all the threads in the stop-the-world stage).
@AaronRobinsonMSFT
If we would no longer do weakref clearing, would we need an explicit WaitForBridgeProcessing api here ? And not have the WeakReference.Target block.
An explicit WaitForBridgeProcessing API shouldn't be needed as we can handle this as an implementation detail where the aforementioned "Finished" API must be called to track the bridge is now done.
Are you suggesting here that .NET Android could do its own synchronization to prevent the issue that Filip was describing, without any help from the runtime ? For example, when markCrossReferences
is called during STW, they could set their own bridge in progress flag and have their own condition variable that is signaled after they call back into the runtime via ReleaseMarkCrossReferenceResources
?
- GC builds (1st stage) the SCCs/XRefs and collects interesting
WeakReference
instances.
I don't see why we would need to collect any WeakReference
instances here. Seems like we can handle this directly in step 6, exactly as you mentioned. We would have the required cross ref GCHandles here with dead java peers, we would iterate over all WeakReferences
and null every one of them associated with these mentioned cross ref handles. We could afterwards also free these cross ref handles. This new Finished
api receiving additionally a GCHandle*
array and its size sound good to me.
@jonpryor Any opinion on nulling of the WeakReference
GCHandles for C# objects once the bridge processing detects their java peer is dead. Do you know if .NET Android relies on this behavior in any way ? So for example, to better visualize the workflow
Thread1 | Thread2 (.NET Android worker)
|
GC1 |
StopWorld |
normal GC mark |
compute SCCs and xrefs |
markCrossReferences cb |
| Start doing java.GC
finish collection |
| .....
ResumeWorld | .....
..... | .....
..... |
WeakReference.Target WaitBridge |
| obtain set of cross ref handles with dead java peers
| ReleaseMarkCrossReferenceResources
| (This would null WeakRefs for bridge objects with java peer collected, if we preserve mono behavior)
| now gc bridge processing is finished
WeakReference.Target Done |
..... |
GC2 |
... |
The alternative would be that WeakReference
would have no insight into bridge objects and it would be nulled as part of GC2, when the object is seen as dead once again, this time without any java peer / crossref gchandle around to hold it alive. Also note that for this API there is no longer the requirement for Bridge objects to have finalizers, as it was the case on mono.
Any opinion on nulling of the
WeakReference
GCHandles for C# objects once the bridge processing detects their java peer is dead. Do you know if .NET Android relies on this behavior in any way ?
There's a problem in the example you've shown. Resolving the WeakReference
can be done from user code. If it doesn't do WaitForBridge
and getting null
ed then user code can get a reference to Java object while Thread2
is performing the Java collection. It can assign the WeakReference.Target
to some variable, once again creating an addition X -> Y edge that is not captured in the cross-references graph, and you will end up with object Y in "disposed" state for C# code that's perfectly valid.
Pseudo-code:
WeakReference<JavaHttpClient> cachedHttpClient;
... // GC1 start here
JavaHttpClient? localHttpClient = cachedHttpClient.Target;
if (localHttpClient is null) {
// Create new HTTP client and assign it to localHttpClient and cachedHttpClient
}
// Thread2 may have finished Java GC and disposed the JavaHttpClient instance
await localHttpClient.GetAsync(...);
@filipnavara Overall the summary of my point is that, due to resurrection, in theory it is possible to have C# objects with dead java peers, even now on mono. For example
class ObjWithFinalizer {
Bridge bridge;
~ObjWithFinalizer()
{
// Use bridge
}
}
ObjWithFinalizer obj = ...;
// obj dies
GC();
// bridge and obj are seen as dead by the GC.
// obj gets enqueued to have the finalizer run
// java peer of bridge object could end up collected
// Finalizer gets called at some point and bridge object is used with disposed java counterpart
So waiting and nulling WeakReferences
doesn't 100% prevent this issue, seems more like a best effort. I'm not saying we should give up on it, rather just making sure we have no other options and we want to keep doing this, before implementing it.
Overall the summary of my point is that, due to resurrection, in theory it is possible to have C# objects with dead java peers, even now on mono.
Definitely, but using finalizer and resurrecting objects is rather odd edge case where I expect all the hell to break loose.
Using WeakReference
s pointing to Java objects is extremely common in UI frameworks and applications and MonoVM supports this pattern.
@BrzVlad
For example, when markCrossReferences is called during STW, they could set their own bridge in progress flag and have their own condition variable that is signaled after they call back into the runtime via ReleaseMarkCrossReferenceResources ?
Yes. They will need to do that anyways and already do. The runtime isn't going to be able to lock their dictionary that maps from JNI handle to GCHandle
(prototype) - currently they do this. They will need their own indication that that map is "locked" while the 2nd stage is running. That handles the Android runtime side of things as it relates to resurrection. The runtime part of resurrection via WeakReference
/WeakReference<T>
needs the narrower "internal 2nd stage running" lock to indicate that certain weak references are impacted by the 2nd stage running. This of course is an optimization and the GC could just consider all weak references suspect for now, but a possible future optimization is determining which weak references are potentially interesting and only blocking on those if the 2nd stage is running.
So for example, to better visualize the workflow
This is exactly the workflow being proposed. We're on the same page.
It looks to me like the current proposed API shape doesn't cover enough surface to make WeakReference
do the WaitForBridgeProcessing
call. Previous discussion implies that the JavaMarshal
consumer is responsible for locking between markCrossReferences
and the end of processing (up to the point beyond ReleaseMarkCrossReferenceResources
call). The runtime would need to either duplicate the work and depend on ReleaseMarkCrossReferenceResources
to release its own "processing lock" or there needs to be a more explicit mechanism.
I'm still generally not convinced whether it should be the consumer of JavaMarshal
doing this synchronization and offloading of the processing to a background thread or whether the runtime should do it. If the runtime were to do it then we would need to add WaitForBridgeProcessing
to the API surface. The obvious benefit would be that the background thread can run JIT code so all of the processing can be done in managed code.
It looks to me like the current proposed API shape doesn't cover enough surface to make WeakReference do the WaitForBridgeProcessing call.
Correct. The current "Release" needs to be updated to be a "Finish" and pass along the GCHandle
s to free and null out any related weak references.
I'm still generally not convinced whether it should be the consumer of JavaMarshal doing this synchronization and offloading of the processing to a background thread or whether the runtime should do it.
It is going to be simpler for the bridge to handle the thread management or else this API becomes more complicated in an area we are explicitly not trying to innovate on. As previously mentioned if it occurs we need to innovate it won't just be the data structures but also the general API and semantics that will move as well.
The obvious benefit would be that the background thread can run JIT code so all of the processing can be done in managed code.
The background thread can be written in managed with this approach as well and it is something we could experiement with, but my guess is it will be less performant to do it in managed code. I say that because, looking at @BrzVlad's flow diagram, it would mean that while the 2nd stage was running subsequent GCs would impact it and delay getting the 2nd stage complete. The logic in the 2nd stage is relatively straight forward, but does interact with the JVM a fair bit. Pushing that logic into managed would also increase the number of interop calls substantially for very little benefit (JVM strong <-> weak refs).
it would mean that while the 2nd stage was running subsequent GCs would image it and delay getting the 2nd stage complete.
I guess that's fair point for not doing it in managed code. My argument was never about allowing a part of it in managed code, it was about allowing all of it in managed code which was tried (https://github.com/dotnet/java-interop/pull/1334) and failed for obvious reasons.
@BrzVlad Do you think with the updated API shape we can take this to API review? I'd like to get this reviewed sooner rather than later.
In my opinion the API looks good. To better visualize the usage/implementation, I have the prototype at https://github.com/BrzVlad/runtime/commits/feature-clr-gcbridge/ which appears to be working correctly, passing the local test that I added. The branch is a WIP, I haven't yet cleaned it up. The last two commits contain the waiting for bridge processing when checking the target of a WeakReference
and the clearing of the weakrefs in the bridge processing finish stage. It would be nice to take a quick look over these 2 commits to see whether it looks good from design perspective so I can start preparing my branch for integration into coreclr and have confirmation that this is the API we are going with. @AaronRobinsonMSFT @jkotas @Maoni0
@BrzVlad The overall flow/architecture look good. There are two issues that need to be addressed.
1) There is a GC hole in the call to FinishCrossReferenceProcessing
. We're passing a void*
to the byref contained within a Span<T>
. That should really be an ObjectOnStack
instance and extracted in a cooperative block.
2) The GCHandleInternalGetBridgeWait
will need a fast/slow split instead of a single FCall that switches to preemptive mode. The FCall should return immediately with "true" if it is free, otherwise it should return "false" and then call a QCall that does the actual work in preemptive mode.
Once the branch is in a PR, I can make a few changes on the VM side. Otherwise, this looks like what we've talked about.
void*
sLen
suffixes became Count
namespace System.Runtime.InteropServices.Java;
[SupportedOSPlatform("android")]
[StructLayout(LayoutKind.Sequential)]
public struct ComponentCrossReference
{
public nint SourceGroupIndex;
public nint DestinationGroupIndex;
}
[SupportedOSPlatform("android")]
[StructLayout(LayoutKind.Sequential)]
public unsafe struct StronglyConnectedComponent
{
public nuint Count;
public void** Contexts;
}
[SupportedOSPlatform("android")]
[StructLayout(LayoutKind.Sequential)]
public unsafe struct MarkCrossReferencesArgs
{
public nuint ComponentCount;
public StronglyConnectedComponent* Components;
public nuint CrossReferenceCount;
public ComponentCrossReference* CrossReferences;
}
[SupportedOSPlatform("android")]
partial static class JavaMarshal
{
public static unsafe void Initialize(delegate* unmanaged<MarkCrossReferencesArgs*, void> markCrossReferences);
public static unsafe GCHandle CreateCrossReferenceHandle(object obj, void* context);
public static unsafe void* GetContext(GCHandle obj);
public static unsafe void FinishCrossReferenceProcessing(
MarkCrossReferencesArgs* crossReferences,
ReadOnlySpan<GCHandle> unreachableObjectHandles);
}
IMO void*
should stay as nint
.
Changing context to void pointer is a narrowing change to the communicated semantics. For example, an implementation might really want to store an index, and now it needs to check how this void*
value is used. If the marshaling infra only ever considers it opaque data it should not be void typed (C# does not, and has no need to, mirror historical C-isms around void vs intptr_t).
Background and motivation
For interop with JVM based languages, an interaction model must be made to facilitate lifetime managements through GC coordination. This support is primary for Android for .NET, but there is no technical limitation to this support being enabled for non-Android scenarios.
See https://github.com/dotnet/runtime/pull/114184 and https://github.com/BrzVlad/runtime/tree/feature-clr-gcbridge
API Proposal
API Usage
Case example can be found at https://github.com/dotnet/runtime/pull/114184
Alternative Designs
Place under
System.Runtime.InteropServices.GCBridge
Do not limit to
SupportedOSPlatform("android")
Risks
Risk is missing some scenario/semantic in order to ensure Android for .NET is successful. This would be a new interop stack and be narrow in scope, following our Swift, ObjectiveC and WinRT interop solutions in .NET Core.