Closed RamType0 closed 9 months ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
We need handsome,and performant API.
This doesn't really describe the motivation, just the desired solution. Where do you see this API being used? Do you anticipate those scenarios seeing a measurable perf benefit? That would help justify the work needed for this API.
Thanks!
I have used this API to handle exception from each invoked delegate.
Would syntax like the following work for you instead? I don't think there's a reliable way to create a one-element ROS<Delegate>
without allocating or introducing a new field to MulticastDelegate
, neither of which would be desirable.
SomeDelegate myDelegate = GetDelegate();
foreach (Delegate innerDelegate in myDelegate)
{
var castDelegate = (SomeDelegate)innerDelegate;
// use 'castDelegate' here
}
You should also spend some time prototyping this to make sure that the performance is what you expect. In your own application, make sure that the cost of this allocation really is measurable. You can use FieldInfo.GetValue
via reflection to access the private _invocationList
instance field for the purpose of testing.
Ah,I just mistaken that we need ref Delegate for CreateSpan,but not Delegate. But there are still available solution. alter type of MuticastDelegate.delegates to object,and just assign it self to it if it is not multicasting. Then, implementation would become
public ReadOnlySpan<Delegate> InvocationList => delegates == this ? MemoryMarshal.CreateSpan<Delegate>(ref delegates,1) : Unsafe.As<Delegate[]>( delegates);
just assign it self to it if it is not multicasting.
That's probably not a viable solution. There's a lot of code in the runtime - both managed and unmanaged - that has knowledge about the exact layout of delegate instances and what all the different instance fields represent. It would be very risky (and possibly also a negative perf hit) to update all of those call sites. That's not a good tradeoff for a niche API.
Aren't you confusing it with field of Delegate? I saw many comment which is similar to what you say in Delegate. But I have not seen such a comment in MulticastDelegate.
But I have not seen such a comment in MulticastDelegate.
See here for some examples. You can follow the call graphs backward and see that there's a decent amount of code which relies on this field containing a restricted set of possible values.
Oh,maybe I was seeing mono's one.
SomeDelegate myDelegate = GetDelegate(); foreach (Delegate innerDelegate in myDelegate) { var castDelegate = (SomeDelegate)innerDelegate; // use 'castDelegate' here }
Such syntax (as long as it is not allocating) would actually be very nice for when you'd have to invoke each delegate within a try catch block to ensure each gets called. So e.g.
SomeDelegate myDelegate = GetDelegate(); foreach (Delegate innerDelegate in myDelegate) { try { // invoke single delegate target ((SomeDelegate)innerDelegate).Invoke() } catch (Exception exception) { //Do something with that exception } }
As of right now we (Example) just call GetInvocationList() which always allocates. (It shows up already in measurements, although not yet high up in the list. Still working on reducing others first)
I have just created benchmark project. In this project,we have these variants of GetInvocationlist.
CoreCLRGetInvocationList Just a copy from current CoreCLR's implementation.
MyGetInvocationList My optimized implementation of GetInvocationList without altering or adding any API.
UnsafeGetInvocationList My optimized allocation less new API. This is unsafe because the result will become wrong if we access the result after altering the delegate. But if we use this correctly, it is very fast and safe.
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.508 (2004/?/20H1)
Intel Core i9-9900K CPU 3.60GHz (Coffee Lake), 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.100
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
Method | N | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|
CoreCLRGetInvocationList | 1 | 10.774 ns | 0.1214 ns | 0.1077 ns | 0.0038 | - | - | 32 B |
MyGetInvocationList | 1 | 9.302 ns | 0.1656 ns | 0.1549 ns | 0.0038 | - | - | 32 B |
UnsafeGetInvocationList | 1 | 2.917 ns | 0.0370 ns | 0.0346 ns | - | - | - | - |
CoreCLRGetInvocationList | 10 | 82.291 ns | 0.7804 ns | 0.7299 ns | 0.0124 | - | - | 104 B |
MyGetInvocationList | 10 | 23.457 ns | 0.4810 ns | 0.4499 ns | 0.0124 | - | - | 104 B |
UnsafeGetInvocationList | 10 | 7.849 ns | 0.0776 ns | 0.0726 ns | - | - | - | - |
CoreCLRGetInvocationList | 100 | 732.493 ns | 10.2674 ns | 9.6041 ns | 0.0982 | - | - | 824 B |
MyGetInvocationList | 100 | 154.543 ns | 3.0153 ns | 4.2270 ns | 0.0985 | - | - | 824 B |
UnsafeGetInvocationList | 100 | 53.075 ns | 0.5351 ns | 0.4743 ns | - | - | - | - |
CoreCLRGetInvocationList | 1000 | 7,922.301 ns | 100.5212 ns | 89.1095 ns | 0.9460 | - | - | 8024 B |
MyGetInvocationList | 1000 | 1,405.915 ns | 9.4829 ns | 8.8703 ns | 0.9575 | - | - | 8024 B |
UnsafeGetInvocationList | 1000 | 676.750 ns | 6.3342 ns | 5.6151 ns | - | - | - | - |
@RamType0 The benchmark code makes incorrect assumptions about the layout of that field, and it makes incorrect assumptions about how the C# language treats reachability of objects passed as in parameters. This could lead to reliability problems at runtime.
I'll re-up my comment from https://github.com/dotnet/runtime/issues/41849#issuecomment-686898306. We could probably add an enumerator to allow this scenario if it's really that important, but creating a ROS<T>
is likely a non-starter. For now, a workaround could be to call GetInvocationList
once and cache the result.
@RamType0 The benchmark code makes incorrect assumptions about the layout of that field, and it makes incorrect assumptions about how the C# language treats reachability of objects passed as in parameters. This could lead to reliability problems at runtime.
This caused just by MemoryMarshal.CreateReadOnlySpan. Actually, C# compiler correctly handles this.
public static ref readonly T UnsafeGetInvocationListByRefAndCount<T>(in T d,out int count)
where T : MyMulticastDelegate
{
var _invocationList = d._invocationList;
if (_invocationList != null)
{
var invocationList = Unsafe.As<object[]>(_invocationList);
count = (int)d._invocationCount;
return ref Unsafe.As<object, T>(ref MemoryMarshal.GetArrayDataReference(invocationList));
}
else
{
count = 1;
return ref d;
}
}
So, it is not a UnsafeGetInvocationList specific problem. And also, this problem has been forgiven for MemoryMarshal.CreateReadOnlySpan, which is already approved public API.
I'll re-up my comment from #41849 (comment). We could probably add an enumerator to allow this scenario if it's really that important, but creating a
ROS<T>
is likely a non-starter. For now, a workaround could be to callGetInvocationList
once and cache the result.
It is better than current API.
But we have thought that List
I'll re-up my comment from #41849 (comment). We could probably add an enumerator to allow this scenario if it's really that important, but creating a
ROS<T>
is likely a non-starter. For now, a workaround could be to callGetInvocationList
once and cache the result.
In which use case (where you'd use this API) would a ROS
@GrabYourPitchforks In your example you've set the type of the loop variable to Delegate
, I assume it wouldn't be possible and/or safe to type that to the actual delegate type (e.g. foreach (Action<int> innerDelegate in myDelegate)
?
@bollhals If it's an instance method, the type would be Delegate
, and the caller would need to cast. (These casts should be very cheap.) If it's an extension method, it can probably be typed appropriately.
But we have thought that List.Enumerator is slow
Exactly what is slow about it?
But we have thought that List.Enumerator is slow
Exactly what is slow about it?
using foreach
for List<T>
is slower than using foreach
for T[]
.
Then, we have finally created CollectionsMarshal.AsSpan.
This unsafe API was created just for performance.
@GrabYourPitchforks BTW, MyGetInvocationList is seems to be better than CoreCLRGetInvocationList without any API changes. If I created PR for it, is it beneficial?
If im not mistaken real reason was performance of list with large structs. Eg imagine a list with million Matrix4x4 structs. Each such matrix holds 16 floats making it pretty bulky and each such struct had to be copied every time a value was retrieved via foreach in our case meaning potentially 1 milion copies of relatively large struct. This can definitely tank performance. Class based cases are largely unaffected unless u wanted to swap references itself during iteration
If im not mistaken real reason was performance of list with large structs. Eg imagine a list with million Matrix4x4 structs. Each such matrix holds 16 floats making it pretty bulky and each such struct had to be copied every time a value was retrieved via foreach in our case meaning potentially 1 milion copies of relatively large struct. This can definitely tank performance. Class based cases are largely unaffected unless u wanted to swap references itself during iteration
At least, the posted benchmark uses int for element, which is smaller than "reference".
using
foreach
forList<T>
is slower than usingforeach
forT[]
.
Being skeptical of a post made more than 10 years ago, I went and did a proper benchmark.
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-10610U CPU 1.80GHz, 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100
[Host] : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
.NET Core 3.1 : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
.NET Core 5.0 : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
Method | Job | Runtime | Mean | Error | StdDev | Median |
---|---|---|---|---|---|---|
LargeStructList | .NET Core 3.1 | .NET Core 3.1 | 78.698 ns | 0.8484 ns | 0.7084 ns | 78.545 ns |
LargeStructArray | .NET Core 3.1 | .NET Core 3.1 | 28.893 ns | 0.2721 ns | 0.2546 ns | 28.873 ns |
ClassList | .NET Core 3.1 | .NET Core 3.1 | 73.004 ns | 1.4792 ns | 3.0217 ns | 74.361 ns |
ClassArray | .NET Core 3.1 | .NET Core 3.1 | 8.114 ns | 0.0707 ns | 0.0661 ns | 8.104 ns |
IntList | .NET Core 3.1 | .NET Core 3.1 | 42.759 ns | 0.2340 ns | 0.1954 ns | 42.711 ns |
IntArray | .NET Core 3.1 | .NET Core 3.1 | 7.497 ns | 0.1566 ns | 0.1465 ns | 7.479 ns |
LargeStructList | .NET Core 5.0 | .NET Core 5.0 | 70.169 ns | 0.9476 ns | 0.8400 ns | 69.955 ns |
LargeStructArray | .NET Core 5.0 | .NET Core 5.0 | 24.355 ns | 0.5135 ns | 0.9893 ns | 24.142 ns |
ClassList | .NET Core 5.0 | .NET Core 5.0 | 67.844 ns | 0.6781 ns | 0.6343 ns | 67.873 ns |
ClassArray | .NET Core 5.0 | .NET Core 5.0 | 6.542 ns | 0.0661 ns | 0.0586 ns | 6.543 ns |
IntList | .NET Core 5.0 | .NET Core 5.0 | 44.817 ns | 0.7760 ns | 0.7258 ns | 44.625 ns |
IntArray | .NET Core 5.0 | .NET Core 5.0 | 7.864 ns | 0.2957 ns | 0.8718 ns | 7.717 ns |
So I concede that lists are slower than a raw array, however I also don't think this difference is the end of the world, provided you're not literally on fire on a burning hot path. Now I'm very curious on where this difference comes from, since I was assuming it could get much closer to array performance than this.
Now I'm very curious on where this difference comes from
foreach on an array is desugared by the C# compiler into simple for loop. foreach on a list needs to go through its struct Enumerator, does version checks to make sure the list hasn't changed between iterations, etc.
@Joe4evr Thanks for posting the benchmark! I couldn't find benchmark in recent environment.
So I concede that lists are slower than a raw array, however I also don't think this difference is the end of the world, provided you're not literally on fire on a burning hot path. Now I'm very curious on where this difference comes from, since I was assuming it could get much closer to array performance than this.
"foreach" for span or array is converted to "for". And also, the generated pattern of "for" eliminates bounds check to throwing IndexOutOfRangeException. Both of these optimization is specific for array or span.
It is also seems that IntList is significant faster than ClassList. Maybe this is because of array's covariant check.
The CollectionsMarshal class was added to allow efficient bulk operations over the list, such as bulk moving data or bulk modification of data. It wasn't added because of any perceived deficiency in List's enumerator.
We're getting very off-track here. If the request is specifically for projection of a MulticastDelegate as a ROS over its inner targets, then that is not feasible (as previously mentioned) and this issue should be closed. This issue provided several alternative designs, but if we don't have a consumer for those alternative APIs then this work will never rise above the cut line for any release.
We're getting very off-track here. If the request is specifically for projection of a MulticastDelegate as a ROS over its inner targets, then that is not feasible (as previously mentioned) and this issue should be closed. This issue provided several alternative designs, but if we don't have a consumer for those alternative APIs then this work will never rise above the cut line for any release.
For me I think the main goal should be that there is a simple and easily accessable way (that has zero allocation) to iterate over the individual delegates. Ideally this improves performance but this should not be performance critical.
Something that was asked earlier and never answered: why is it infeasible to call GetInvocationList
(taking the allocation hit) once, then cache the result? That would solve the immediate problem, and it doesn't require any new API.
Something that was asked earlier and never answered: why is it infeasible to call
GetInvocationList
(taking the allocation hit) once, then cache the result? That would solve the immediate problem, and it doesn't require any new API.
In scenario with concurrency and multiple delegates, we need 'ConcurrentDictionary<Delegate,Delegate[]?> ' for caching it. ConcurrentDictionary allocates a lot. And also, it is inconvinient.
How about TryGetInvocationList? It only returns invocation list when it was multicasted.
How about TryGetInvocationList?
I don't think a Try method is appropriate here. If there's new API being added, it should work in all cases.
In scenario with concurrency and multiple delegates, we need
ConcurrentDictionary<Delegate,Delegate[]?>
for caching it.
No you don't. The typical scenario for extracting the invocation list is when implementing an event handler. You control the implementation of the add and remove events, which means you can do something like (pseudocode):
private object _lockObj = new object();
private MyEventHandler _myEvent;
private MyEventHander[] _invocationList;
public event MyEventHandler MyEvent
{
add { lock(_lockObj) { _myEvent += value; _invocationList = _myEvent.GetInvocationList(); } }
remove { lock(_lockObj) { _myEvent -= value; _invocationList = _myEvent.GetInvocationList(); } }
}
private void FireEvents()
{
MyEventHandler[] invocationList;
lock (_lockObj) { invocationList = _invocationList; }
foreach (var handler in invocationList)
{
// do something
}
}
The only time you ever have to update the cached invocation list is when somebody adds or removes a handler from the event. In all other cases, especially where you take a method that takes an opaque delegate as a parameter, it's generally not appropriate to crack open the delegate and retrieve its invocation list. In those scenarios, it's generally more appropriate to treat the delegate as a single opaque Invoke method and to assume that the caller constructed it using whatever wrapping was appropriate for their own scenario.
For async events (i.e. returning Task
/ValueTask
) to work, iterating the invocation list and invoking/awaiting them individually while collecting exceptions is necessary. There's a good example of this in Microsoft.VisualStudio.Threading:
Having a way to enumerate the handlers without allocating is pretty important for cases like this. While something like https://github.com/dotnet/runtime/issues/41849#issuecomment-821870977 will work, it's not ergonomic to say the least. The idea in https://github.com/dotnet/runtime/issues/41849#issuecomment-686898306 seems completely sensible to me to fulfill this need.
How about TryGetInvocationList?
I don't think a Try method is appropriate here. If there's new API being added, it should work in all cases.
In scenario with concurrency and multiple delegates, we need
ConcurrentDictionary<Delegate,Delegate[]?>
for caching it.No you don't. The typical scenario for extracting the invocation list is when implementing an event handler. You control the implementation of the add and remove events, which means you can do something like (pseudocode):
private object _lockObj = new object(); private MyEventHandler _myEvent; private MyEventHander[] _invocationList; public event MyEventHandler MyEvent { add { lock(_lockObj) { _myEvent += value; _invocationList = _myEvent.GetInvocationList(); } } remove { lock(_lockObj) { _myEvent -= value; _invocationList = _myEvent.GetInvocationList(); } } } private void FireEvents() { MyEventHandler[] invocationList; lock (_lockObj) { invocationList = _invocationList; } foreach (var handler in invocationList) { // do something } }
The only time you ever have to update the cached invocation list is when somebody adds or removes a handler from the event. In all other cases, especially where you take a method that takes an opaque delegate as a parameter, it's generally not appropriate to crack open the delegate and retrieve its invocation list. In those scenarios, it's generally more appropriate to treat the delegate as a single opaque Invoke method and to assume that the caller constructed it using whatever wrapping was appropriate for their own scenario.
Does this work for nested multicast delegate?
Also, to achieve my motivation(handle exception from each invoked delegate), we must call get invocation list recursively. And we need to stop recursively doing this when we reached non-multicasting delegate. So I think Try is more suitable.
We could probably add an enumerator to allow this scenario if it's really that important, but creating a ROS
is likely a non-starter.
+1 . This needs to be an enumerator. Returning ROS
@eerhardt @GrabYourPitchforks @stephentoub I have update the API proposal at the top to enumerator. Thoughts about the API shape?
I want variant for how it handles nested multicast delegate.(Automatically unrolls it or not)
What do you mean by "nested multicast delegate"?
What do you mean by "nested multicast delegate"?
The multicast delegate which is made from multiple multicast delegates.
Can you give a concrete example of what you mean? Multicast delegates should already be 'flattened' by virtue of how they work.
Looking at the StackExchange.Redis code that is trying to do this, I see 2 categories of usage:
I have update the API proposal at the top to enumerator. Thoughts about the API shape?
My initial questions are:
static
method? Why not an instance method?foreach
? Possibly something very similar to how Pipelines.Sockets.Unofficial
defined its extension method (only just an normal instance method and not an extension method)?
https://github.com/mgravell/Pipelines.Sockets.Unofficial/blob/6703efc310131037ee55b3136ff0d2d9e0df1e3f/src/Pipelines.Sockets.Unofficial/Delegates.cs#L24-L28 Can it be updated to have an Enumerable and an Enumerator? That way it can be used in a foreach?
StringBuilder.GetChunks returns an "enumerator" that also meets the foreach pattern, as it exposes a GetEnumerator method that just returns itself. We could do the same here.
Why is it a static method? Why not an instance method?
If it was an instance method, it would need to be Delegate and callers would need to cast. I think we want to the API to be strongly typed to avoid the cast. I have changed it to be static extension method. Does it look better?
Can it be updated to have an Enumerable and an Enumerator?
Done.
Would suggest EnumerateInvocationList
or similar instead of GetInvocationListAsEnumerable
. See string.EnumerateRunes()
as an example of a similar pattern:
Would suggest EnumerateInvocationList or similar instead of GetInvocationListAsEnumerable. See string.EnumerateRunes() as an example of a similar pattern:
Updated. Thank you!
The API shape looks good to me. Thanks for updating it @jkotas.
Anyone object to marking this "ready for review" and in the 9.0 milestone?
Would changing IsSingleInvocation
to an InvocationCount
make sense here?
Would changing
IsSingleInvocation
to anInvocationCount
make sense here?
I don't think the count is stored currently? It probably would need to iterate the enumerator to calculate the count, the check for single/multicast delegates is the common case checked for and should be cheap
Invocation list is stored as array. So to get invocation count, just return length of the array or return 1 if it is single invocation. (If this is true: https://github.com/dotnet/runtime/issues/41849#issuecomment-1826404517)
So to get invocation count, just return length of the array
The array can have extra space at the end.
Current delegate multicast implementations have the count stored in one of the other delegate fields, so it is still possible to implement without iterating over the whole list. InvocationCount is slightly less efficient than IsSingleInvocation for the common pattern:
class Delegate
{
public bool IsSingleInvocation => m_helperObject is not Delegate[];
public int InvocationCount => (m_helperObject is Delegate[]) ? (int)m_extraFunctionPointerOrData : 1;
}
...
void Example1(Action myDelegate)
{
if (myDelegate.IsSingleInvocation)
{
// ... path optimized for single-cast delegates ...
}
else
{
// ... general path using enumerator ...
}
}
void Example2(Action myDelegate)
{
if (myDelegate.InvocationCount == 1)
{
// ... path optimized for single-cast delegates ...
}
else
{
// ... general path using enumerator ...
}
}
Maybe we should have both methods? Similar to how we have Span.IsEmpty
and Span.Length
.
Maybe we should have both methods?
If both are exposed, would a method that'd copy the delegates into a provided span make sense too?
If both are exposed, would a method that'd copy the delegates into a provided span make sense too?
We can always add it later if somebody comes up with a good use case for it. The proposal covers the motivating scenarios as is.
IsSingleInvocation
and InvocationCount
aren't related to the invocation list and thus shouldn't be expose
GetInvocationList().Length
that would be fineDelegate
is more desireable to not add one more type to the System
namespace. Also, we don't think we're likely adding more APIs to DelegateExtensions
.IEnumerable<T>
to make the enumerator more usefulGetInvocationList()
is quite high(apisof.net, grep.app)
EnumerateInvocationList()
namespace System;
public class Delegate
{
// Existing API
// public virtual System.Delegate[] GetInvocationList();
public static InvocationListEnumerator<TDelegate> EnumerateInvocationList<TDelegate>(TDelegate d) where TDelegate : Delegate;
public struct InvocationListEnumerator<TDelegate> where TDelegate : Delegate, IEnumerable<TDelegate>, IEnumerator<TDelegate>
{
public TDelegate Current { get; }
public bool MoveNext();
public InvocationListEnumerator<TDelegate> GetEnumerator() => this;
}
}
Cursory look through grep.app seems to suggest that many users would be happy with
EnumerateInvocationList()
Looking through existing users won't show you the potential of a fast path that just checks for single/multicast delegates distinction, since the currently existing API always requires you to allocate the array there is no point of building a fast path so nobody does.
All usages I ever have written could have used a fast path and not calling GetInvocationList() at all, just use the delegate you get passed directly. Calling GetInvocationList() is only required to be correct for edge cases where users pass multicast delegates which is a tiny amount of callers in my experience.
Given the enumerator is a struct and doesn't allocate I probably can write a check for a fast path on top of it but it feels like a lost opportunity to not give a public API for checking for the fast path.
Background and Motivation
Currently,we have an API to invocation list of MulticastDelegate,it is MulticastDelegate.GetInvocationList().
It allocates array every time we call it. And maybe that's why we have internal HasSingleTarget property in it.
We need handsome,and performant API.
Proposed API
Original rejected proposal