dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.29k stars 4.74k forks source link

Breaking change in AsyncTaskMethodBuilder.Task #26312

Closed gfraiteur closed 4 years ago

gfraiteur commented 6 years ago

Before .NET Core 2.1, AsyncTaskMethodBuilder.Task used to always return the same Task. Starting with .NET Core 2.1, it returns a different value depending on the state of the state machine. (PostSharp is currently relying on this behavior.)

Is that expected? Could you document which values are returned in which circumstances?

gfraiteur commented 6 years ago

Attaching a repro. Extract the zip file and execute e97d0f14c66440ee775be26c1f9a34cf.dll as a console app. netcoreapp2.0.zip

stephentoub commented 6 years ago

Can you share the source for the repro? And, actually, a test that doesn't use PostSharp? If there's a breaking change here, it'd be helpful to see the most minimal repro possible, written just against the APIs and without extra tooling required.

it returns a different value depending on the state of the state machine

Can you elaborate on this? You mean you access ATMB.Task on the same instance multiple times, and it returns a different Task instance each time? That is definitely not expected. Or do you mean that you might get back different Task-derived types from different ATMB instances depending on circumstances? If so, that is expected.

stephentoub commented 6 years ago

Could you document which values are returned in which circumstances?

No, I do not think that's a good idea. From the perspective of a consumer of the API, it gets back a Task; whether it's some special derived implementation is an implementation detail that we should be free to change and optimize in the future. The changes made for 2.1 were exactly because they yielded significant performance and diagnostic benefits, and if we'd previously documented exactly the concrete class returned, that wouldn't be possible. In general in .NET, it is not considered a breaking change to return a more derived type than the one explicitly specified as the return type, just as it's not considered breaking to return different implementations of an interface from a method declared to return that interface.

gfraiteur commented 6 years ago

Thank you for your reply.

Can you share the source for the repro?

Done. But you won't able to build it without a PostSharp build that is still unpublished. OnMethodBoundary_CurrentTask.zip

Can you share ... a test that does not use PostSharp ?

No, I don't think it is possible to access the current state machine class in C# without PostSharp. We inject code that exposes ATMB.Task to the aspect and this aspect does pretty similar things than your causality tracker (we're doing async logging "the right way", with causality).

Can you elaborate on ... it returns a different value depending on the state of the state machine

Here is the result of the test on .NET Core 2.0 (when you run the compiled attach file):

dotnet --fx-version 2.0.0 C:\src\postsharp\Private\Test\SmallTests\bin\Aspects\SlimAdvices\OnMethodBoundary_CurrentTask\CS\NetCore\Debug\netcoreapp2.0\e97d0f14c66440ee775be26c1f9a34cf.dll
  OnEntry Task = System.Threading.Tasks.Task`1[System.Threading.Tasks.VoidTaskResult]
  OnSuspend Task = System.Threading.Tasks.Task`1[System.Threading.Tasks.VoidTaskResult]
  OnResume Task = System.Threading.Tasks.Task`1[System.Threading.Tasks.VoidTaskResult]
  OnSuspend Task = System.Threading.Tasks.Task`1[System.Threading.Tasks.VoidTaskResult]
  OnResume Task = System.Threading.Tasks.Task`1[System.Threading.Tasks.VoidTaskResult]
  OnExit Task = System.Threading.Tasks.Task`1[System.Threading.Tasks.VoidTaskResult]
  Passed.

And here is the result on .NET Core 2.1:

  C:\src\postsharp\Build\Testing\..\..\Private\Test\BuildTestRunner\bin\Debug\BuildTestRunner.exe dotnet --fx-version 2.1.0-rc1 C:\src\postsharp\Private\Test\SmallTests\bin\Aspects\SlimAdvices\OnMethodBoundary_CurrentTask\CS\NetCore\Debug\netcoreapp2.0\e97d0f14c66440ee775be26c1f9a34cf.dll
  OnEntry Task = System.Threading.Tasks.Task`1[System.Threading.Tasks.VoidTaskResult]
  OnSuspend Task = System.Threading.Tasks.Task`1[System.Threading.Tasks.VoidTaskResult]
  OnResume Task = System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[System.Threading.Tasks.VoidTaskResult,BuildUnitTests.Aspects.SlimAdvices.OnMethodBoundary_CurrentTask.Program+<AsyncMethod>d__1]

Our test shows that, on .NET Core 2.1, when we call ATMB.Task before the first await, we get a simple Task (presumably the object returned by InitializeTaskAsPromise). But when we call ATMB.Task after the first await, we get an AsyncStateMachineBox.

No, I do not think that's a good idea.

I understand. We need to know if ATMB.Task is supposed to be idempotent. If yes, the behavior in 2.1 is a bug and we will wait for a fix. If not, we will implement a workaround. Note that ObjectIdForDebugger already implements a workaround and we would be able to use it if it were not internal.

stephentoub commented 6 years ago

when we call ATMB.Task before the first await

Ah, you're code rewriting the compiler-generated code to access the Task property before the first await? That is not a supported scheme / access pattern, and I would not expect the ATMB implementation to change to accommodate it.

Note that ObjectIdForDebugger already implements a workaround and we would be able to use it if it were not internal.

ObjectIdForDebugger implements a workaround to accommodate the case where the debugger is poking at things; debuggers often access private state in an unsupported fashion, and if they break, they break. But note that this workaround makes the implementation more expensive at runtime, and that's considered ok, because this would only manifest if you were interacting with these objects with the debugger, at which point perf doesn't matter.

The issue is that while ATMB is meant for use by the compiler-generated pattern (and isn't really meant for use other than that, which is why it's in the compiler namespace), some perf-sensitive code does use ATMB directly rather than using TaskCompletionSource, in order to have a struct-based source to avoid an extra allocation. In the compiler-generated use, we can lazily initialize the backing field to be the single object that's not only the Task but also contains the async method state machine, contains the execution context, caches the Action delegate, etc. But that also means this data structure is much bigger than a normal task, and so if you're just using ATMB directly in code, all of that extra size is unnecessary and useless cost. So if you access the Task property directly, we continue to just manufacture a normal Task.

If not, we will implement a workaround.

If you can, that's the best the approach.

We could explore alternatives to try to make this more bulletproof, but off the top of my head I don't have any great solutions, and if we did implement something, I don't think they would be backported to 2.1 unless this became a huge issue. I can think of two possible approaches to us "fixing" this:

  1. When Task is accessed and the field hasn't been initialized, do the same thing that ObjectIdForDebugger does, and instantiate the AsyncStateMachineBox type. The problem with that is that ATMB when directly used would then be using a much larger object than is necessary, carrying around unnecessary and useless state, precisely in the scenario where the developer was trying to streamline. This is why the code goes out of its way to use the base task in such use cases, and I'd be extremely hesitant to change that.
  2. When this pattern is detected (the task has already been initialized and then the first await happens), create a new AsyncStateMachineBox then. At that point we would then need to hook up a continuation from the box that would complete the previously handed out task. This would require more thought and experimentation though, and I fear it would negatively impact the 99.9999% case by increasing the size of the builder. It would also make the code generated by PostSharp more expensive, so anyone using PostSharp would be taking a performance hit.
gfraiteur commented 6 years ago

Thank you for the detailed answer.

As a workaround, it seems enough for us to have an "ObjectId", so we'll add some field to the state machine and somehow initialize it. This will have some performance impact but this will only affect users of a feature that needs this ObjectId (such as logging) and, additionally, the performance overhead of PostSharp is much smaller than the desired effect, e.g. logging.

gfraiteur commented 6 years ago

The workaround will be implemented in PostSharp 6.0.

Here is the impact on users:

Projects targeting .NET Standard or .NET Core built with an older version of PostSharp will be affected by the defect when running on the .NET Core 2.1 runtime. People will have to recompile their project using PostSharp 6.0 (possibly renewing their support subscription).

The only affected PostSharp feature is logging (using PostSharp.Patterns.Diagnostics) of async methods. ATMB.Task is not exposed to other PostSharp public APIes and therefore PostSharp users could not rely on this implementation detail for their own aspects.

AArnott commented 6 years ago

In general in .NET, it is not considered a breaking change to return a more derived type than the one explicitly specified as the return type, just as it's not considered breaking to return different implementations of an interface from a method declared to return that interface.

I agree. But FWIW, this broke StreamJsonRpc which was using the .NET Standard-friendly subset of reflection APIs to find the Task<T>.Result property on the task returned from an async method. Since .NET Core 2.1 returns a Task<T>-derived type from async methods now, our reflection code fails to find the Result property and it malfunctions, returning bad data to the RPC client.

We'll fix it on streamjsonrpc, of course, but I wanted to +1 that this was an unexpected breaking change, even if it is a calculated risk by your standards. And again, I agree with your standard.

stephentoub commented 6 years ago

I wanted to +1 that this was an unexpected breaking change

You're saying you consider it a breaking change in a method:

public Foo Bar() { ... }

when Foo isn't sealed to return a Foo-derived type when previously it was returning a concrete Foo? That's never been the bar.

jnm2 commented 6 years ago

Any time you use reflection, bets are off unless you are making the same decisions the compiler would have made. By that standard, ignoring inherited Result properties was a bug.

AArnott commented 6 years ago

You're saying you consider it a breaking change in a method...

No. I'm saying it was a breaking change in that it literally broke our library.

But I was also trying to emphasize that I agree with your standard that you should continue to make the trade-offs you did for better performance. Your standard shouldn't consider it a breaking change because if people do the right thing, it won't break them. We obviously weren't doing the right thing here, and we'll fix that.

I just hope that such a change as this would never make it into a servicing release.

stephentoub commented 6 years ago

I'm saying it was a breaking change in that it literally broke our library.

Ok. That's of course possible with any change, any bug fix, any performance improvement, etc.

I was also trying to emphasize that I agree with your standard

Ok.

AArnott commented 6 years ago

@stephentoub I'm dumbfounded that this failure only occurs on some machines when running on .NET Core 2.1. Locally, I see the test failure due to this Task<T>-derived type all the time on Linux and Windows. But on appveyor and Travis CI, both which have .NET Core 2.1 installed, the test passes! The only thing I can imagine is that the Task<T>-derived type only shows up under certain conditions I am not controlling for. Any ideas?

My goal here is to have our CI/PR validation catch such failures as the one that .NET Core 2.1 manifested in our tests. But so far, I haven't had luck since it only repros locally.

stephentoub commented 6 years ago

the Task-derived type only shows up under certain conditions I am not controlling for. Any ideas?

This is from an async Task<T> method? Its result in 2.1 will be derived iff it completes asynchronously. Any chance it's completing synchronously?

AArnott commented 6 years ago

Thanks. At least some of the tests that fail locally have an await Task.Yield() in the async method, so it seems they couldn't be completing synchronously. With the hint you gave though, I'm going to write a test that narrows it down to just the minimal that should reproduce it, and I'll log like crazy and see what I get on the CI machines.

AArnott commented 6 years ago

So bizarre. So after adding logging, the tests all started failing. Then I removed the logging, and the tests are still failing on the CIs (both appveyor and Travis changed simultaneously). So weird. oh well.