dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License
10.05k stars 2.02k forks source link

[Proposal] Support long-running grain calls using a heartbeating mechanism #7649

Open ReubenBond opened 2 years ago

ReubenBond commented 2 years ago

It's common for developers to want long-running grain calls, eg calls which routines run for 10s of seconds or an arbitrarily long time.

Currently, there are two ways to accomplish this, and both have significant detractors:

  1. Configure SiloMessagingOptions.ResponseTimeout and ClientMessagingOptions.ResponseTimeout to some a value large enough to cover the longest call. This has the downside that all calls have the same timeout and therefore they suffer the consequences (eg, deadlocks take longer to unwind, client calls to a failed silo may take much longer to fault, leaving the client waiting on a response which will never arrive).
  2. Use a polling/callbacks pattern: work is started in the background using one method call and the client either continuously checks for progress using an additional call (maybe a single method is used and invoked repeatedly) or registers a callback. These techniques add complexity to both the grain code and the client code.

Proposal

We introduce support for arbitrarily long-running calls by leveraging a heart-beating mechanism.

Mechanism

6672 added functionality to send updates for long-running requests to callers. The original purpose was to aid in diagnostics for blocked calls (eg, synchronous blocking, slow database connection, request deadlocks). This mechanism can be used to extend the timeout for a request. It could be limited to messages which are allowed to be long-running (i.e., dependent on a policy)

Policy

We can consider the policy for long-running calls separately. Here are some options, which aren't all mutually exclusive:

  1. All calls are always allowed to be long-running
  2. A global configuration allows all calls to be long-running
  3. Interleaved/reentrant calls are allowed to be long-running
  4. Grains can request additional time to process a call programmatically. It would seem that RequestContext is the appropriate place to interact with such functionality, but that's TBD
  5. Grain interface methods marked with a [LongRunning] attribute are allowed to be long-running. In terms of implementation, this could result in an InvokeMethodOptions value being set which tells the caller to extend the timeout when a status update is received. This is likely the easiest conservative design

Considerations

Related issues

amccool commented 2 years ago

I don't see any reference to stream message recipients execution duration limits. Currently we are using SMS, so obviously the grain call semantics apply. However we clearly need to be able to control the execution time limits in this use case as well.

Having control over the stream provider time execution would be too coarse, but that might be good enough.

ReubenBond commented 2 years ago

Great point, @amccool. Streams are conceptually separate from grain method calls, but they use calls under the hood. It seems to me that the grain extension interface which Orleans uses under the covers to make stream message delivery work should be marked as long-running.

This is likely true for Reminders, too. I can see an argument against marking Reminder (and perhaps Streams) methods as [LongRunning]: there is a chance that deadlocks can occur (depending on what the application code does), or that one long-running reminder would block another (perhaps short-running) reminder because the method/grain is not interleaving/reentrant. It might not be an issue, and it might be solvable (eg, using cycle detection for long-running calls, and allowing 'interleaving'/'non-interleaving' sections of app code within a method, eg via a using block)

amccool commented 2 years ago

Regarding stream producers I am not sure I think that a consumer error (timeout or not) should be propagated back to the producer. The producer should only get delivery exceptions to the provider. AFAIR in SMS unless the fire/forget is marked the producer does get the timeoutexception.

amccool commented 2 years ago

IMHO - let reminders be big and phat, nearly unbounded, unless defined by the user. The down side is the usual case of scheduling too many big operations and each one can not complete before the next one is scheduled.

JorgeCandeias commented 2 years ago

As a user, I vote for option 5), requiring the [LongRunning] attribute and keeping the default timeout for everything else.

I always sell the 30s timeout as Orleans enforcing a responsiveness guarantee for us, and I've seen benefits in the way it prods developers to think about how code should be structured so it's responsive - meaning either it completes within the timeout or it fails fast. So I think that behaviour should stay by default, as it's highly educational and the timeouts are useful to detect problems in the design.

When some methods must indeed be long running, I think we should be explicit about it, so we don't get things hanging unexpectedly when they should timeout as usual. It would also be nice to specify a custom timeout, maybe a static one on the attribute itself, or an expectation set by the caller somehow, e.g. via the RequestContext as stated. This would only work if the implementation is marked as [LongRunning].

It would be extra nice to have some other way to cancel long-running distributed workloads. The GrainCancellationTokens work just fine, but they look odd in user code as we often need to tie them with regular cancellation tokens.

canitez commented 2 years ago

Is there a clear solution for this issue yet?

bill-poole commented 2 years ago

We could also support having a single CancellationToken argument in grain interface method signatures.

I think this makes sense - i.e., to use the standard .NET async method cancellation pattern. Developers should understand that async methods need to flow cancellation tokens down to all invoked async methods.

I think a global configuration that specifies all calls to be long-running makes sense. Ultimately, that is setting the timeout for all grain method calls to be infinite. I think a [Timeout] attribute applied to methods also makes sense (to override the global timeout setting), where a parameter specifies the timeout and also allows an infinite timeout to be specified.

How are grains notified when the silo containing the grains is being stopped (i.e., Silo.StopAsync has been invoked)? Long-running in-flight grain operations arguably need to be informed when their containing silo is being stopped so those long-running operations can be stopped gracefully. But there also needs to be to immediately cancel all in-flight grain operations when the cancellation token passed to Silo.StopAsync(CancellationToken) is cancelled.

As such, there would arguably be benefit in making two cancellation tokens available to grain methods - one triggered when Silo.StopAsync is invoked, and the other triggered when the cancellation token passed to Silo.StopAsync(CancellationToken) is triggered (or if/when the Silo.StopAsync operation times out).

Perhaps the cancellation token passed as a grain method parameter should cover the latter case (i.e., immediately cancels the method) and the former case (when Silo.StopAsync is invoked) be covered with a cancellation token made available via dependency injection? This cancellation token triggered when Silo.StopAsync is invoked would only be needed for long-running grain operations.

ReubenBond commented 2 years ago

I think a global configuration that specifies all calls to be long-running makes sense. Ultimately, that is setting the timeout for all grain method calls to be infinite.

Infinite timeouts across the board could lead to unexpected behavior, eg calls which hang without any indication.

Opting in to arbitrarily long-running calls only when the method has a CancellationToken parameter is the right approach, I believe. It's intuitive enough and pushes developers into a good practice. It can be verbalized as "The default cancellation for every call is the global ResponseTimeout value, but you can override that by providing your own CancellationToken"

How are grains notified when the silo containing the grains is being stopped (i.e., Silo.StopAsync has been invoked)?

DeactivateAsync, which now (in 4.x) includes a reason code. For example, DeactivationReasonCode.ShuttingDown in the case of app shutdown.

bill-poole commented 2 years ago

Infinite timeouts across the board could lead to unexpected behavior, eg calls which hang without any indication.

Yes good point. It would be bad to eliminate the timeout safety net for resolving deadlocks caused by cycles on non-interleaved/re-entrant grains on a global basis.

Opting in to arbitrarily long-running calls only when the method has a CancellationToken parameter is the right approach, I believe. It's intuitive enough and pushes developers into a good practice. It can be verbalized as "The default cancellation for every call is the global ResponseTimeout value, but you can override that by providing your own CancellationToken".

Presumably you mean you can override that by providing your own CancellationToken and by decorating the method with an attribute that overrides the configured default/global timeout behaviour? I would imagine that the default behaviour for a method with a CancellationToken parameter would be for that token to be triggered if the timeout expires. i.e., you'd need to provide a CancellationToken parameter and override the default timeout behaviour for a method to be "long-running".

Perhaps a [Timeout] attribute could be applied to a grain method to specify the desired timeout for that method? i.e., an infinite timeout could be specified on a per-method basis, not as a global default. i.e., [Timeout(TimeSpan.InfiniteTimeSpan)] would have the same semantics as the [LongRunning] attribute proposed above.

DeactivateAsync, which now (in 4.x) includes a reason code. For example, DeactivationReasonCode.ShuttingDown in the case of app shutdown.

Excellent. I assume you mean OnDeactivateAsync? Should an override of this method be provided that accepts a CancellationToken? i.e., should a timeout be configurable for deactivating a grain? If Silo.StopAsync(CancellationToken) is invoked, then should the CancellationToken not flow to the OnDeactivateAsync method?

ReubenBond commented 2 years ago

Presumably you mean you can override that by providing your own CancellationToken and by decorating the method with an attribute that overrides the configured default/global timeout behaviour? I would imagine that the default behaviour for a method with a CancellationToken parameter would be for that token to be triggered if the timeout expires. i.e., you'd need to provide a CancellationToken parameter and override the default timeout behaviour for a method to be "long-running".

I mean just having a cancellation token parameter. I think [LongRunning] (etc) shouldn't be necessary in that case.

Excellent. I assume you mean OnDeactivateAsync? Should an override of this method be provided that accepts a CancellationToken?

Yes, it does accept a CancellationToken. It cancels after a configurable timeout, but we could also cancel it once the StopAsync cancellation token is canceled. Feel free to open an issue to discuss/track that, since it's not related to this issue.

Similarly, for [Timeout(TimeSpan)], that should be discussed on a new issue, linking back to https://github.com/dotnet/orleans/issues/4328

bill-poole commented 2 years ago

I mean just having a cancellation token parameter. I think [LongRunning] (etc) shouldn't be necessary in that case.

Personally, I would expect the CancellationToken passed as a parameter to be triggered if/when the timeout expires. I would find it counterintuitive for the presence of a CancellationToken parameter to override the default timeout behaviour to be infinite. Apologies if I have misunderstood your proposal.

ReubenBond commented 2 years ago

What you're suggesting sounds fine to me, too.

ghost commented 2 years ago

We've moved this issue to the Backlog. This means that it is not going to be worked on for the coming release. We review items in the backlog at the end of each milestone/release and depending on the team's priority we may reconsider this issue for the following milestone.

bill-poole commented 2 years ago

Note that gRPC provides a heartbeating mechanism. This would be another benefit of Orleans using gRPC for all grain-to-grain communication (using whatever serializer is configured - it doesn't have to be Protobuf) as I proposed here.

ElanHasson commented 2 years ago

I like [LongRunning] and leaving everything the same.

I'd like to see an overload for [LongRunning(Timeout="00:02:00"] to allow for setting timeout on individual grain method.

Also, a mechanism to set how often a heartbeat should happen before it's killed: [LongRunning(Timeout="01:00:00", HeartbeatInterval="00:00:30" ].

ugumba commented 1 year ago

Manually implementing polling is tricky and error-prone. Polling introduces an (arbitrarily chosen) heartbeat delay, which reduces responsiveness. Depending on the chosen delay and the actual elapsed time of the method, polling also introduces overhead which may be non-negligible. In high traffic scenarios, it might turn detrimental.

Extending the timeout for individual methods (e.g. [LongRunning]) is a small improvement, but the overridden timeout is still an arbitrary choice, which I don't like.

I used to prefer the idea of a grain callback, and had this implemented straight-forwardly in a grain-to-grain scenario. However, I now find I need to support client-to-grain calls as well, where a callback seems impossible (?).

My counter-proposal is to add support for IAsyncEnumerable<T> as the grain method result type, and leverage this for long-running methods. (#6504 #7830)

A long-running grain method may then periodically (at, say, half the timeout span) yield return a T which indicates that processing is still running, and, importantly, resets the timeout timer. The final T is the end result. An exception must be propagated naturally. (The interpretation of T is part of the grain contract - it can even include progress information or partial results.)

Supporting IAsyncEnumerable<T> obviously also enables a host of other lightweight streaming scenarios (like GRPC, and without the need for an expensive streaming backend).

aradalvand commented 5 days ago

Any updates? This is painfully missing. Would really like to see support for this added to Orleans soon.