Open ReubenBond opened 2 years ago
I don't see any reference to stream message recipients execution duration limits. Currently we are using SMS, so obviously the grain call semantics apply. However we clearly need to be able to control the execution time limits in this use case as well.
Having control over the stream provider time execution would be too coarse, but that might be good enough.
Great point, @amccool. Streams are conceptually separate from grain method calls, but they use calls under the hood. It seems to me that the grain extension interface which Orleans uses under the covers to make stream message delivery work should be marked as long-running.
This is likely true for Reminders, too.
I can see an argument against marking Reminder (and perhaps Streams) methods as [LongRunning]
: there is a chance that deadlocks can occur (depending on what the application code does), or that one long-running reminder would block another (perhaps short-running) reminder because the method/grain is not interleaving/reentrant. It might not be an issue, and it might be solvable (eg, using cycle detection for long-running calls, and allowing 'interleaving'/'non-interleaving' sections of app code within a method, eg via a using
block)
Regarding stream producers I am not sure I think that a consumer error (timeout or not) should be propagated back to the producer. The producer should only get delivery exceptions to the provider. AFAIR in SMS unless the fire/forget is marked the producer does get the timeoutexception
.
IMHO - let reminders be big and phat, nearly unbounded, unless defined by the user. The down side is the usual case of scheduling too many big operations and each one can not complete before the next one is scheduled.
As a user, I vote for option 5), requiring the [LongRunning]
attribute and keeping the default timeout for everything else.
I always sell the 30s timeout as Orleans enforcing a responsiveness guarantee for us, and I've seen benefits in the way it prods developers to think about how code should be structured so it's responsive - meaning either it completes within the timeout or it fails fast. So I think that behaviour should stay by default, as it's highly educational and the timeouts are useful to detect problems in the design.
When some methods must indeed be long running, I think we should be explicit about it, so we don't get things hanging unexpectedly when they should timeout as usual. It would also be nice to specify a custom timeout, maybe a static one on the attribute itself, or an expectation set by the caller somehow, e.g. via the RequestContext
as stated. This would only work if the implementation is marked as [LongRunning]
.
It would be extra nice to have some other way to cancel long-running distributed workloads. The GrainCancellationToken
s work just fine, but they look odd in user code as we often need to tie them with regular cancellation tokens.
Is there a clear solution for this issue yet?
We could also support having a single CancellationToken argument in grain interface method signatures.
I think this makes sense - i.e., to use the standard .NET async method cancellation pattern. Developers should understand that async methods need to flow cancellation tokens down to all invoked async methods.
I think a global configuration that specifies all calls to be long-running makes sense. Ultimately, that is setting the timeout for all grain method calls to be infinite. I think a [Timeout]
attribute applied to methods also makes sense (to override the global timeout setting), where a parameter specifies the timeout and also allows an infinite timeout to be specified.
How are grains notified when the silo containing the grains is being stopped (i.e., Silo.StopAsync
has been invoked)? Long-running in-flight grain operations arguably need to be informed when their containing silo is being stopped so those long-running operations can be stopped gracefully. But there also needs to be to immediately cancel all in-flight grain operations when the cancellation token passed to Silo.StopAsync(CancellationToken)
is cancelled.
As such, there would arguably be benefit in making two cancellation tokens available to grain methods - one triggered when Silo.StopAsync
is invoked, and the other triggered when the cancellation token passed to Silo.StopAsync(CancellationToken)
is triggered (or if/when the Silo.StopAsync
operation times out).
Perhaps the cancellation token passed as a grain method parameter should cover the latter case (i.e., immediately cancels the method) and the former case (when Silo.StopAsync
is invoked) be covered with a cancellation token made available via dependency injection? This cancellation token triggered when Silo.StopAsync
is invoked would only be needed for long-running grain operations.
I think a global configuration that specifies all calls to be long-running makes sense. Ultimately, that is setting the timeout for all grain method calls to be infinite.
Infinite timeouts across the board could lead to unexpected behavior, eg calls which hang without any indication.
Opting in to arbitrarily long-running calls only when the method has a CancellationToken
parameter is the right approach, I believe. It's intuitive enough and pushes developers into a good practice. It can be verbalized as "The default cancellation for every call is the global ResponseTimeout
value, but you can override that by providing your own CancellationToken
"
How are grains notified when the silo containing the grains is being stopped (i.e., Silo.StopAsync has been invoked)?
DeactivateAsync
, which now (in 4.x) includes a reason code. For example, DeactivationReasonCode.ShuttingDown
in the case of app shutdown.
Infinite timeouts across the board could lead to unexpected behavior, eg calls which hang without any indication.
Yes good point. It would be bad to eliminate the timeout safety net for resolving deadlocks caused by cycles on non-interleaved/re-entrant grains on a global basis.
Opting in to arbitrarily long-running calls only when the method has a CancellationToken parameter is the right approach, I believe. It's intuitive enough and pushes developers into a good practice. It can be verbalized as "The default cancellation for every call is the global ResponseTimeout value, but you can override that by providing your own CancellationToken".
Presumably you mean you can override that by providing your own CancellationToken
and by decorating the method with an attribute that overrides the configured default/global timeout behaviour? I would imagine that the default behaviour for a method with a CancellationToken
parameter would be for that token to be triggered if the timeout expires. i.e., you'd need to provide a CancellationToken
parameter and override the default timeout behaviour for a method to be "long-running".
Perhaps a [Timeout]
attribute could be applied to a grain method to specify the desired timeout for that method? i.e., an infinite timeout could be specified on a per-method basis, not as a global default. i.e., [Timeout(TimeSpan.InfiniteTimeSpan)]
would have the same semantics as the [LongRunning]
attribute proposed above.
DeactivateAsync, which now (in 4.x) includes a reason code. For example, DeactivationReasonCode.ShuttingDown in the case of app shutdown.
Excellent. I assume you mean OnDeactivateAsync
? Should an override of this method be provided that accepts a CancellationToken
? i.e., should a timeout be configurable for deactivating a grain? If Silo.StopAsync(CancellationToken)
is invoked, then should the CancellationToken
not flow to the OnDeactivateAsync
method?
Presumably you mean you can override that by providing your own
CancellationToken
and by decorating the method with an attribute that overrides the configured default/global timeout behaviour? I would imagine that the default behaviour for a method with aCancellationToken
parameter would be for that token to be triggered if the timeout expires. i.e., you'd need to provide aCancellationToken
parameter and override the default timeout behaviour for a method to be "long-running".
I mean just having a cancellation token parameter. I think [LongRunning]
(etc) shouldn't be necessary in that case.
Excellent. I assume you mean OnDeactivateAsync? Should an override of this method be provided that accepts a
CancellationToken
?
Yes, it does accept a CancellationToken
. It cancels after a configurable timeout, but we could also cancel it once the StopAsync
cancellation token is canceled. Feel free to open an issue to discuss/track that, since it's not related to this issue.
Similarly, for [Timeout(TimeSpan)]
, that should be discussed on a new issue, linking back to https://github.com/dotnet/orleans/issues/4328
I mean just having a cancellation token parameter. I think [LongRunning] (etc) shouldn't be necessary in that case.
Personally, I would expect the CancellationToken
passed as a parameter to be triggered if/when the timeout expires. I would find it counterintuitive for the presence of a CancellationToken
parameter to override the default timeout behaviour to be infinite. Apologies if I have misunderstood your proposal.
What you're suggesting sounds fine to me, too.
We've moved this issue to the Backlog. This means that it is not going to be worked on for the coming release. We review items in the backlog at the end of each milestone/release and depending on the team's priority we may reconsider this issue for the following milestone.
Note that gRPC provides a heartbeating mechanism. This would be another benefit of Orleans using gRPC for all grain-to-grain communication (using whatever serializer is configured - it doesn't have to be Protobuf) as I proposed here.
I like [LongRunning]
and leaving everything the same.
I'd like to see an overload for [LongRunning(Timeout="00:02:00"]
to allow for setting timeout on individual grain method.
Also, a mechanism to set how often a heartbeat should happen before it's killed: [LongRunning(Timeout="01:00:00", HeartbeatInterval="00:00:30" ]
.
Manually implementing polling is tricky and error-prone. Polling introduces an (arbitrarily chosen) heartbeat delay, which reduces responsiveness. Depending on the chosen delay and the actual elapsed time of the method, polling also introduces overhead which may be non-negligible. In high traffic scenarios, it might turn detrimental.
Extending the timeout for individual methods (e.g. [LongRunning]
) is a small improvement, but the overridden timeout is still an arbitrary choice, which I don't like.
I used to prefer the idea of a grain callback, and had this implemented straight-forwardly in a grain-to-grain scenario. However, I now find I need to support client-to-grain calls as well, where a callback seems impossible (?).
My counter-proposal is to add support for IAsyncEnumerable<T>
as the grain method result type, and leverage this for long-running methods. (#6504 #7830)
A long-running grain method may then periodically (at, say, half the timeout span) yield return
a T
which indicates that processing is still running, and, importantly, resets the timeout timer. The final T
is the end result. An exception must be propagated naturally. (The interpretation of T
is part of the grain contract - it can even include progress information or partial results.)
Supporting IAsyncEnumerable<T>
obviously also enables a host of other lightweight streaming scenarios (like GRPC, and without the need for an expensive streaming backend).
Any updates? This is painfully missing. Would really like to see support for this added to Orleans soon.
It's common for developers to want long-running grain calls, eg calls which routines run for 10s of seconds or an arbitrarily long time.
Currently, there are two ways to accomplish this, and both have significant detractors:
SiloMessagingOptions.ResponseTimeout
andClientMessagingOptions.ResponseTimeout
to some a value large enough to cover the longest call. This has the downside that all calls have the same timeout and therefore they suffer the consequences (eg, deadlocks take longer to unwind, client calls to a failed silo may take much longer to fault, leaving the client waiting on a response which will never arrive).Proposal
We introduce support for arbitrarily long-running calls by leveraging a heart-beating mechanism.
Mechanism
6672 added functionality to send updates for long-running requests to callers. The original purpose was to aid in diagnostics for blocked calls (eg, synchronous blocking, slow database connection, request deadlocks). This mechanism can be used to extend the timeout for a request. It could be limited to messages which are allowed to be long-running (i.e., dependent on a policy)
Policy
We can consider the policy for long-running calls separately. Here are some options, which aren't all mutually exclusive:
RequestContext
is the appropriate place to interact with such functionality, but that's TBD[LongRunning]
attribute are allowed to be long-running. In terms of implementation, this could result in anInvokeMethodOptions
value being set which tells the caller to extend the timeout when a status update is received. This is likely the easiest conservative designConsiderations
[LongRunning]
messages which have expired. We could extend that to all messages (not just[LongRunning]
) with little/no negative effect.GrainCancellationToken
, but that is not ideal: it is not pay-for-what-you use and currently requires scanning all arguments, which therefore boxes every argument in order to perform the type check. It was implemented with what was available at the time, but we have completely rewritten our RPC system since then and could add support for grain cancellation tokens. It seems that this proposal touches on the same area, since it's about timeouts (cancellation, essentially - would it be useful to cancel abandoned calls automatically?). UsingRequestContext
to flow aCancellationToken
down the call chain seems appropriate. We could also support having a singleCancellationToken
argument in grain interface method signatures. That could then be flowed to theRequestContext
, but it's obvious that there is room for gotchas, like when both anCancellationToken
argument and aRequestContext.Canceled
token is present, but the former isn't flowed by the developer down the chain and the latter is implicitly flowed. We should expand in another issue/proposal.[OneWay]
calls? Currently, we do not register[OneWay]
calls with the local callback collection, so they are entirely untracked by the caller (since timeouts are not possible). If there is a programmatic way to extend call duration, that information would be ignored by the caller. A similar issue exists, for cancellation: if cancellation is lazy (eg, waits until the next heartbeat) then there would be no rendezvous point.Related issues
4328