jhalterman opened 4 years ago
Hi Jonathan,
I would be very happy to see an option for timeouts to complete the returned promise before the user-provided supplier completes. Actually, the fact that this is not the default behavior confused me quite a bit when I started using the library, which was not long ago.
My two cents!
@Tembrel I'm curious what you think about this feature. It's come up a few times, #263 being one other place. I definitely wouldn't make this the default behavior for Timeout, but I don't mind adding it as an option with some clear caveats in the docs. Any thoughts? Also, any thoughts on the name abandon? I also thought of "abort", but in this case that's more analogous to what we do when we attempt to cancel or interrupt. Abandon goes beyond that and just gives up.
Here's a PR that implements this feature: #284
Seems reasonable to me. "Abandon" is a good word for this, with the implication that the task (supplier) is left to its own devices. Might want to add a warning somewhere that repeated abandonment could lead to thread exhaustion.
After sleeping on it, I'm a little unsure whether this is even a good idea or if it's too dangerous. Some reasons it might be too dangerous:
I'm wondering if there's a good enough use case for this to justify the potential danger. Perhaps @timothybasanov or @MidnightRider13 could weigh in with a use case.
Not a Contribution.
I fully agree that this is a potentially dangerous behaviour change. In my case, I'm ready to pay almost any price to get timeouts to work. In the current implementation, I can't even reliably report metrics on failures, or return an error response to the user within a guaranteed timeframe. I don't mind too many RPC requests in-flight as they will be short-circuited by a circuit breaker, and I don't mind too many executions abandoned as I know in advance which ones these could be, and I can give them a much bigger thread pool.
There are a few things you may consider to make it better:
Right now we follow this pattern for RPC calls which kinda reimplements "abandon timeouts" by hand:
Failsafe.timeout(100ms, 10ms).getAsync(ctx -> rpc.call(ctx.timeout())).join()
Here the "request context timeout" (90ms) is the "request timeout" (100ms) minus the "abandon timeout" (10ms). When would it be useful to have staggered retries anyway? All of this applies even if my code handles cancellations well.
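Spelled out with the existing 2.x API, that hand-rolled pattern looks roughly like this. This is just a sketch: rpc.call(Duration) stands in for our own RPC client method, and the 90ms is the 100ms request timeout minus the 10ms abandon timeout:
Timeout<Object> requestTimeout = Timeout.of(Duration.ofMillis(100)).withInterrupt(true);
Object result = Failsafe.with(requestTimeout)
  .getAsync(ctx -> rpc.call(Duration.ofMillis(90))) // the RPC client enforces the shorter deadline itself
  .join();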
Thanks for the response @timothybasanov. So it sounds like your overall goal is to respond with something, even if it's a timeout error, rather than be blocked waiting for an RPC call to be cancelled/interrupted? Is it possible that even if the RPC call can be interrupted, it might not be fast enough for your response time requirements?
Allow reporting a metric on the number of abandoned executions (a listener or a callback?)
It would probably make sense to put such a metric on the Timeout object. What do you think?
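In the meantime, here's a rough sketch of approximating this with the existing completion listeners. It counts every TimeoutExceededException rather than true abandonments, so it's only an upper bound on the abandoned count:
AtomicLong timedOutCount = new AtomicLong();
Timeout<Object> timeout = Timeout.of(Duration.ofMillis(100)).withInterrupt(true);
Failsafe.with(timeout)
  .onFailure(e -> {
    if (e.getFailure() instanceof TimeoutExceededException)
      timedOutCount.incrementAndGet(); // report this to your metrics system
  })
  .getAsync(this::foo);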
Give an additional parameter to configure a delay after which an execution is considered abandoned, e.g. 10ms for an execution to finish after Failsafe cancels it. This way, if it completes 1ms after a cancellation, I can still get the original InterruptedException into a stack trace that will be logged.
If a Timeout occurs, the execution attempt from Failsafe's perspective will be completed with TimeoutExceededException, regardless of whether the execution is abandoned or eventually returns some other result. That being the case, is an additional delay between cancelling and abandoning needed? Can't your code still log the InterruptedException?
If it was not cancelled correctly, complete a future from Failsafe, and report it somehow. This way I can report failure metrics and send an error to a user.
You mean even if we abandon an execution you'd want to know if it completes so we can get a sense for how responsive executions have been?
Ideally I want to send a request, wait 10ms, send another request, wait 20ms, and send a third request, with a global timeout (across all retries and delays) of 100ms. Whichever request finishes first is returned to the end user. This way, even if the first request is stuck, I can still respond to the user within 10+10ms, and it will not take more than 100ms to report an error.
This is a really interesting use case, something like this?
Timeout<Object> globalTimeout = Timeout.of(Duration.ofMillis(100)).withInterrupt(true).withAbandon(true);
Timeout<Object> requestTimeout = Timeout.of(Duration.ofMillis(10)).withInterrupt(true).withAbandon(true);
RetryPolicy<Object> retryPolicy = ...
Failsafe.with(globalTimeout, retryPolicy, requestTimeout).run(this::rpc);
With 2.4.2, timeouts work more as expected in this way, where they can be placed outside or inside a retry policy.
Here, "request context timeouts" would also be useful, so that I can pass 100-10ms, 100-10-10ms, 100-10-20-10ms to each of three RPC calls
Can you explain more what you mean by this?
It has been some time since I reviewed the issue under discussion, so please take my input with a grain of salt and forgive any misconceptions I may have about the existing behavior.
It seemed to me that the existing Failsafe behavior violated the principle of least astonishment. If a user configures a timeout, I think the most reasonable expectation is that the timeout will trigger the configured behavior as soon as the threshold is breached, not at some indeterminate time in the future. To me, a timeout that does not act at the configured threshold fails to conform to the commonly understood notion of what it means to be a timeout.
In my opinion, the risk of making it too easy for users to abuse system resources (e.g. thread pools) is the lesser sin compared to asking users to reimagine what it means for something to be a timeout, if that makes sense. Granted, it isn't necessarily an either/or: adding the abandon option makes it possible for users to pick the desired behavior.
Thanks for the response @MidnightRider13. Do you have any concerns about abandoned threads potentially piling up or never completing? @timothybasanov's suggestion of adding some metrics to be able to inspect this information might help mitigate this risk.
@jhalterman @timothybasanov I see thread exhaustion as a very real concern when abandoning threads. Any additional guidance from Failsafe on navigating the issue would be a plus in my eyes, whether it is a word-of-caution in the documentation, metrics in the code, etc.
Again, it's been a while since I was working on this, but it seems Failsafe already offers a very elegant solution to this problem in the form of the circuit breaker. If units of work are frequently being abandoned due to a breached timeout policy, I would imagine the correct response would frequently be to open the circuit?
I see thread exhaustion as a very real concern when abandoning threads. Any additional guidance from Failsafe on navigating the issue would be a plus in my eyes, whether it is a word-of-caution in the documentation, metrics in the code, etc.
But to be clear, you still think it's a feature worth having?
I would imagine the correct response would frequently be to open the circuit?
Sure, that could work, maybe something like this:
Timeout<Object> timeout = Timeout.of(Duration.ofMillis(100));
CircuitBreaker<Object> cb = new CircuitBreaker<>()
  .handleIf(failure -> failure instanceof TimeoutExceededException
    && ((TimeoutExceededException) failure).isAbandon()) // isAbandon() would be part of the proposed API
  .withFailureThreshold(3);
Failsafe.with(cb, timeout).getAsync(this::foo);
This would open the circuit after the last 3 failures were abandoned. But that doesn't necessarily tell us how many abandoned threads there were in total, or how many might still be running. We could probably use a metric for that. For example:
Timeout<Object> timeout = Timeout.of(Duration.ofMillis(100));
CircuitBreaker<Object> cb = new CircuitBreaker<>()
  .handleIf(f -> timeout.getCurrentAbandonedCount() >= 5); // getCurrentAbandonedCount() would be a new metric
Failsafe.with(cb, timeout).getAsync(this::foo);
This would open the breaker if the "currentAbandonedCount" after an execution reaches 5. Depending on what our goal is here, such as waiting for the "currentAbandonedCount" to come back down a bit, this might be a bit of an awkward way to accomplish it.
Another option that might be more natural would be to just use a fixed size threadpool/ExecutorService, where enough abandoned threads means eventually the pool is fully utilized and execution attempts just block while waiting on a thread.
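For example, a minimal sketch, assuming the proposed abandon option existed on Timeout, where the pool size caps how many abandoned threads can pile up:
ExecutorService executor = Executors.newFixedThreadPool(20); // at most 20 threads, abandoned or not
Timeout<Object> timeout = Timeout.of(Duration.ofMillis(100)).withInterrupt(true); // plus the proposed withAbandon(true)
Failsafe.with(timeout).with(executor).getAsync(this::foo);
Once all 20 threads are tied up by abandoned executions, new attempts queue rather than spawning more threads.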
Not a Contribution.
Yep, something along these lines would work for me. In practice, even if Timeout does not expose anything about abandoned threads, I can always make it work by doing all the actual work on a separate thread pool where I control its size and other metrics. If Timeout exposes the number of executions that were abandoned (since JVM startup or at the current moment in time), or any other similar stat, I'll make it work for me.
Let me try to be more specific about the desired behaviour of Failsafe, given a very specific usage scenario:
Timeout<Object> timeout = Timeout.of(Duration.ofMillis(100)).abandonAfter(Duration.ofMillis(10)).withInterrupt(true);
Object data = Failsafe.with(timeout).getAsync(ctx -> loadDataViaRpc(ctx.getAttemptTimeoutDuration() * 0.9)).join(); // sketch: pass ~90% of the attempt's timeout to the RPC
- loadDataViaRpc(100ms) is called, and it tries its best to finish within, say, 90ms (to leave room for some potential post-processing overhead). One of the expected results is RpcCallTimeoutException if a call cannot be finished within 90ms. Using this exception as the "cause" of the exception that fails the end-user request will provide all the necessary context (e.g. server name).
- If loadDataViaRpc() misbehaves and has not finished within 100ms, Failsafe interrupts that future/thread. One of the expected causes of this is some post-processing task (e.g. serialization, metrics reporting) unexpectedly taking 200ms.
- If loadDataViaRpc() can correctly handle an interrupt, it will quickly throw some kind of AppTimeOutException(InterruptedException). Using this exception to fail a user request will give some context on what happened, i.e. some particular piece of code took too long to execute and has not finished within the 10ms given to process the RPC response. We will not have all the details of RpcCallTimeoutException, but it's better than nothing.
- Should loadDataViaRpc() misbehave so badly that it does not even handle an interrupt, Failsafe gives up on it (abandons it) after 10ms and returns failsafe.TimeoutExceededException to the caller. This exception does not even know which piece of application code failed to respond to an interrupt, so it will be the least useful one when logged. Still, it allows returning some response to the end user after only 110ms spent within this loading code.
In short, there should be some way to distinguish between these scenarios, down to the worst case where only a bare TimeoutExceededException is returned to the end user. In practice, the last case almost never happens in real life, but we should still be able to detect it and handle it, so these issues can be reported and fixed.
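From the caller's side, a hypothetical sketch of telling these cases apart (RpcCallTimeoutException and AppTimeOutException are our application's exceptions, not Failsafe's):
try {
  Object data = Failsafe.with(timeout).getAsync(ctx -> loadDataViaRpc(Duration.ofMillis(90))).join();
} catch (CompletionException e) {
  Throwable cause = e.getCause();
  if (cause instanceof RpcCallTimeoutException) {
    // case 1: the RPC client timed itself out within 90ms; full context (e.g. server name)
  } else if (cause instanceof AppTimeOutException) {
    // case 3: Failsafe interrupted us and our code handled it; partial context
  } else if (cause instanceof TimeoutExceededException) {
    // case 4: the interrupt was ignored and the execution abandoned; no application context
  }
}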
Not a contribution.
I moved Support staggered retries (#291) and Make some internal APIs more open (#292) out into separate issues.
@timothybasanov Thanks for describing your desired scenario. I'm curious, can you tell me more about how your code handles a cancellation or interruption?
At first glance this scenario seems a natural fit for a Fallback. But it begs the question of whether the information contained in an RpcCallTimeoutException or AppTimeOutException is accessible only from within the Runnable/Supplier being executed, or if it's possible to provide that info to a Fallback. Can you share more detail here?
One idea (that still needs some more thought) is that the Runnable/Supplier could store information about the call (request id, server name, etc) inside the ExecutionContext, which could then be available to outer policies like Fallbacks, even if the execution ends up being cancelled. Ex:
Fallback<Object> fallback = Fallback.ofException(e -> {
  // e.get(...) would be the proposed context-attribute lookup described above
  return new RpcCallTimeoutException(e.get("serverName"), e.get("requestId"));
}).handle(TimeoutExceededException.class);
Timeout<Object> timeout = Timeout.of(Duration.ofMillis(100));
Failsafe.with(fallback, timeout).getAsync(ctx -> {
  String serverName = config.getServerName();
  String requestId = createRequestId();
  ctx.set("serverName", serverName);
  ctx.set("requestId", requestId);
  return makeRequest(serverName, requestId);
});
If a Fallback can't be used, then we would need some new capability within Timeout to provide a grace period for an execution to naturally complete after a timeout/interruption, before a TimeoutExceededException is recorded. Is this what you're thinking?
Not a Contribution.
Cancellation examples:
Fallbacks can't really be used. Some example details that only the underlying RPC framework may know during an execution: "Tried to resolve a hostname that we got dynamically", "RPC context contained this metadata set in an interceptor". In practice, in the place where I configure a Failsafe retry, I only have a lambda to call, so I don't have access to any context.
Yep, that's exactly what I'm thinking!
Currently, Failsafe's Timeout policy will wait until a user-provided supplier has completed before completing. This means that the Future that Failsafe returns to a user can't be completed, even if a timeout is triggered, until the user's supplier completes. This may not be desirable behavior.
That said, if a Timeout can return a completed promise before the user-provided supplier is complete, that means an outer RetryPolicy could trigger additional executions of the user-provided supplier while previous executions are still running, which in turn may time out and pile up. I believe this is to be expected for some use cases, but it's an important behavioral change to consider.
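For example, in a composition like this sketch (withAbandon(true) being the proposed option), each retry could start while the previous attempt, already abandoned by the inner timeout, is still running on its thread:
RetryPolicy<Object> retryPolicy = new RetryPolicy<>().withMaxRetries(3);
Timeout<Object> timeout = Timeout.of(Duration.ofMillis(100)); // with the proposed withAbandon(true)
Failsafe.with(retryPolicy, timeout).getAsync(this::slowSupplier); // slowSupplier: any supplier that may hang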
Do we want Timeout to always wait for a supplier to complete before allowing outer policies to proceed, such as with retries? Should abandoning an execution at least be an option (timeout.withAbandon())? Thoughts welcome.