jhalterman opened 4 years ago
Hi Jonathan,
I would be very happy to see an option for timeouts to complete the returned promise before the user-provided supplier completes. Actually, the fact that this is not the default behavior confused me quite a bit when I started using the library, which was not long ago.
My two cents!
@Tembrel I'm curious what you think about this feature. It's come up a few times, #263 being one other place. I definitely wouldn't make this the default behavior for Timeout, but I don't mind adding it as an option with some clear caveats in the docs. Any thoughts? Also, any thoughts on the name abandon? I also thought of "abort", but in this case that's more analogous to what we do when we attempt to cancel or interrupt. Abandon goes beyond that and just gives up.
Here's a PR that implements this feature: #284
Seems reasonable to me. "Abandon" is a good word for this, with the implication that the task (supplier) is left to its own devices. Might want to add a warning somewhere that repeated abandonment could lead to thread exhaustion.
After sleeping on it, I'm a little unsure whether this is even a good idea or if it's too dangerous. Some reasons it might be too dangerous:
I'm wondering if there's a good enough use case for this to justify the potential danger. Perhaps @timothybasanov or @MidnightRider13 could weigh in with a use case.
Not a Contribution.
I fully agree that this is a potentially dangerous behaviour change. In my case, I'm ready to pay almost any price to get timeouts to work. In the current implementation, I can't even reliably report metrics on failures, or return an error response to the user within a guaranteed timeframe. I don't mind too many RPC requests in-flight as they will be short-circuited by a circuit breaker, and I don't mind too many executions abandoned as I know in advance which ones these could be, and I can give them a much bigger thread pool.
There are a few things you may consider to make it better:
Right now we follow this pattern for RPC calls which kinda reimplements "abandon timeouts" by hand:
Failsafe.timeout(100ms, 10ms).getAsync(ctx -> rpc.call(ctx.timeout())).join()
Here the "request context timeout" (90ms) is the "request timeout" (100ms) minus the "abandon timeout" (10ms). When would it be useful to have staggered retries anyway? All of this applies even if my code handles cancellations well.
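Spelled out with the existing 2.x API, that hand-rolled pattern looks roughly like this. This is just a sketch: rpc.call(Duration) stands in for our own RPC client method, and the 90ms is the 100ms request timeout minus the 10ms abandon timeout:
Timeout<Object> requestTimeout = Timeout.of(Duration.ofMillis(100)).withInterrupt(true);
Object result = Failsafe.with(requestTimeout)
  .getAsync(ctx -> rpc.call(Duration.ofMillis(90))) // the RPC client enforces the shorter deadline itself
  .join();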
Thanks for the response @timothybasanov. So it sounds like your overall goal is to respond with something, even if it's a timeout error, rather than be blocked waiting for an RPC call to be cancelled/interrupted? Is it possible that even if the RPC call can be interrupted, it might not be fast enough for your response time requirements?
Allow reporting a metric on the number of abandoned executions (a listener or a callback?)
It would probably make sense to put such a metric on the Timeout object. What do you think?
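In the meantime, here's a rough sketch of approximating this with the existing completion listeners. It counts every TimeoutExceededException rather than true abandonments, so it's only an upper bound on the abandoned count:
AtomicLong timedOutCount = new AtomicLong();
Timeout<Object> timeout = Timeout.of(Duration.ofMillis(100)).withInterrupt(true);
Failsafe.with(timeout)
  .onFailure(e -> {
    if (e.getFailure() instanceof TimeoutExceededException)
      timedOutCount.incrementAndGet(); // report this to your metrics system
  })
  .getAsync(this::foo);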
Give an additional parameter to configure a delay after which an execution is considered abandoned, e.g. 10ms for an execution to finish after Failsafe cancels it. This way, if it completes 1ms after a cancellation, I can still get the original InterruptedException into a stack trace that will be logged.
If a Timeout occurs, the execution attempt from Failsafe's perspective will be completed with TimeoutExceededException, regardless of whether the execution is abandoned or eventually returns some other result. That being the case, is an additional delay between cancelling and abandoning needed? Can't your code still log the InterruptedException?
If it was not cancelled correctly, complete a future from Failsafe, and report it somehow. This way I can report failure metrics and send an error to a user.
You mean even if we abandon an execution you'd want to know if it completes so we can get a sense for how responsive executions have been?
Ideally I want to send a request, wait 10ms, send another request, wait 20ms, and send a third request, with a global timeout (across all retries and delays) of 100ms. Whichever request finishes first is returned to the end user. This way, even if the first request is stuck, I can still respond to the user within 10+10ms, and it will not take more than 100ms to report an error.
This is a really interesting use case, something like this?
Timeout<Object> globalTimeout = Timeout.of(Duration.ofMillis(100)).withInterrupt(true).withAbandon(true);
Timeout<Object> requestTimeout = Timeout.of(Duration.ofMillis(10)).withInterrupt(true).withAbandon(true);
RetryPolicy<Object> retryPolicy = ...
Failsafe.with(globalTimeout, retryPolicy, requestTimeout).run(this::rpc);
With 2.4.2, timeouts work more as expected in this way, where they can be placed outside or inside a retry policy.
Here, "request context timeouts" would also be useful, so that I can pass 100-10ms, 100-10-10ms, 100-10-20-10ms to each of three RPC calls
Can you explain more what you mean by this?
It has been some time since I reviewed the issue under discussion, so please take my input with a grain of salt and forgive any misconceptions I may have about the existing behavior.
It seemed to me that the existing Failsafe behavior violated the principle of least astonishment. If a user configures a timeout, I think the most reasonable expectation is that the timeout will trigger the configured behavior as soon as the threshold is breached, not at some indeterminate time in the future. To me, a timeout that does not act at the configured threshold fails to conform to the commonly understood notion of what it means to be a timeout.
In my opinion, the risk of making it too easy for users to abuse system resources (e.g. thread pools) is the lesser sin compared to asking users to reimagine what it means for something to be a timeout, if that makes sense. Granted, it isn't necessarily an either/or: adding the abandon option makes it possible for users to pick the desired behavior.
Thanks for the response @MidnightRider13. Do you have any concerns about abandoned threads potentially piling up or never completing? @timothybasanov's suggestion of adding some metrics to be able to inspect this information might help mitigate this risk.
@jhalterman @timothybasanov I see thread exhaustion as a very real concern when abandoning threads. Any additional guidance from Failsafe on navigating the issue would be a plus in my eyes, whether it is a word-of-caution in the documentation, metrics in the code, etc.
Again, it's been a while since I was working on this, but it seems Failsafe already offers a very elegant solution to this problem in the form of the circuit breaker. If units of work are frequently being abandoned due to a breached timeout policy, I would imagine the correct response would frequently be to open the circuit?
I see thread exhaustion as a very real concern when abandoning threads. Any additional guidance from Failsafe on navigating the issue would be a plus in my eyes, whether it is a word-of-caution in the documentation, metrics in the code, etc.
But to be clear, you still think it's a feature worth having?
I would imagine the correct response would frequently be to open the circuit?
Sure, that could work, maybe something like this:
Timeout<Object> timeout = Timeout.of(Duration.ofMillis(100));
CircuitBreaker<Object> cb = new CircuitBreaker<>()
  .handleIf(failure -> failure instanceof TimeoutExceededException
    && ((TimeoutExceededException) failure).isAbandon()) // isAbandon() would be part of the proposed API
  .withFailureThreshold(3);
Failsafe.with(cb, timeout).getAsync(this::foo);
This would open the circuit after the last 3 failures were abandoned. But that doesn't necessarily tell us how many abandoned threads there were in total, or how many might still be running. We could probably use a metric for that. For example:
Timeout<Object> timeout = Timeout.of(Duration.ofMillis(100));
CircuitBreaker<Object> cb = new CircuitBreaker<>()
  .handleIf(f -> timeout.getCurrentAbandonedCount() >= 5); // getCurrentAbandonedCount() would be a new metric
Failsafe.with(cb, timeout).getAsync(this::foo);
This would open the breaker if the "currentAbandonedCount" after an execution reaches 5. Depending on what our goal is here, such as waiting for the "currentAbandonedCount" to come back down a bit, this might be a bit of an awkward way to accomplish it.
Another option that might be more natural would be to just use a fixed size threadpool/ExecutorService, where enough abandoned threads means eventually the pool is fully utilized and execution attempts just block while waiting on a thread.
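For example, a minimal sketch, assuming the proposed abandon option existed on Timeout, where the pool size caps how many abandoned threads can pile up:
ExecutorService executor = Executors.newFixedThreadPool(20); // at most 20 threads, abandoned or not
Timeout<Object> timeout = Timeout.of(Duration.ofMillis(100)).withInterrupt(true); // plus the proposed withAbandon(true)
Failsafe.with(timeout).with(executor).getAsync(this::foo);
Once all 20 threads are tied up by abandoned executions, new attempts queue rather than spawning more threads.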
Not a Contribution.
Yep, something along these lines would work for me. In practice, even if Timeout does not expose anything about abandoned threads, I can always make it work by doing all the actual work on a separate thread pool where I control its size and other metrics. If Timeout exposes the number of executions that were abandoned (since JVM startup or at the current moment in time), or any other similar stat, I'll make it work for me.
Let me try to be more specific about the desired behaviour of Failsafe, given a very specific usage scenario:
Timeout<Object> timeout = Timeout.of(Duration.ofMillis(100)).abandonAfter(Duration.ofMillis(10)).withInterrupt(true);
Object data = Failsafe.with(timeout).getAsync(ctx -> loadDataViaRpc(ctx.getAttemptTimeoutDuration() * 0.9)).join(); // sketch: pass ~90% of the attempt's timeout to the RPC
- loadDataViaRpc(100ms) is called, and it tries its best to finish within, say, 90ms (to leave room for some potential post-processing overhead). One of the expected results is RpcCallTimeoutException if a call cannot be finished within 90ms. Using this exception as the "cause" of the exception that fails the end-user request will provide all the necessary context (e.g. server name).
- If loadDataViaRpc() misbehaves and has not finished within 100ms, Failsafe interrupts that future/thread. One of the expected causes of this is some post-processing task (e.g. serialization, metrics reporting) unexpectedly taking 200ms.
- If loadDataViaRpc() can correctly handle an interrupt, it will quickly throw some kind of AppTimeOutException(InterruptedException). Using this exception to fail a user request will give some context on what happened, i.e. some particular piece of code took too long to execute and has not finished within the 10ms given to process the RPC response. We will not have all the details of RpcCallTimeoutException, but it's better than nothing.
- Should loadDataViaRpc() misbehave so badly that it does not even handle an interrupt, Failsafe gives up on it (abandons it) after 10ms and returns failsafe.TimeoutExceededException to the caller. This exception does not even know which piece of application code failed to respond to an interrupt, so it will be the least useful one when logged. Still, it allows returning some response to the end user after only 110ms spent within this loading code.
In short, there should be some way to distinguish between these scenarios, down to the worst case where only a bare TimeoutExceededException is returned to the end user. In practice, the last case almost never happens in real life, but we should still be able to detect it and handle it, so these issues can be reported and fixed.
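From the caller's side, a hypothetical sketch of telling these cases apart (RpcCallTimeoutException and AppTimeOutException are our application's exceptions, not Failsafe's):
try {
  Object data = Failsafe.with(timeout).getAsync(ctx -> loadDataViaRpc(Duration.ofMillis(90))).join();
} catch (CompletionException e) {
  Throwable cause = e.getCause();
  if (cause instanceof RpcCallTimeoutException) {
    // case 1: the RPC client timed itself out within 90ms; full context (e.g. server name)
  } else if (cause instanceof AppTimeOutException) {
    // case 3: Failsafe interrupted us and our code handled it; partial context
  } else if (cause instanceof TimeoutExceededException) {
    // case 4: the interrupt was ignored and the execution abandoned; no application context
  }
}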
Not a contribution.
I moved Support staggered retries (#291) and Make some internal APIs more open (#292) out into separate issues.
@timothybasanov Thanks for describing your desired scenario. I'm curious, can you tell me more about how your code handles a cancellation or interruption?
At first glance this scenario seems a natural fit for a Fallback. But it begs the question of whether the information contained in an RpcCallTimeoutException or AppTimeOutException is accessible only from within the Runnable/Supplier being executed, or if it's possible to provide that info to a Fallback. Can you share more detail here?
One idea (that still needs some more thought) is that the Runnable/Supplier could store information about the call (request id, server name, etc) inside the ExecutionContext, which could then be available to outer policies like Fallbacks, even if the execution ends up being cancelled. Ex:
Fallback<Object> fallback = Fallback.ofException(e -> {
  // e.get(...) would be the proposed context-attribute lookup described above
  return new RpcCallTimeoutException(e.get("serverName"), e.get("requestId"));
}).handle(TimeoutExceededException.class);
Timeout<Object> timeout = Timeout.of(Duration.ofMillis(100));
Failsafe.with(fallback, timeout).getAsync(ctx -> {
  String serverName = config.getServerName();
  String requestId = createRequestId();
  ctx.set("serverName", serverName);
  ctx.set("requestId", requestId);
  return makeRequest(serverName, requestId);
});
If a Fallback can't be used, then we would need some new capability within Timeout to provide a grace period for an execution to naturally complete after a timeout/interruption, before a TimeoutExceededException is recorded. Is this what you're thinking?
Not a Contribution.
Cancellation examples:
Fallbacks can't really be used. Some example details that only the underlying RPC framework may know during an execution: "Tried to resolve a hostname that we got dynamically", "RPC context contained this metadata set in an interceptor". In practice, in the place where I configure a Failsafe retry, I only have a lambda to call, so I don't have access to any context.
Yep, that's exactly what I'm thinking!
Currently, Failsafe's Timeout policy will wait until a user-provided supplier has completed before completing. This means that the Future that Failsafe returns to a user can't be completed, even if a timeout is triggered, until the user's supplier completes. This may not be desirable behavior.
That said, if a Timeout can return a completed promise before the user-provided supplier is complete, that means an outer RetryPolicy could trigger additional executions of the user-provided supplier while previous executions are still running, which in turn may time out and pile up. I believe this is to be expected for some use cases, but it's an important behavioral change to consider.
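For example, in a composition like this sketch (withAbandon(true) being the proposed option), each retry could start while the previous attempt, already abandoned by the inner timeout, is still running on its thread:
RetryPolicy<Object> retryPolicy = new RetryPolicy<>().withMaxRetries(3);
Timeout<Object> timeout = Timeout.of(Duration.ofMillis(100)); // with the proposed withAbandon(true)
Failsafe.with(retryPolicy, timeout).getAsync(this::slowSupplier); // slowSupplier: any supplier that may hang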
Do we want Timeout to always wait for a supplier to complete before allowing outer policies to proceed, such as with retries? Should abandoning an execution at least be an option (timeout.withAbandon())? Thoughts welcome.