failsafe-lib / failsafe

Fault tolerance and resilience patterns for the JVM
https://failsafe.dev
Apache License 2.0
4.17k stars 297 forks source link

Timeout never triggers when application code ignores an interrupt #263

Open timothybasanov opened 4 years ago

timothybasanov commented 4 years ago

Under certain conditions Timeout policy may never trigger:

  1. Create a simple FailsafeExecutor with new Timeout(Durations.ofSeconds(1)).withCancel(true)
  2. Schedule a task while true();
  3. Try future.get()

Expected: com.apple.its.uts.common.retry.TimeoutExceededException after a second

Actual: Hangs.

Notes: User of the library expects timeouts to work even when some tasks are ignoring interrupts. Some example would be: application code notices the interrupt, but takes some time to wind down. If that "winding down" takes too long it breaks an assumption on the calling code that timeout always works.

Tembrel commented 4 years ago

In general, Failsafe users should not expect timeouts to stop the execution of tasks. If you really want to ensure that tasks stop when they time out, you must carefully craft them to be cancellable.

Quick summary of possible usage modes for Timeout depending on the tasks that will be executed under the policy and whether they will be executed asynchronously (Note: not talking about async execution contexts here, just getAsync vs get):

  1. If tasks are known to be responsive to cancellation via interruption, use withCancel(true)
  2. Otherwise, if tasks are known to be responsive to cancellation through polling Failsafe execution context, use withCancel(false)
  3. Otherwise, if tasks are executed asynchronously and are not known to support cancellation, either don't use withCancel at all or use withCancel(false) to prevent tasks that have been delayed past the timeout duration from starting in the first place (this would be a consideration with bounded schedulers). Either way, don't rely on the task being stopped once it has started.
  4. Otherwise (tasks are executed synchronously and are not known to support cancellation), don't use withCancel and, more importantly, only use Timeout in the first place if you are certain that tasks will complete on their own in reasonable time.

In the last two cases, you might, for example, only need to use the fact of the timeout within some policy chain and not rely on being able to stop the execution.

I come down strongly against using withCancel(true) except when the task is known to be responsive to cancellation via interruption. (The example given above in this issue — while (true); — is not such a task.) The reason is that in general an unexpected interruption in code that was not designed for it can have unpredictable behavior. You can relax this if you're confident that none of the tasks you execute under the timeout policy will have pathological behavior due to interruption.

In Java Concurrency in Practice, we provided some guidelines for writing code that keep the option of cancellability via interruption open, chief of which is to handle InterruptedException either by propagating it (after local cleanup, if necessary) or by catching it, setting the interrupt status, and returning as quickly as possible (again, after local cleanup). In the years since we wrote the book I have seen that no matter how carefully you follow these rules, if a task calls out to libraries that you do not control, in particular those that do I/O, it can be extremely difficult to ensure that it will behave properly in the presence of interruption. For this reason, unless the task is fairly simple, with obvious spots to check interrupt status and/or blocking calls that throw InterruptedException, I think the smart move is to use the Failsafe execution context where possible.

If you're uncomfortable having Failsafe execution contexts "leak" into your task code, consider hiding them behind a simple cancellation interface:

@FunctionalInterface
public interface CancellationContext {
    boolean isCancelled();
}

...
Result myTask(CancellationContext cc) {
    ...
    if (cc.isCancelled()) ... exit quickly ...
    ...
}
...
    Failsafe.with(timeout).getAsync(ctx -> myTask(ctx::isCancelled));

See my comments on a related Failsafe issues here and here for more background.

timothybasanov commented 4 years ago

This issue is not about task cancellation, it's about Failsafe Timeout behaviour. When I do future.get() on the Future returned from Failsafe I expect that it would finish after a Timeout I have defined. No matter type of a task I have scheduled. It was the shortest example that I could have thought of that has non-cancellable tasks. I agree with everything you said and I argue Failsafe should not depend on whether task is cancelable or not to make timeouts work.

This is not a contribution

Tembrel commented 4 years ago

I think you're asking too much of Timeout.

Timeout can't prevent a task (in general) from taking up an arbitrary amount of time, and tasks that take a lot of time should not be allowed to continue to silently use up a thread by proceeding with TimeoutExceededException as if things were fine. Hanging in this case seems like the best option. It's what would happen if you weren't using Failsafe.

timothybasanov commented 4 years ago

I see. These arguments make sense. I also just realized that this policy is applied consistently, i.e. when execution is "async" timeout is just completely ignored as it has no control over the async execution.

In my case I was migrating from get(3, SECONDS) to Failsafe's timeouts and getAsync so that timeouts and retry policies were configured in a single place. For me it was a significant change in behavior. Some of my tasks are CPU intensive, may take some time, and not under my control, but I still need to have timeouts for them to at least respond to the user. Overflowing pools are of a separate concern for me as I have metrics around them.

I still claim that this is a very surprising behaviour for the users and it should be documented on a timeout policy that it does not actually trigger a timeout. All engineers I talked to were deeply surprised by the current behaviour. May be at least give an option to the user to make it really timeout?

And as it is right now when something fails I do not get any logs or metrics (whenComplete is never called). I'll try to write my own timeout policy that does what I expect it to do. I dread this in advance as lots of methods are package-private in Failsafe, so there would be a lot of reflection involved (#257).

This is not a contribution.

jhalterman commented 3 years ago

When I do future.get() on the Future returned from Failsafe I expect that it would finish after a Timeout I have defined.

231 describes the issue at play here, where we attempt to interrupt an execution but don't want to abandon it if it refuses to complete, at least not by default. I just put up a PR for that in #284 if you care to have a look.