Closed timothybasanov closed 3 years ago
Indeed, Failsafe uses the Scheduler/ThreadPool you supply via .with(Scheduler) for all async executions, including scheduled timeouts. In this case, if the thread pool is full, then not only do executions block but so do Timeout tasks. You could try to ensure that the thread pool is never fully utilized, but that may be a bit of a hack depending on your use case.
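To make the failure mode concrete, here is a minimal sketch that uses only the JDK, not Failsafe itself. It assumes a single-threaded scheduled pool standing in for the user-supplied scheduler; the class and method names are made up for illustration. A "timeout" task scheduled for 50 ms cannot run until the one busy worker is free again.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; it does not use or mirror any Failsafe API.
final class SharedPoolStarvationDemo {

    /** Returns how many milliseconds passed before a 50 ms "timeout" task actually ran. */
    static long observedTimeoutDelayMillis(long busyMillis) {
        // One worker thread shared by application logic and the timeout task.
        ScheduledExecutorService shared = Executors.newScheduledThreadPool(1);
        try {
            CountDownLatch started = new CountDownLatch(1);
            // "Application logic" that occupies the only worker for busyMillis.
            shared.execute(() -> {
                started.countDown();
                sleepQuietly(busyMillis);
            });
            started.await();

            long start = System.nanoTime();
            CountDownLatch fired = new CountDownLatch(1);
            // The "timeout" is due after 50 ms, but no thread is free to run it.
            shared.schedule(fired::countDown, 50, TimeUnit.MILLISECONDS);
            fired.await();
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            shared.shutdownNow();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Calling `SharedPoolStarvationDemo.observedTimeoutDelayMillis(300)` should report a delay close to the 300 ms busy period rather than the requested 50 ms.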
The simplest solution would be to allow policies such as Timeout to have a separately configured Scheduler if desired:
Timeout.of(Duration.ofSeconds(1)).withCancel(true).withScheduler(threadpool);
Thoughts?
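As a rough illustration of why a separately configured scheduler would help, the sketch below uses plain JDK executors rather than the proposed withScheduler method, which does not exist at this point in the thread. The class name, pool sizes, and durations are assumptions. The timeout runs on its own single-threaded scheduler, so it can fail the result and cancel the work even though the application pool is fully occupied.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; it only imitates the idea of a dedicated timeout scheduler.
final class DedicatedTimeoutSchedulerDemo {

    /** Returns true if the 100 ms timeout fired while the application pool was busy. */
    static boolean timesOutWhileAppPoolIsBusy() {
        ExecutorService appPool = Executors.newFixedThreadPool(1);
        ScheduledExecutorService timeoutScheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            CompletableFuture<String> result = new CompletableFuture<>();
            // Slow application logic that occupies the whole application pool.
            Future<?> work = appPool.submit(() -> {
                try {
                    Thread.sleep(5_000);
                    result.complete("done");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // The timeout lives on its own scheduler, so it is not queued behind the work.
            timeoutScheduler.schedule(() -> {
                if (result.completeExceptionally(new TimeoutException("timed out after 100 ms"))) {
                    work.cancel(true); // interrupt the running task, like withCancel(true)
                }
            }, 100, TimeUnit.MILLISECONDS);

            try {
                result.get(2, TimeUnit.SECONDS);
                return false; // completed normally, which is not expected here
            } catch (ExecutionException e) {
                return e.getCause() instanceof TimeoutException;
            } catch (TimeoutException e) {
                return false; // the dedicated timeout never fired
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        } finally {
            appPool.shutdownNow();
            timeoutScheduler.shutdownNow();
        }
    }
}
```

The trade-off is one extra thread for timeouts, in exchange for timeouts that do not depend on the application pool having a free worker.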
That would work, and it would resolve the most immediate need today: making timeouts work even when the app is slow.
This does feel like a bit of a hack. I'm not sure whether the retry/fallback policies also need a fix; they schedule things too. It would require a bit of a hack on my side, as I'm using my own policies that allow delay/retry/thresholds to be changed at runtime. But it's a workable solution for sure. A "proper" fix may take some time, as it may change how threading works within Failsafe, and it may even require fully defining Failsafe's behavior in the event of resource constraints.
I think the ideal solution would be to allow all Failsafe logic to be moved onto a separate thread pool. My application logic is slow and heavy, so I want a big thread pool. Failsafe logic is fast and non-blocking, so it could just as well run on the common fork-join pool. This way, no matter what kind of application logic I throw at it, it would never break retries, timeouts, or any other Failsafe policies.
Ideally I'd want to execute even policy listeners on the application thread pool. This may be hard to implement given that all listeners currently run fully synchronously. To make things worse, this desire clashes with the fact that I want to be able to execute policy listeners even if my main thread pool is busy. An example of such a listener would be one that adds debug information to the response when something starts to fail (before ultimately sending it back via a fallback policy), even when the main thread pool is completely stuck in a busy wait and ignoring cancellations.
One more related note on thread pools is that ideally Failsafe should create all callables that would execute application logic on threads from the application logic thread pool (even if the actual scheduling may not happen later on). This makes it much easier to make tracing/context passing work, as I don't have to think about Failsafe internal implementation at all.
I think the cheapest way with full backwards compatibility is to always schedule application logic via schedule(callable, 0) and schedule all Failsafe logic via schedule(callable, !=0). It's very close to what Failsafe already does with a few exceptions.
Going further, the Scheduler interface could even be changed to have two or three methods: Future scheduleApplicationLogic(callable), ScheduledFuture scheduleFailsafeCode(callable, delay), Future scheduleFailsafeCode(callable). It is convenient to have a separate method for 0-delay scheduling, as then I can return a Future instead. A non-scheduled future is much easier to create if I already have complicated custom thread pools that cannot support scheduling.
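A sketch of what such a split interface might look like is below. This is not Failsafe's actual Scheduler interface: SplitScheduler and TwoPoolScheduler are hypothetical names, the method names are taken from the suggestion above, and the TimeUnit parameter and the thread names are assumptions added for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical interface following the three methods suggested in the comment.
interface SplitScheduler {
    <T> Future<T> scheduleApplicationLogic(Callable<T> callable);

    <T> ScheduledFuture<T> scheduleFailsafeCode(Callable<T> callable, long delay, TimeUnit unit);

    <T> Future<T> scheduleFailsafeCode(Callable<T> callable);
}

// Routes application logic and internal (policy) logic to different pools.
final class TwoPoolScheduler implements SplitScheduler {
    private final ExecutorService appPool;
    private final ScheduledExecutorService internalPool;

    TwoPoolScheduler(ExecutorService appPool, ScheduledExecutorService internalPool) {
        this.appPool = appPool;
        this.internalPool = internalPool;
    }

    @Override
    public <T> Future<T> scheduleApplicationLogic(Callable<T> callable) {
        // The application pool only needs submit(); it does not have to support delays.
        return appPool.submit(callable);
    }

    @Override
    public <T> ScheduledFuture<T> scheduleFailsafeCode(Callable<T> callable, long delay, TimeUnit unit) {
        return internalPool.schedule(callable, delay, unit);
    }

    @Override
    public <T> Future<T> scheduleFailsafeCode(Callable<T> callable) {
        return internalPool.submit(callable);
    }

    /** Runs one callable through each method and returns the names of the threads used. */
    static String[] demoThreadNames() {
        ExecutorService app = Executors.newFixedThreadPool(2, r -> new Thread(r, "app-worker"));
        ScheduledExecutorService internal =
                Executors.newSingleThreadScheduledExecutor(r -> new Thread(r, "internal-scheduler"));
        try {
            SplitScheduler scheduler = new TwoPoolScheduler(app, internal);
            Callable<String> whoAmI = () -> Thread.currentThread().getName();
            return new String[] {
                scheduler.scheduleApplicationLogic(whoAmI).get(),
                scheduler.scheduleFailsafeCode(whoAmI, 10, TimeUnit.MILLISECONDS).get(),
                scheduler.scheduleFailsafeCode(whoAmI).get()
            };
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            app.shutdownNow();
            internal.shutdownNow();
        }
    }
}
```

With this shape, only the internal pool has to be a ScheduledExecutorService, which matches the point about custom application thread pools that cannot support scheduling.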
Sorry about such an unstructured response. I do not have any definitive answers and even some of my requirements are mutually exclusive. But I hope that some of these ideas would give you some insight into my difficulties.
This is not a contribution.
As with issue 263, I think we should limit our expectations of what the Timeout policy can achieve.
In this issue and in #260, the problem boils down to contention between application code and Failsafe internals for a limited number of available threads. But threads are potentially scarce system resources, and we don't expect Failsafe to handle the depletion of other system resources, like heap memory, gracefully. Using separate thread pools gives the illusion of having independent sandboxes, but ultimately there are limits imposed by the JVM, by other processes on the machine running the JVM, and potentially by other machines running on the same virtualized hardware. This issue and #260 demonstrate how things can break down under thread scarcity, but they aren't direct indictments of Timeout.
That said, I think it would be good to explore @timothybasanov's idea of moving some Failsafe internals, particularly the Timeout internals, to the common FJP. Failing that, I think @jhalterman's idea above, of allowing Timeouts to optionally be configured with a different thread pool, is reasonable.
> I think the cheapest way with full backwards compatibility is to always schedule application logic via schedule(callable, 0) and schedule all Failsafe logic via schedule(callable, !=0).
I'm not sure about this. There's a performance penalty associated with a non-zero delay. More importantly, it could subtly change timing that users were (perhaps unreasonably) relying on. Better to hash the larger issue out carefully, waiting for a major version change, and not risk a lot of complaints from users whose code suddenly stops working.
I encountered multiple different issues regarding timeouts not firing when the main thread pool and/or its queues are full or busy for a long time.
The Timeout aspect of this issue was resolved in e41381b11912d8ec09e49bc3a067f50b5285faf3, where Timeouts now use the internal Scheduler (backed by the common ForkJoinPool, when possible).
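The referenced commit is not reproduced here, so the following is only a JDK-level sketch of the general idea, not Failsafe's implementation. It assumes Java 9 or later for CompletableFuture.delayedExecutor, which by default hands the delayed task to CompletableFuture's default executor (the common ForkJoinPool when it has enough parallelism). The class and method names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; it does not use or mirror Failsafe's internal Scheduler.
final class CommonPoolTimeoutDemo {

    /** Returns true if a 100 ms timeout task ran although the application pool was saturated. */
    static boolean timeoutFiresWhileAppPoolIsSaturated() {
        ExecutorService appPool = Executors.newFixedThreadPool(1);
        try {
            // Saturate the application pool with slow work.
            appPool.execute(() -> {
                try {
                    Thread.sleep(5_000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            CountDownLatch timeoutFired = new CountDownLatch(1);
            // The delayed task is run by the JDK's default async executor, not by appPool.
            Executor afterDelay = CompletableFuture.delayedExecutor(100, TimeUnit.MILLISECONDS);
            afterDelay.execute(timeoutFired::countDown);

            return timeoutFired.await(2, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            appPool.shutdownNow();
        }
    }
}
```

Because the timeout task never touches the application pool, a full or blocked application pool cannot delay it.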
> Not sure if a retry/fallback policies also need a fix or not, they schedule things
Those are a bit different, since actual user-supplied executable logic may be run within a RetryPolicy or Fallback's scheduled thread, whereas a Timeout doesn't do much; it's purely internal. That's a good reason for moving Timeouts to an internal thread and leaving Fallbacks and retries on the user-supplied scheduler, if any.
I believe this is resolved now so I'm closing. Feel free to reopen if not.
I encountered multiple different issues regarding timeouts not firing when the main thread pool and/or its queues are full or busy for a long time.
In most cases the issues cause timeouts to either fire late or never fire at all. This affected real running code as soon as a hanging RPC request was encountered on one of the code paths. It quickly filled up the thread pools and prevented anything else from being executed, grinding the app to a halt.
Here are some reproducible scenarios:
There was no simple way to fix the retry policy behaviour, so I added a custom "async" policy that always deferred execution to a thread pool. It could only work together with the SimpleDelegatingScheduler introduced above in the naive fix. Unfortunately, my lack of understanding of the Failsafe API prevented me from creating an elegant solution. It fixed some, but not all, of the issues. At this point I think somebody with more understanding of the internal Failsafe piping should take over: