esbtools / event-handler

Notification in, document out.
GNU General Public License v3.0
3 stars 6 forks source link

Potential race condition in lazy futures #83

Open alechenninger opened 7 years ago

alechenninger commented 7 years ago

I think this can happen?

  1. Future A calls get, see it is not completed, calls completer, starts waiting
  2. Future B, in same batch, calls get, see it is not completed, calls completer, this future is no longer in the batch, so it completes before future A's call to completer
  3. Future B sees it is not completed but it called its completer. Kaboom!

Seems like an obvious miss, so perhaps this is not right. Will look later.

alechenninger commented 7 years ago

@dcrissman @snehagunta

So this has to do with the timeout handling we recently added. Basically, currently we don't share a bulk requester between multiple threads, and wait for all futures individually. However, when one times out (a new feature), we move on to the next future, which will be in the same batch, but at that point the completer is running in another thread, so the next future's .get will be called and won't be completed, triggering the bug.

The fix is complicated. A workaround is to set the timeout really high for messages. I'd like to talk about fixing for real, but I will be out next week.

alechenninger commented 7 years ago

@snehagunta We should also look at why those messages are timing out

dcrissman commented 7 years ago

Seen exception:

2017-04-26 14:26:36,685 [Camel (camel-1) thread #0 - SjmsBatchConsumer] ERROR [org.esbtools.eventhandler.AsyncBatchMessageProcessorRoute] Failed to process message: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Future attempted to lazily trigger completion, but completer did not actually complete the future. Check the provided completer function for correctness.
    at org.esbtools.eventhandler.lightblue.client.BulkLightblueRequester$LazyTransformableFuture.get(BulkLightblueRequester.java:408)
    at org.esbtools.eventhandler.lightblue.client.BulkLightblueRequester$LazyTransformableFuture.get(BulkLightblueRequester.java:383)
    at org.esbtools.eventhandler.lightblue.client.BulkLightblueRequester$LazyRequestTransformableFuture.get(BulkLightblueRequester.java:548)
    at org.esbtools.eventhandler.NestedTransformableFutureIgnoringReturn.get(NestedTransformableFutureIgnoringReturn.java:64)
    at org.esbtools.eventhandler.NestedTransformableFutureIgnoringReturn.get(NestedTransformableFutureIgnoringReturn.java:25)
    at org.esbtools.eventhandler.AsyncBatchMessageProcessorRoute.lambda$configure$0(AsyncBatchMessageProcessorRoute.java:123)
    at org.apache.camel.processor.DelegateSyncProcessor.process(DelegateSyncProcessor.java:63)
    at org.apache.camel.management.InstrumentationProcessor.process(InstrumentationProcessor.java:77)
    at org.apache.camel.processor.RedeliveryErrorHandler.process(RedeliveryErrorHandler.java:468)
    at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:190)
    at org.apache.camel.processor.Pipeline.process(Pipeline.java:121)
    at org.apache.camel.processor.Pipeline.process(Pipeline.java:83)
    at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:190)
    at org.apache.camel.util.AsyncProcessorHelper.process(AsyncProcessorHelper.java:109)
    at org.apache.camel.processor.DelegateAsyncProcessor.process(DelegateAsyncProcessor.java:87)
    at org.apache.camel.component.sjms.batch.SjmsBatchConsumer$BatchConsumptionLoop.processBatch(SjmsBatchConsumer.java:427)
    at org.apache.camel.component.sjms.batch.SjmsBatchConsumer$BatchConsumptionLoop.access$1900(SjmsBatchConsumer.java:215)
    at org.apache.camel.component.sjms.batch.SjmsBatchConsumer$BatchConsumptionLoop$BatchConsumptionTask.completionBatch(SjmsBatchConsumer.java:367)
    at org.apache.camel.component.sjms.batch.SjmsBatchConsumer$BatchConsumptionLoop$BatchConsumptionTask.consumeBatchesOnLoop(SjmsBatchConsumer.java:303)
    at org.apache.camel.component.sjms.batch.SjmsBatchConsumer$BatchConsumptionLoop$BatchConsumptionTask.access$400(SjmsBatchConsumer.java:269)
    at org.apache.camel.component.sjms.batch.SjmsBatchConsumer$BatchConsumptionLoop.run(SjmsBatchConsumer.java:236)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: Future attempted to lazily trigger completion, but completer did not actually complete the future. Check the provided completer function for correctness.
    ... 24 more
alechenninger commented 7 years ago

I think to fix we need to:

  1. Track what requests are being processed. This must be an atomic move from queued to in progress.
  2. Completer must start its requests (this happens already), but it should also wait for its requests to be completed if they are already in progress by another thread (doesn't happen). One option is to pass requests to doQueuedRequestsAndCompleteFutures and have it wait. The "waiting" part is also nontrivial: you will probably need a CountDownLatch associated with a batch of in progress requests.

This means however, that a timed-out future may fail everything in that batch, since the thread making the requests is interrupted, therefore throws an exception, therefore fails all of the futures in that batch. However, we have retries and idempotent messages for this reason already.

Food for thought: hystrix already has request batching built in. We could consider instead making a HystrixBulkLightblueRequester that made each request using hystrix, and let hystrix handle the thread pooling and batching, as long as we made commands batchable. This would probably be more maintainable, safer, and easier to implement. I would strongly consider that route.