apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0
13.49k stars 3.7k forks source link

KinesisRecordSupplier doesn't gracefully handle transient AmazonServiceException #8615

Closed cyuyang closed 4 years ago

cyuyang commented 5 years ago

Affected Version

0.16.0

Description

One of our kinesis indexing tasks halted until the task exited after the preset duration. It caused one of the shards lagging behind. Interestingly, the task reported SUCCESS status after exiting. After digging into the log and the kinesis indexer code, we suspect that the KinesisRecordSupplier doesn't handle some transient exceptions gracefully.

Related logs:

2019-10-01T08:59:04,412 WARN [KinesisRecordSupplier-Worker-0] org.apache.druid.indexing.kinesis.KinesisRecordSupplier - encounted unknown unrecoverable AWS exception, will not retry
com.amazonaws.services.kinesis.model.AmazonKinesisException: null (Service: AmazonKinesis; Status Code: 503; Error Code: null; Request ID: null)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1638) ~[aws-java-sdk-core-1.11.199.jar:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1303) ~[aws-java-sdk-core-1.11.199.jar:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1055) ~[aws-java-sdk-core-1.11.199.jar:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743) ~[aws-java-sdk-core-1.11.199.jar:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717) ~[aws-java-sdk-core-1.11.199.jar:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699) ~[aws-java-sdk-core-1.11.199.jar:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667) ~[aws-java-sdk-core-1.11.199.jar:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649) ~[aws-java-sdk-core-1.11.199.jar:?]
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513) ~[aws-java-sdk-core-1.11.199.jar:?]
    at com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2219) ~[?:?]
    at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2195) ~[?:?]
    at com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:1004) ~[?:?]
    at com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:980) ~[?:?]
    at org.apache.druid.indexing.kinesis.KinesisRecordSupplier$PartitionResource.lambda$getRecordRunnable$0(KinesisRecordSupplier.java:191) ~[?:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_222]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_222]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_222]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_222]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_222]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_222]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]

Related code:

catch (AmazonServiceException e) {
          if (isServiceExceptionRecoverable(e)) {
            log.warn(e, "encounted unknown recoverable AWS exception, retrying in [%,dms]", EXCEPTION_RETRY_DELAY_MS);
            rescheduleRunnable(EXCEPTION_RETRY_DELAY_MS);
          } else {
            log.warn(e, "encounted unknown unrecoverable AWS exception, will not retry");
            throw new RuntimeException(e);
          }
        }

On transient AmazonServiceException (503 as shown in the log), the thread executing the polling will be killed and it will not be rescheduled on the ExecutorService. No new record will be put on the BlockingQueue. The failure is not reported anywhere and didn't cause the kinesis index task to fail.

stale[bot] commented 4 years ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.