apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Is dataFiles() Method Retryable? #10750

Open osy497 opened 2 months ago

osy497 commented 2 months ago

Query engine

JAVA API

Question

We are trying to store our data in an Iceberg table using Iceberg 1.5.2.

We use the REST catalog, S3FileIO, and Parquet as the data format. The code that flushes the writer follows this logic:

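// Start a new append operation on the table.
AppendFiles appendFiles = table.newAppend();
// Close the writer and collect the data files it produced.
DataFile[] dataFiles = writer.dataFiles();
for (var dataFile : dataFiles) {
    appendFiles.appendFile(dataFile);
}
// Commit all files in a single snapshot.
appendFiles.commit();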

The flush code above works fine in most cases, but the dataFiles() call sometimes fails with an exception, apparently due to a timeout or a similar problem.

When this happens, we currently write all of the data into the writer again and flush it again, which I think is a huge overhead.

To avoid this, we would like to add retry logic around dataFiles(), provided that the dataFiles() method is retryable.

For example, if inside dataFiles() part of the data in the writer buffer is written successfully and part fails, will retrying cause a problem?

Your answer would be appreciated.

nk1506 commented 2 months ago

Hi @osy497, this question is quite vague. Could you please provide a stack trace? If the issue is in dataFiles() and involves an IOException, it might be failing while closing the stream.

osy497 commented 2 months ago

@nk1506 I got exceptions like these:

Caused by: software.amazon.awssdk.services.s3.model.S3Exception: Remote backend is unreachable (ConcurrentModification: concurrent modification) (Service: S3, Status Code: 400, Request ID: 17E4744B40A060BB)
  at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleErrorResponse(AwsXmlPredicatedResponseHandler.java:156) ~[test-app.jar:?]
  at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleResponse(AwsXmlPredicatedResponseHandler.java:108) ~[test-app.jar:?]
  at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:85) ~[test-app.jar:?]
  at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:43) ~[test-app.jar:?]
  at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler$Crc32ValidationResponseHandler.handle(AwsSyncClientHandler.java:93) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.handler.BaseClientHandler.lambda$successTransformationResponseHandler$7(BaseClientHandler.java:279) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:50) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:38) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:72) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:55) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:39) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182) ~[test-app.jar:?]
  at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74) ~[test-app.jar:?]
  at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45) ~[test-app.jar:?]
  at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53) ~[test-app.jar:?]
  at software.amazon.awssdk.services.s3.DefaultS3Client.putObject(DefaultS3Client.java:10191) ~[test-app.jar:?]
  at org.apache.iceberg.aws.s3.S3OutputStream.completeUploads(S3OutputStream.java:438) ~[test-app.jar:?]
  at org.apache.iceberg.aws.s3.S3OutputStream.close(S3OutputStream.java:265) ~[test-app.jar:?]
  at org.apache.parquet.io.DelegatingPositionOutputStream.close(DelegatingPositionOutputStream.java:38) ~[test-app.jar:?]
  at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:1204) ~[test-app.jar:?]
  at org.apache.iceberg.parquet.ParquetWriter.close(ParquetWriter.java:257) ~[test-app.jar:?]
  at org.apache.iceberg.io.DataWriter.close(DataWriter.java:82) ~[test-app.jar:?]
  at org.apache.iceberg.io.BaseTaskWriter$BaseRollingWriter.closeCurrent(BaseTaskWriter.java:314) ~[test-app.jar:?]
  at org.apache.iceberg.io.BaseTaskWriter$BaseRollingWriter.close(BaseTaskWriter.java:341) ~[test-app.jar:?]
  at org.apache.iceberg.io.PartitionedFanoutWriter.close(PartitionedFanoutWriter.java:70) ~[test-app.jar:?]
  at org.apache.iceberg.io.BaseTaskWriter.complete(BaseTaskWriter.java:96) ~[test-app.jar:?]
  at org.apache.iceberg.io.TaskWriter.dataFiles(TaskWriter.java:50) ~[test-app.jar:?]
  ...
  Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 1 failure: Unable to execute HTTP request: Read timed out

or

Caused by: software.amazon.awssdk.services.s3.model.S3Exception: null (Service: S3, Status Code: 400, Request ID: null)
    at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleErrorResponse(AwsXmlPredicatedResponseHandler.java:156) ~[test-app.jar:?]
    at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleResponse(AwsXmlPredicatedResponseHandler.java:108) ~[test-app.jar:?]
    at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:85) ~[test-app.jar:?]
    at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:43) ~[test-app.jar:?]
    at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler$Crc32ValidationResponseHandler.handle(AwsSyncClientHandler.java:93) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.handler.BaseClientHandler.lambda$successTransformationResponseHandler$7(BaseClientHandler.java:279) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:50) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:38) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:72) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:55) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:39) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182) ~[test-app.jar:?]
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74) ~[test-app.jar:?]
    at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45) ~[test-app.jar:?]
    at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53) ~[test-app.jar:?]
    at software.amazon.awssdk.services.s3.DefaultS3Client.putObject(DefaultS3Client.java:10191) ~[test-app.jar:?]
    at org.apache.iceberg.aws.s3.S3OutputStream.completeUploads(S3OutputStream.java:438) ~[test-app.jar:?]
    at org.apache.iceberg.aws.s3.S3OutputStream.close(S3OutputStream.java:265) ~[test-app.jar:?]
    at org.apache.parquet.io.DelegatingPositionOutputStream.close(DelegatingPositionOutputStream.java:38) ~[test-app.jar:?]
    at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:1204) ~[test-app.jar:?]
    at org.apache.iceberg.parquet.ParquetWriter.close(ParquetWriter.java:257) ~[test-app.jar:?]
    at org.apache.iceberg.io.DataWriter.close(DataWriter.java:82) ~[test-app.jar:?]
    at org.apache.iceberg.io.BaseTaskWriter$BaseRollingWriter.closeCurrent(BaseTaskWriter.java:314) ~[test-app.jar:?]
    at org.apache.iceberg.io.BaseTaskWriter$BaseRollingWriter.close(BaseTaskWriter.java:341) ~[test-app.jar:?]
    at org.apache.iceberg.io.PartitionedFanoutWriter.close(PartitionedFanoutWriter.java:70) ~[test-app.jar:?]
    at org.apache.iceberg.io.BaseTaskWriter.complete(BaseTaskWriter.java:96) ~[test-app.jar:?]
    at org.apache.iceberg.io.TaskWriter.dataFiles(TaskWriter.java:50) ~[test-app.jar:?]
    ...

nk1506 commented 2 months ago

Hi @osy497, according to the stack trace, the error is of type 400 (BAD_REQUEST). I don't think the errors above are retryable.

osy497 commented 2 months ago

@nk1506

Could you elaborate on what happens if I retry dataFiles() when the above exception is thrown? (Additionally, I am using MinIO as the S3 proxy.)

nk1506 commented 2 months ago

@osy497, as I understand from the description, you currently retry by rewriting everything. This error occurs after the writer has completed its operation, when the S3 client is unable to upload the file. If the error is related to a connection timeout or something similar, a retry should help. But here it seems to be throwing BAD_REQUEST. By any chance, did you check with the community on Slack?
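
The distinction drawn above (timeouts may be worth retrying, a 400 probably is not) can be sketched as a small retry helper. This is only an illustration under stated assumptions: `RetrySketch`, `StoreException`, `isTransient`, and `callWithRetry` are hypothetical names, not Iceberg or AWS SDK APIs, and treating only timeouts and 5xx responses as transient is an assumed policy. It also does not answer whether calling dataFiles() a second time on the same writer is safe; the helper only decides whether a failed action should be attempted again.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

public class RetrySketch {

    /** Hypothetical stand-in for a storage error that carries an HTTP status code. */
    public static class StoreException extends RuntimeException {
        public final int statusCode;

        public StoreException(String message, int statusCode) {
            super(message);
            this.statusCode = statusCode;
        }
    }

    /**
     * Assumed policy: read timeouts and 5xx responses are transient,
     * while 4xx responses such as 400 (BAD_REQUEST) are not.
     */
    public static boolean isTransient(Throwable t) {
        if (t instanceof SocketTimeoutException) {
            return true;
        }
        if (t instanceof StoreException) {
            return ((StoreException) t).statusCode >= 500;
        }
        return false;
    }

    /**
     * Runs the action up to maxAttempts times, doubling the wait between attempts.
     * Non-transient failures and the final failed attempt are rethrown unchanged.
     */
    public static <T> T callWithRetry(Callable<T> action, int maxAttempts, long initialBackoffMillis)
            throws Exception {
        long backoff = initialBackoffMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts || !isTransient(e)) {
                    throw e;
                }
                Thread.sleep(backoff);
                backoff *= 2;
            }
        }
    }
}
```

With a policy like this, a read timeout would be retried a bounded number of times, while the 400 responses shown in the stack traces would surface immediately instead of being retried in a loop.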

osy497 commented 2 months ago

@nk1506 Most cases seem to be timeout problems, but I'm not sure about that. I will ask about this in the Slack channel later. Thanks for your explanation :)

steveloughran commented 2 months ago

Is this an AWS s3 store? I don't see the extended request IDs in the stack trace you get from there...