apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

javax.net.ssl.SSLException: Connection reset on S3 w/ S3FileIO and Apache HTTP client #10340

Open puchengy opened 4 months ago

puchengy commented 4 months ago

Apache Iceberg version

1.3.1

Query engine

Spark

Please describe the bug 🐞

24/05/15 15:10:31 ERROR [Executor task launch worker for task 34.0 in stage 14.0 (TID 406)] source.BaseReader: Error reading file(s): s3://bucket/.../file.parquete
org.apache.iceberg.exceptions.RuntimeIOException: javax.net.ssl.SSLException: Connection reset
    at org.apache.iceberg.parquet.ParquetReader$FileIterator.advance(ParquetReader.java:153)
    at org.apache.iceberg.parquet.ParquetReader$FileIterator.next(ParquetReader.java:130)
    at org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:65)
    at org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:49)
    at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:129)
    at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119)
    at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63)
    at scala.Option.exists(Option.scala:376)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:412)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1504)
    at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:457)
    at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:358)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1470)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: javax.net.ssl.SSLException: Connection reset
    at sun.security.ssl.Alert.createSSLException(Alert.java:127)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:324)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:267)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:262)
    at sun.security.ssl.SSLTransport.decode(SSLTransport.java:138)
    at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1400)
    at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1368)
    at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73)
    at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:962)
    at software.amazon.awssdk.thirdparty.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
    at software.amazon.awssdk.thirdparty.org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197)
    at software.amazon.awssdk.thirdparty.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176)
    at software.amazon.awssdk.thirdparty.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at software.amazon.awssdk.services.s3.checksums.ChecksumValidatingInputStream.read(ChecksumValidatingInputStream.java:112)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66)
    at org.apache.iceberg.aws.s3.S3InputStream.read(S3InputStream.java:109)
    at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102)
    at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
    at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1850)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:990)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940)
    at org.apache.iceberg.parquet.ParquetReader$FileIterator.advance(ParquetReader.java:151)
    ... 27 more
    Suppressed: java.net.SocketException: Broken pipe (Write failed)
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
        at sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81)
        at sun.security.ssl.TransportContext.fatal(TransportContext.java:355)
        ... 50 more
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:210)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:464)
    at sun.security.ssl.SSLSocketInputRecord.decodeInputRecord(SSLSocketInputRecord.java:237)
    at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:190)
    at sun.security.ssl.SSLTransport.decode(SSLTransport.java:109)
    ... 47 more
linzhou-db commented 1 month ago

Also seeing an SSLException when accessing pre-signed URLs.

Caused by: javax.net.ssl.SSLException: Connection reset
    at sun.security.ssl.Alert.createSSLException(Alert.java:127)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:355)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:298)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:293)
    at sun.security.ssl.SSLTransport.decode(SSLTransport.java:142)
    at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1430)
    at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1395)
    at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73)
    at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:982)
    at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
    at org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197)
    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176)
    at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135)
    at io.delta.sharing.client.RandomAccessHttpInputStream.read(RandomAccessHttpInputStream.scala:128)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102)
    at org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
    at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
    at org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1872)
    at org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:1020)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:969)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:1083)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:134)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:235)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:41)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator$$anon$1.hasNext(RecordReaderIterator.scala:83)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:609)
    ... 40 more
    Suppressed: java.net.SocketException: Broken pipe (Write failed)
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
        at sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81)
        at sun.security.ssl.TransportContext.fatal(TransportContext.java:386)
        ... 66 more
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:210)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:476)
    at sun.security.ssl.SSLSocketInputRecord.readFully(SSLSocketInputRecord.java:459)
    at sun.security.ssl.SSLSocketInputRecord.decodeInputRecord(SSLSocketInputRecord.java:243)
    at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:181)
    at sun.security.ssl.SSLTransport.decode(SSLTransport.java:110)
    ... 63 more

Driver stacktrace:
SandeepSinghGahir commented 1 month ago

Is there any solution to this issue? I'm hitting it while reading Iceberg tables in Glue.

SandeepSinghGahir commented 1 month ago

Hi, this issue has been open for a while now. Do we know when we can expect a fix? Or is there any workaround?

Background: I'm joining multiple Iceberg tables in Glue that have had 3 merges applied to them. Whenever I run a transform joining these tables and write the result to a non-Iceberg Glue table, I get an SSL connection reset exception. Checking the exceptions in the executor logs, I see a BaseReader error while reading delete files or data files.

Error:

24/08/12 04:07:15 ERROR BaseReader: Error reading file(s): s3://some-bucket/iceberg_catalog/iceberg_db.db/d_table/data/0yWGCw/region_id=1/marketplace_id=7/asin_bucket=7044/00598-112719-90dfe711-47dc-43e7-af6c-3c5395c527b6-00024.parquet, s3:// some-bucket/iceberg_catalog/iceberg_db.db/d_table/data/0yWGCw/region_id=1/marketplace_id=7/asin_bucket=7044/01086-113207-90dfe711-47dc-43e7-af6c-3c5395c527b6-00025-deletes.parquet, s3:// some-bucket/iceberg_catalog/iceberg_db.db/d_table/data/0yWGCw/region_id=1/marketplace_id=7/asin_bucket=7044/01086-113214-45a89e31-efe0-4110-bdb3-e467a520b1b3-00025-deletes.parquet
org.apache.iceberg.exceptions.RuntimeIOException: javax.net.ssl.SSLException: Connection reset
 at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:165) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:141) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:136) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at scala.Option.exists(Option.scala:376) ~[scala-library-2.12.15.jar:?]
 at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) ~[?:?]
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]
 at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source) ~[?:?]
 at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:968) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
 at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:183) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.scheduler.Task.run(Task.scala:138) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.0-amzn-1.jar:?]
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.0-amzn-1.jar:?]
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_412]
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_412]
 at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_412]
Caused by: javax.net.ssl.SSLException: Connection reset
 at sun.security.ssl.Alert.createSSLException(Alert.java:127) ~[?:1.8.0_412]
 at sun.security.ssl.TransportContext.fatal(TransportContext.java:331) ~[?:1.8.0_412]
 at sun.security.ssl.TransportContext.fatal(TransportContext.java:274) ~[?:1.8.0_412]
 at sun.security.ssl.TransportContext.fatal(TransportContext.java:269) ~[?:1.8.0_412]
 at sun.security.ssl.SSLTransport.decode(SSLTransport.java:138) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1404) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1372) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:966) ~[?:1.8.0_412]
 at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.shaded.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
 at software.amazon.awssdk.services.s3.internal.checksums.S3ChecksumValidatingInputStream.read(S3ChecksumValidatingInputStream.java:112) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
 at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at software.amazon.awssdk.core.internal.metrics.BytesReadTrackingInputStream.read(BytesReadTrackingInputStream.java:49) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
 at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.s3.S3InputStream.read(S3InputStream.java:109) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1850) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:990) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 ... 27 more
 Suppressed: java.net.SocketException: Broken pipe (Write failed)
 at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.8.0_412]
 at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) ~[?:1.8.0_412]
 at java.net.SocketOutputStream.write(SocketOutputStream.java:155) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81) ~[?:1.8.0_412]
 at sun.security.ssl.TransportContext.fatal(TransportContext.java:362) ~[?:1.8.0_412]
 at sun.security.ssl.TransportContext.fatal(TransportContext.java:274) ~[?:1.8.0_412]
 at sun.security.ssl.TransportContext.fatal(TransportContext.java:269) ~[?:1.8.0_412]
 at sun.security.ssl.SSLTransport.decode(SSLTransport.java:138) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1404) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1372) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:966) ~[?:1.8.0_412]
 at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.shaded.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
 at software.amazon.awssdk.services.s3.internal.checksums.S3ChecksumValidatingInputStream.read(S3ChecksumValidatingInputStream.java:112) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
 at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at software.amazon.awssdk.core.internal.metrics.BytesReadTrackingInputStream.read(BytesReadTrackingInputStream.java:49) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
 at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.s3.S3InputStream.read(S3InputStream.java:109) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1850) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:990) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:141) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:136) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at scala.Option.exists(Option.scala:376) ~[scala-library-2.12.15.jar:?]
 at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) ~[?:?]
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]
 at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source) ~[?:?]
 at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:968) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
 at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:183) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.scheduler.Task.run(Task.scala:138) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.0-amzn-1.jar:?]
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.0-amzn-1.jar:?]
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_412]
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_412]
 at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_412]
Caused by: java.net.SocketException: Connection reset
 at java.net.SocketInputStream.read(SocketInputStream.java:210) ~[?:1.8.0_412]
 at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:464) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketInputRecord.decodeInputRecord(SSLSocketInputRecord.java:237) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:190) ~[?:1.8.0_412]
 at sun.security.ssl.SSLTransport.decode(SSLTransport.java:109) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1404) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1372) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:966) ~[?:1.8.0_412]
 at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.shaded.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
 at software.amazon.awssdk.services.s3.internal.checksums.S3ChecksumValidatingInputStream.read(S3ChecksumValidatingInputStream.java:112) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
 at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at software.amazon.awssdk.core.internal.metrics.BytesReadTrackingInputStream.read(BytesReadTrackingInputStream.java:49) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
 at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.s3.S3InputStream.read(S3InputStream.java:109) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1850) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:990) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 ... 27 more

I have tried using an updated version of Iceberg (1.6.0) as well, but I get the same error.

steveloughran commented 1 month ago

One of those stack traces is from delta-io (the Delta Sharing client), so it has nothing to do with Iceberg.

Both of them are caused by the AWS SDK itself not retrying, or retrying but not enough times for the problem to recover. There's also HTTP connection pooling at play here: there's no point in the library retrying the request if it keeps returning the failed connection to the pool, only for it to be picked up again.

Some suggestions
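(The suggestions aren't itemized above, but later replies in this thread mention raising the AWS SDK's aws.retryMode / aws.maxAttempts settings and trying Hadoop S3A. A rough sketch of that kind of tuning for a Spark job with a Glue catalog and S3FileIO might look like the following; the catalog name and values are illustrative assumptions, not a verified fix.)

```java
import org.apache.spark.sql.SparkSession;

public class RetryTuningSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-retry-tuning")
        // AWS SDK v2 retry settings; S3 reads happen on executors, so the
        // aws.retryMode / aws.maxAttempts system properties go to the executor JVMs.
        .config("spark.executor.extraJavaOptions", "-Daws.retryMode=standard -Daws.maxAttempts=10")
        // Iceberg Glue catalog backed by S3FileIO.
        .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
        // S3FileIO HTTP client options (names per the Iceberg AWS docs; values illustrative).
        .config("spark.sql.catalog.my_catalog.http-client.type", "apache")
        .config("spark.sql.catalog.my_catalog.http-client.apache.max-connections", "200")
        .config("spark.sql.catalog.my_catalog.http-client.apache.socket-timeout-ms", "60000")
        .getOrCreate();
  }
}
```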

SandeepSinghGahir commented 1 month ago

Thanks for the suggestions, I will try them out. However, there is already a pull request open. Also, @danielcweeks mentioned a neat implementation for this issue here -> https://github.com/apache/iceberg/pull/4912. Are there any plans on the Iceberg side to handle it? I'm asking because it's a very common issue, asked about multiple times on various platforms without a solution.

steveloughran commented 1 month ago

I can't speak for the S3FileIO developers; S3AFS is where I code, and while there's a lot of recovery work there and elsewhere, we are all still finding obscure recovery failures one by one, such as the AWS SDK not recovering properly when a multipart upload part fails with a 500.

  1. If you want to use S3FileIO: try those options.
  2. If you want an S3 client that has fixes for all the failures we've hit: S3A is your friend.
  3. Or take up the PR, do your own Iceberg release with it, and let everyone know whether it works. Real-world pre-release testing is the way to do this.
SandeepSinghGahir commented 2 days ago

I can't speak for the S3FileIO developers; S3AFS is where I code, and while there's a lot of recovery work there and elsewhere, we are all still finding obscure recovery failures one by one, such as the AWS SDK not recovering properly when a multipart upload part fails with a 500.

1. If you want to use S3FileIO: try those options.

2. If you want an S3 client that has fixes for all the failures we've hit: S3A is your friend.

3. Or take up the PR, do your own Iceberg release with it, and let everyone know whether it works. Real-world pre-release testing is the way to do this.

I tried the retry options with S3FileIO but I don't see any improvement. Some days the job succeeds without issues, some days it needs 1 retry, and some days 5. So no config seems to work here.

I have also tried the suggestions from your previous comment, using Hadoop S3A and raising aws.retryMode / aws.maxAttempts, but that didn't help either.

I can try with a custom S3A client.

danielcweeks commented 2 days ago

@SandeepSinghGahir I'm really surprised that you're hitting this issue so frequently. Is there something specific about this workload that you think might be triggering this issue?

I asked @bryanck to see how frequently he sees this happening, but I wouldn't expect it to be a common occurrence.

puchengy commented 2 days ago

@danielcweeks We had some workloads where this happened very frequently, and we solved it by using HadoopFileIO instead. Just sharing a data point.
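(For reference, a minimal sketch of that workaround: pointing the catalog's io-impl at HadoopFileIO so reads go through Hadoop's S3A connector rather than S3FileIO. The catalog name and S3A values below are illustrative assumptions, not taken from this thread.)

```java
import org.apache.spark.sql.SparkSession;

public class HadoopFileIoSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-hadoopfileio")
        .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
        // Swap S3FileIO for HadoopFileIO so reads go through the S3A connector.
        .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.hadoop.HadoopFileIO")
        // S3A has its own retry / connection-pool settings (values are illustrative).
        .config("spark.hadoop.fs.s3a.attempts.maximum", "10")
        .config("spark.hadoop.fs.s3a.retry.limit", "10")
        .config("spark.hadoop.fs.s3a.connection.maximum", "200")
        .getOrCreate();
  }
}
```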

bryanck commented 1 day ago

The error for us is fairly infrequent, less than 1 per minute on a large busy cluster, though there are occasional spikes higher. This was enough for us to patch our version of Iceberg and add retries to the S3InputStream.
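(Not the actual patch, but to illustrate the idea: adding retries to the input stream roughly amounts to reopening the object at the current offset when a read dies mid-stream. Everything below is a hypothetical sketch, not Iceberg code.)

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.function.LongFunction;

// Hypothetical sketch: retry reads by reopening the source at the current offset
// (e.g. a ranged S3 GET) when the underlying stream fails with an IOException.
class RetryingInputStream extends InputStream {
  private final LongFunction<InputStream> opener; // reopens the object at a given offset
  private final int maxRetries;
  private InputStream current;
  private long pos;

  RetryingInputStream(LongFunction<InputStream> opener, int maxRetries) {
    this.opener = opener;
    this.maxRetries = maxRetries;
    this.current = opener.apply(0L);
  }

  @Override
  public int read() throws IOException {
    for (int attempt = 0; ; attempt++) {
      try {
        int b = current.read();
        if (b >= 0) {
          pos++; // only advance on a successful byte
        }
        return b;
      } catch (IOException e) {
        if (attempt >= maxRetries) {
          throw e;
        }
        try {
          current.close();
        } catch (IOException ignored) {
          // connection is already broken; nothing useful to do here
        }
        current = opener.apply(pos); // resume from where the last good read ended
      }
    }
  }

  @Override
  public void close() throws IOException {
    current.close();
  }
}
```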

SandeepSinghGahir commented 1 day ago

@SandeepSinghGahir I'm really surprised that you're hitting this issue so frequently. Is there something specific about this workload that you think might be triggering this issue?

I asked @bryanck to see how frequently he sees this happening, but I wouldn't expect it to be a common occurrence.

In our workloads, we process data for 20 marketplaces/countries in separate runs. One observation is that larger data sizes increase the likelihood of encountering this exception. We never see this issue with marketplaces that have fewer records, and we encounter it less frequently with those that have a medium number of records.

Our workloads run on Glue Spark, and the transformation involves joining 4-5 tables, with the driving table containing 25 billion rows. After applying the filtering for the targeted marketplace, we process output data ranging from a few million to 8 billion records (depending on the marketplace).

Even after increasing the number of workers, we continue to face the same issue. If a job takes 2 hours to complete, the exception may be thrown at 30 minutes, or sometimes around an hour. In contrast, when processing data using Hive tables, we do not encounter this issue, although the runtime is longer.

We are transitioning our workloads to open table formats like Iceberg to reduce processing costs. However, with multiple retries, we are incurring higher costs than the savings we initially anticipated.