Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

javax.net.ssl.SSLException: Connection reset on S3 w/ S3FileIO and Apache HTTP client #10340

Closed puchengy closed 1 month ago

puchengy commented 6 months ago

Apache Iceberg version

1.3.1

Query engine

Spark

Please describe the bug 🐞

24/05/15 15:10:31 ERROR [Executor task launch worker for task 34.0 in stage 14.0 (TID 406)] source.BaseReader: Error reading file(s): s3://bucket/.../file.parquete
org.apache.iceberg.exceptions.RuntimeIOException: javax.net.ssl.SSLException: Connection reset
    at org.apache.iceberg.parquet.ParquetReader$FileIterator.advance(ParquetReader.java:153)
    at org.apache.iceberg.parquet.ParquetReader$FileIterator.next(ParquetReader.java:130)
    at org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:65)
    at org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:49)
    at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:129)
    at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119)
    at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63)
    at scala.Option.exists(Option.scala:376)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:412)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1504)
    at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:457)
    at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:358)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1470)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: javax.net.ssl.SSLException: Connection reset
    at sun.security.ssl.Alert.createSSLException(Alert.java:127)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:324)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:267)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:262)
    at sun.security.ssl.SSLTransport.decode(SSLTransport.java:138)
    at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1400)
    at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1368)
    at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73)
    at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:962)
    at software.amazon.awssdk.thirdparty.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
    at software.amazon.awssdk.thirdparty.org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197)
    at software.amazon.awssdk.thirdparty.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176)
    at software.amazon.awssdk.thirdparty.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at software.amazon.awssdk.services.s3.checksums.ChecksumValidatingInputStream.read(ChecksumValidatingInputStream.java:112)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66)
    at org.apache.iceberg.aws.s3.S3InputStream.read(S3InputStream.java:109)
    at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102)
    at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
    at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1850)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:990)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940)
    at org.apache.iceberg.parquet.ParquetReader$FileIterator.advance(ParquetReader.java:151)
    ... 27 more
    Suppressed: java.net.SocketException: Broken pipe (Write failed)
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
        at sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81)
        at sun.security.ssl.TransportContext.fatal(TransportContext.java:355)
        ... 50 more
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:210)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:464)
    at sun.security.ssl.SSLSocketInputRecord.decodeInputRecord(SSLSocketInputRecord.java:237)
    at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:190)
    at sun.security.ssl.SSLTransport.decode(SSLTransport.java:109)
    ... 47 more
linzhou-db commented 3 months ago

Also seeing an SSLException when accessing pre-signed URLs.

Caused by: javax.net.ssl.SSLException: Connection reset
    at sun.security.ssl.Alert.createSSLException(Alert.java:127)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:355)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:298)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:293)
    at sun.security.ssl.SSLTransport.decode(SSLTransport.java:142)
    at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1430)
    at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1395)
    at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73)
    at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:982)
    at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
    at org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197)
    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176)
    at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135)
    at io.delta.sharing.client.RandomAccessHttpInputStream.read(RandomAccessHttpInputStream.scala:128)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102)
    at org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
    at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
    at org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1872)
    at org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:1020)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:969)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:1083)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:134)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:235)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:41)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator$$anon$1.hasNext(RecordReaderIterator.scala:83)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:609)
    ... 40 more
    Suppressed: java.net.SocketException: Broken pipe (Write failed)
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
        at sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81)
        at sun.security.ssl.TransportContext.fatal(TransportContext.java:386)
        ... 66 more
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:210)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:476)
    at sun.security.ssl.SSLSocketInputRecord.readFully(SSLSocketInputRecord.java:459)
    at sun.security.ssl.SSLSocketInputRecord.decodeInputRecord(SSLSocketInputRecord.java:243)
    at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:181)
    at sun.security.ssl.SSLTransport.decode(SSLTransport.java:110)
    ... 63 more

Driver stacktrace:
SandeepSinghGahir commented 3 months ago

Is there any solution to this issue? I'm hitting it while reading Iceberg tables in Glue.

SandeepSinghGahir commented 3 months ago

Hi, this issue has been open for a while now. Do we know when we can expect a fix? Or is there any workaround?

Background: I'm joining multiple Iceberg tables in Glue that have had 3 merges applied to them. Whenever I run a transform that joins these tables and writes the result to a non-Iceberg Glue table, I get an SSL connection reset exception. Looking further at the exception in the executor logs, I see a BaseReader error while reading delete files or data files.

Error:

24/08/12 04:07:15 ERROR BaseReader: Error reading file(s): s3://some-bucket/iceberg_catalog/iceberg_db.db/d_table/data/0yWGCw/region_id=1/marketplace_id=7/asin_bucket=7044/00598-112719-90dfe711-47dc-43e7-af6c-3c5395c527b6-00024.parquet, s3:// some-bucket/iceberg_catalog/iceberg_db.db/d_table/data/0yWGCw/region_id=1/marketplace_id=7/asin_bucket=7044/01086-113207-90dfe711-47dc-43e7-af6c-3c5395c527b6-00025-deletes.parquet, s3:// some-bucket/iceberg_catalog/iceberg_db.db/d_table/data/0yWGCw/region_id=1/marketplace_id=7/asin_bucket=7044/01086-113214-45a89e31-efe0-4110-bdb3-e467a520b1b3-00025-deletes.parquet
org.apache.iceberg.exceptions.RuntimeIOException: javax.net.ssl.SSLException: Connection reset
 at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:165) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:141) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:136) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at scala.Option.exists(Option.scala:376) ~[scala-library-2.12.15.jar:?]
 at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) ~[?:?]
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]
 at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source) ~[?:?]
 at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:968) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
 at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:183) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.scheduler.Task.run(Task.scala:138) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.0-amzn-1.jar:?]
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.0-amzn-1.jar:?]
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_412]
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_412]
 at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_412]
Caused by: javax.net.ssl.SSLException: Connection reset
 at sun.security.ssl.Alert.createSSLException(Alert.java:127) ~[?:1.8.0_412]
 at sun.security.ssl.TransportContext.fatal(TransportContext.java:331) ~[?:1.8.0_412]
 at sun.security.ssl.TransportContext.fatal(TransportContext.java:274) ~[?:1.8.0_412]
 at sun.security.ssl.TransportContext.fatal(TransportContext.java:269) ~[?:1.8.0_412]
 at sun.security.ssl.SSLTransport.decode(SSLTransport.java:138) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1404) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1372) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:966) ~[?:1.8.0_412]
 at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.shaded.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
 at software.amazon.awssdk.services.s3.internal.checksums.S3ChecksumValidatingInputStream.read(S3ChecksumValidatingInputStream.java:112) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
 at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at software.amazon.awssdk.core.internal.metrics.BytesReadTrackingInputStream.read(BytesReadTrackingInputStream.java:49) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
 at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.s3.S3InputStream.read(S3InputStream.java:109) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1850) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:990) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 ... 27 more
 Suppressed: java.net.SocketException: Broken pipe (Write failed)
 at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.8.0_412]
 at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) ~[?:1.8.0_412]
 at java.net.SocketOutputStream.write(SocketOutputStream.java:155) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81) ~[?:1.8.0_412]
 at sun.security.ssl.TransportContext.fatal(TransportContext.java:362) ~[?:1.8.0_412]
 at sun.security.ssl.TransportContext.fatal(TransportContext.java:274) ~[?:1.8.0_412]
 at sun.security.ssl.TransportContext.fatal(TransportContext.java:269) ~[?:1.8.0_412]
 at sun.security.ssl.SSLTransport.decode(SSLTransport.java:138) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1404) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1372) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:966) ~[?:1.8.0_412]
 at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.shaded.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
 at software.amazon.awssdk.services.s3.internal.checksums.S3ChecksumValidatingInputStream.read(S3ChecksumValidatingInputStream.java:112) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
 at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at software.amazon.awssdk.core.internal.metrics.BytesReadTrackingInputStream.read(BytesReadTrackingInputStream.java:49) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
 at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.s3.S3InputStream.read(S3InputStream.java:109) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1850) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:990) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:141) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:136) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at scala.Option.exists(Option.scala:376) ~[scala-library-2.12.15.jar:?]
 at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) ~[?:?]
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]
 at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source) ~[?:?]
 at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:968) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
 at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:183) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.scheduler.Task.run(Task.scala:138) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.0-amzn-1.jar:?]
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.0-amzn-1.jar:?]
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_412]
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_412]
 at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_412]
Caused by: java.net.SocketException: Connection reset
 at java.net.SocketInputStream.read(SocketInputStream.java:210) ~[?:1.8.0_412]
 at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:464) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketInputRecord.decodeInputRecord(SSLSocketInputRecord.java:237) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:190) ~[?:1.8.0_412]
 at sun.security.ssl.SSLTransport.decode(SSLTransport.java:109) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1404) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1372) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73) ~[?:1.8.0_412]
 at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:966) ~[?:1.8.0_412]
 at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.shaded.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
 at software.amazon.awssdk.services.s3.internal.checksums.S3ChecksumValidatingInputStream.read(S3ChecksumValidatingInputStream.java:112) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
 at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at software.amazon.awssdk.core.internal.metrics.BytesReadTrackingInputStream.read(BytesReadTrackingInputStream.java:49) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
 at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
 at org.apache.iceberg.aws.s3.S3InputStream.read(S3InputStream.java:109) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1850) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:990) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
 ... 27 more

I have also tried an updated version of Iceberg (1.6.0), but I get the same error.

steveloughran commented 3 months ago

One of those stack traces is from Delta Sharing (delta.io), so it has nothing to do with Iceberg.

Both of them are caused by the AWS SDK itself not retrying, or retrying but not enough times for the problem to recover. There's also HTTP connection pooling at play here: there's no point in the library repeating the request if it keeps returning the failed stream to the pool to be picked up again.

Some suggestions
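
For example, the AWS SDK v2 retry behaviour can be raised process-wide through its standard system properties, aws.retryMode and aws.maxAttempts (the same knobs referenced later in this thread). A minimal sketch with illustrative values, assuming the properties are set before any S3 client is built:

```java
public class SdkRetrySketch {
  public static void main(String[] args) {
    // AWS SDK v2 reads these system properties when building its retry policy.
    // Values are illustrative, not a recommendation.
    System.setProperty("aws.retryMode", "standard");  // or "adaptive"
    System.setProperty("aws.maxAttempts", "10");      // total attempts, including the first

    // On Spark/Glue the executor JVMs need them too, e.g.
    //   --conf spark.executor.extraJavaOptions="-Daws.retryMode=standard -Daws.maxAttempts=10"
    // ... launch the job after this point so the SDK picks the settings up.
  }
}
```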

SandeepSinghGahir commented 3 months ago

Thanks for the suggestions, I will try them out. However, there is already a pull request open. Also, @danielcweeks mentioned a neat implementation for this issue here: https://github.com/apache/iceberg/pull/4912. Are there any plans on the Iceberg side to handle it? I'm asking because it's a very common issue, raised multiple times on various platforms without a solution.

steveloughran commented 3 months ago

I can't speak for the S3FileIO developers; S3AFS is where I code, and while there's a lot of work there for recovery here and elsewhere, we are all still finding obscure recovery failures one by one, such as how the AWS SDK doesn't recover properly if a multipart upload part fails with a 500.

  1. If you want to use the S3FileIO: try those options.
  2. If you want an S3 client which has fixes for all the failures we've hit: S3A is your friend (a sketch follows below).
  3. Or you take up the PR, do your own Iceberg release with it, and let everyone know if it does or doesn't work. Real-world pre-release testing is the way to do this.
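
A hedged sketch of option 2 for an Iceberg Spark catalog. The catalog name ("demo"), the table name, and the numeric values are illustrative, and the usual catalog type/warehouse settings are omitted; io-impl is Iceberg's catalog property for selecting the FileIO, and the fs.s3a.* keys are standard Hadoop S3A options:

```java
import org.apache.spark.sql.SparkSession;

public class S3aCatalogSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-s3a-sketch")
        .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
        // plus your usual catalog type / warehouse settings (omitted here)
        .config("spark.sql.catalog.demo.io-impl", "org.apache.iceberg.hadoop.HadoopFileIO")
        // let existing s3:// table locations resolve through the S3A connector
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        // S3A retry / connection-pool tuning (Hadoop property names; values illustrative)
        .config("spark.hadoop.fs.s3a.attempts.maximum", "10")
        .config("spark.hadoop.fs.s3a.retry.limit", "10")
        .config("spark.hadoop.fs.s3a.connection.maximum", "200")
        .getOrCreate();

    spark.read().table("demo.db.events").show();  // hypothetical table
  }
}
```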
SandeepSinghGahir commented 2 months ago

> I can't speak for the S3FileIO developers; S3AFS is where I code, and while there's a lot of work there for recovery here and elsewhere, we are all still finding obscure recovery failures one by one, such as how the AWS SDK doesn't recover properly if a multipart upload part fails with a 500.
>
> 1. If you want to use the S3FileIO: try those options.
>
> 2. If you want an S3 client which has fixes for all the failures we've hit: S3A is your friend.
>
> 3. Or you take up the PR, do your own Iceberg release with it, and let everyone know if it does or doesn't work. Real-world pre-release testing is the way to do this.

I tried the retry options with S3FileIO but I don't see any improvement. Some days the job succeeds without issues, some days it needs 1 retry, and some days 5. So no config seems to work here.

I have also tried your suggestions from the previous comment: using Hadoop S3A and increasing the values for aws.retryMode and aws.maxAttempts, but that also didn't help.

I can try with a custom S3A client.

danielcweeks commented 2 months ago

@SandeepSinghGahir I'm really surprised that you're hitting this issue so frequently. Is there something specific about this workload that you think might be triggering this issue?

I asked @bryanck to see how frequently he sees this happening, but I wouldn't expect it to be a common occurrence.

puchengy commented 2 months ago

@danielcweeks We had some workloads where this happened very frequently, and we solved it by using HadoopFileIO instead. Just sharing a data point.

bryanck commented 2 months ago

The error for us is fairly infrequent, less than 1 per minute on a large busy cluster, though there are occasional spikes higher. This was enough for us to patch our version of Iceberg and add retries to the S3InputStream.
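
The actual patch isn't shown in this thread; purely as a rough, simplified sketch of the idea (not Iceberg's real S3InputStream), a wrapper that reopens the underlying object stream and resumes from the last good offset when a read fails might look like this:

```java
import java.io.IOException;
import java.io.InputStream;

// Sketch only: retry connection-level read failures by reopening the source
// (e.g. an S3 ranged GET) at the last successfully read position.
public class ResumingInputStream extends InputStream {
  /** Something that can open the underlying object starting at a byte offset. */
  public interface Opener {
    InputStream openAt(long offset) throws IOException;
  }

  private final Opener opener;
  private final int maxRetries;
  private InputStream delegate;
  private long pos;

  public ResumingInputStream(Opener opener, int maxRetries) throws IOException {
    this.opener = opener;
    this.maxRetries = maxRetries;
    this.delegate = opener.openAt(0);
  }

  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    IOException last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        int n = delegate.read(b, off, len);
        if (n > 0) {
          pos += n;
        }
        return n;
      } catch (IOException e) {          // e.g. SSLException / SocketException: Connection reset
        last = e;
        try { delegate.close(); } catch (IOException ignored) { }
        delegate = opener.openAt(pos);   // reopen and resume where we left off
      }
    }
    throw last;
  }

  @Override
  public int read() throws IOException {
    byte[] one = new byte[1];
    int n = read(one, 0, 1);
    return n == -1 ? -1 : (one[0] & 0xFF);
  }

  @Override
  public void close() throws IOException {
    delegate.close();
  }
}
```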

SandeepSinghGahir commented 2 months ago

> @SandeepSinghGahir I'm really surprised that you're hitting this issue so frequently. Is there something specific about this workload that you think might be triggering this issue?
>
> I asked @bryanck to see how frequently he sees this happening, but I wouldn't expect it to be a common occurrence.

In our workloads, we process data for 20 marketplaces/countries in separate runs. One observation is that larger data sizes increase the likelihood of encountering this exception. We never see this issue with marketplaces that have fewer records, and we encounter it less frequently with those that have a medium number of records.

Our workloads use Glue-Spark, and the transformation involves joining 4-5 tables, with the driving table containing 25 billion rows. After applying the filtering for the targeted marketplace, we process output data ranging from a few million to 8 billion records (depending on the marketplace).

Even after increasing the number of workers, we continue to face the same issue. If a job takes 2 hours to complete, the exception may be thrown at 30 minutes, or sometimes around an hour. In contrast, when processing data using Hive tables, we do not encounter this issue, although the runtime is longer.

We are transitioning our workloads to open table formats like Iceberg to reduce processing costs. However, with multiple retries, we are incurring higher costs than the savings we initially anticipated.

danielcweeks commented 2 months ago

@SandeepSinghGahir Thanks for the additional context (it really helps to have specifics like this). I think we're close to having a solution for this and @amogh-jahagirdar will likely have it for the 1.7 release.

steveloughran commented 1 month ago

> we solved it by using HadoopFileIO instead.

We put a lot of effort into making S3 stack traces go away, usually adding more handling one support call at a time. Special mention for OpenSSL there. You aren't using it underneath, are you?

Now that you are using the S3A connector, if you can adopt Parquet 1.14.1 and Hadoop 3.4.0, then you can switch Parquet to using Hadoop's vector IO for a significant speedup in Parquet reads.
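
A minimal sketch of that switch, assuming Parquet 1.14.x and Hadoop 3.4.x are on the classpath; parquet.hadoop.vectored.io.enabled is, to the best of my knowledge, the Parquet read option that enables the Hadoop vectored-IO path, so treat the exact key as an assumption to verify:

```java
import org.apache.spark.sql.SparkSession;

public class VectoredIoSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("parquet-vectored-io-sketch")
        // Hand the flag to Parquet through the Hadoop configuration (key assumed, see above).
        .config("spark.hadoop.parquet.hadoop.vectored.io.enabled", "true")
        .getOrCreate();

    spark.read().parquet("s3a://your-bucket/path/").show();  // hypothetical path
  }
}
```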

SandeepSinghGahir commented 1 month ago

@danielcweeks thanks a lot for the update and prioritizing the fix. Looking forward to the 1.7 release. @amogh-jahagirdar thanks for all the hard work 🙌

SandeepSinghGahir commented 1 month ago

Should we add this issue to the v1.7 milestone? https://github.com/apache/iceberg/milestone/47

danielcweeks commented 1 month ago

@SandeepSinghGahir The PR that @amogh-jahagirdar implemented will be included in v1.7, so there's no need to add it to the milestone. We generally use the milestone to track larger items that we want to target for the release, but in this case, I think we're already good.

SandeepSinghGahir commented 3 weeks ago

Hi, I just found out from the milestones that v1.7 will no longer support Java 8. However, AWS Glue 4.0 only supports Java 8, so we won't be able to use v1.7. I also read in the mail archives that 1.6.x, which has Java 8 support, will continue to be supported. Is there a plan to release 1.6.2 with these bug fixes while keeping Java 8 support?

GTerrygo commented 1 day ago

I encountered the same error while attempting to load a large Iceberg table in AWS Glue. Can we prioritize the bug fix for version 1.6.2, since AWS Glue does not support Java 11?