Open puchengy opened 4 months ago
Also seeing SSLException when accessing pre-signed urls.
Caused by: javax.net.ssl.SSLException: Connection reset
at sun.security.ssl.Alert.createSSLException(Alert.java:127)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:355)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:298)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:293)
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:142)
at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1430)
at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1395)
at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73)
at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:982)
at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
at org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135)
at io.delta.sharing.client.RandomAccessHttpInputStream.read(RandomAccessHttpInputStream.scala:128)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102)
at org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
at org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1872)
at org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:1020)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:969)
at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:1083)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:134)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:235)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:41)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator$$anon$1.hasNext(RecordReaderIterator.scala:83)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:609)
... 40 more
Suppressed: java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:386)
... 66 more
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:476)
at sun.security.ssl.SSLSocketInputRecord.readFully(SSLSocketInputRecord.java:459)
at sun.security.ssl.SSLSocketInputRecord.decodeInputRecord(SSLSocketInputRecord.java:243)
at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:181)
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:110)
... 63 more
Driver stacktrace:
Do we have any solution to this issue? I'm getting this issue while reading iceberg tables in glue.
Hi, This issue/bug has been open for a while now. Do we know when can we expect a fix? Or is there any workaround?
Background: I'm joining multiple iceberg tables in glue that have 3 merges applied on them. Whenever I do any transform joining these table and write it to non-iceberg glue table, I'm getting SSL connection reset exception. On further checking exception in the executor logs I see Base Reader exception in reading delete files or data files.
Error:
24/08/12 04:07:15 ERROR BaseReader: Error reading file(s): s3://some-bucket/iceberg_catalog/iceberg_db.db/d_table/data/0yWGCw/region_id=1/marketplace_id=7/asin_bucket=7044/00598-112719-90dfe711-47dc-43e7-af6c-3c5395c527b6-00024.parquet, s3://some-bucket/iceberg_catalog/iceberg_db.db/d_table/data/0yWGCw/region_id=1/marketplace_id=7/asin_bucket=7044/01086-113207-90dfe711-47dc-43e7-af6c-3c5395c527b6-00025-deletes.parquet, s3://some-bucket/iceberg_catalog/iceberg_db.db/d_table/data/0yWGCw/region_id=1/marketplace_id=7/asin_bucket=7044/01086-113214-45a89e31-efe0-4110-bdb3-e467a520b1b3-00025-deletes.parquet
org.apache.iceberg.exceptions.RuntimeIOException: javax.net.ssl.SSLException: Connection reset
at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:165) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:141) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:136) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at scala.Option.exists(Option.scala:376) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) ~[?:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source) ~[?:?]
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:968) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:183) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.scheduler.Task.run(Task.scala:138) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.0-amzn-1.jar:?]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.0-amzn-1.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_412]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_412]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_412]
Caused by: javax.net.ssl.SSLException: Connection reset
at sun.security.ssl.Alert.createSSLException(Alert.java:127) ~[?:1.8.0_412]
at sun.security.ssl.TransportContext.fatal(TransportContext.java:331) ~[?:1.8.0_412]
at sun.security.ssl.TransportContext.fatal(TransportContext.java:274) ~[?:1.8.0_412]
at sun.security.ssl.TransportContext.fatal(TransportContext.java:269) ~[?:1.8.0_412]
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:138) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1404) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1372) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:966) ~[?:1.8.0_412]
at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.shaded.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135) ~[iceberg-aws-bundle-1.5.0.jar:?]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
at software.amazon.awssdk.services.s3.internal.checksums.S3ChecksumValidatingInputStream.read(S3ChecksumValidatingInputStream.java:112) ~[iceberg-aws-bundle-1.5.0.jar:?]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
at software.amazon.awssdk.core.internal.metrics.BytesReadTrackingInputStream.read(BytesReadTrackingInputStream.java:49) ~[iceberg-aws-bundle-1.5.0.jar:?]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.s3.S3InputStream.read(S3InputStream.java:109) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1850) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:990) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
... 27 more
Suppressed: java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.8.0_412]
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) ~[?:1.8.0_412]
at java.net.SocketOutputStream.write(SocketOutputStream.java:155) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81) ~[?:1.8.0_412]
at sun.security.ssl.TransportContext.fatal(TransportContext.java:362) ~[?:1.8.0_412]
at sun.security.ssl.TransportContext.fatal(TransportContext.java:274) ~[?:1.8.0_412]
at sun.security.ssl.TransportContext.fatal(TransportContext.java:269) ~[?:1.8.0_412]
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:138) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1404) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1372) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:966) ~[?:1.8.0_412]
at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.shaded.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135) ~[iceberg-aws-bundle-1.5.0.jar:?]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
at software.amazon.awssdk.services.s3.internal.checksums.S3ChecksumValidatingInputStream.read(S3ChecksumValidatingInputStream.java:112) ~[iceberg-aws-bundle-1.5.0.jar:?]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
at software.amazon.awssdk.core.internal.metrics.BytesReadTrackingInputStream.read(BytesReadTrackingInputStream.java:49) ~[iceberg-aws-bundle-1.5.0.jar:?]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.s3.S3InputStream.read(S3InputStream.java:109) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1850) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:990) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:141) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:136) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at scala.Option.exists(Option.scala:376) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) ~[?:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source) ~[?:?]
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:968) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:183) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.scheduler.Task.run(Task.scala:138) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.0-amzn-1.jar:?]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.0-amzn-1.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_412]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_412]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_412]
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210) ~[?:1.8.0_412]
at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:464) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketInputRecord.decodeInputRecord(SSLSocketInputRecord.java:237) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:190) ~[?:1.8.0_412]
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:109) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1404) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1372) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:966) ~[?:1.8.0_412]
at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.shaded.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135) ~[iceberg-aws-bundle-1.5.0.jar:?]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
at software.amazon.awssdk.services.s3.internal.checksums.S3ChecksumValidatingInputStream.read(S3ChecksumValidatingInputStream.java:112) ~[iceberg-aws-bundle-1.5.0.jar:?]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
at software.amazon.awssdk.core.internal.metrics.BytesReadTrackingInputStream.read(BytesReadTrackingInputStream.java:49) ~[iceberg-aws-bundle-1.5.0.jar:?]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.s3.S3InputStream.read(S3InputStream.java:109) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1850) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:990) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
... 27 more
I have tried using an updated version of Iceberg (1.6.0) as well, but I'm getting the same error.
One of those stack traces is from Delta Sharing (delta-io), so it has nothing to do with Iceberg.
Both of them are caused by the AWS SDK itself not retrying, or retrying but not enough times for the problem to recover. There's also HTTP connection pooling at play here: there's no point in the library retrying the request if it keeps returning the failed stream to the pool, where it gets picked up again.
Some suggestions: look at the AWS SDK settings `aws.retryMode` and `aws.maxAttempts` and see if setting things there helps.

Thanks for the suggestion, I will try them out. However, there is a pull request open already. Also, @danielcweeks mentioned a neat implementation for this issue here -> https://github.com/apache/iceberg/pull/4912. Are there any plans on the Iceberg side to handle it? I'm asking because it's a very common issue, asked about multiple times on various platforms without a solution.
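For context on those two settings: `aws.retryMode` and `aws.maxAttempts` are JVM system properties recognized by the AWS SDK for Java v2, so on a Spark/Glue job they have to reach the driver and executor JVMs as `-D` options. A minimal sketch of the relevant Spark conf (the values chosen here are illustrative, not recommendations):

```python
# Illustrative Spark conf only. aws.retryMode / aws.maxAttempts are
# AWS SDK for Java v2 system properties, so they must be passed as
# -D JVM options to BOTH the driver and the executors (the read
# failures in the stack traces above happen on executors).
jvm_opts = "-Daws.retryMode=adaptive -Daws.maxAttempts=10"

spark_conf = {
    "spark.driver.extraJavaOptions": jvm_opts,
    "spark.executor.extraJavaOptions": jvm_opts,
}
```

These would typically be applied via `SparkSession.builder.config(...)` or the job's `--conf` arguments.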
I can't speak for the S3FileIO developers; S3AFS is where I code, and while there's a lot of work there for recovery here and elsewhere, we are all still finding obscure recovery failures one by one, such as how the AWS SDK doesn't recover properly if a multipart part upload fails with a 500.
1. If you want to use the S3FileIO: try those options.
2. If you want an S3 client which has fixes for all the failures we've hit: S3A is your friend.
3. Or take up the PR, do your own Iceberg release with it, and let everyone know if it does or doesn't work. Real-world pre-release testing is the way to do this.
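For option 2 above, an Iceberg catalog can be switched from S3FileIO to HadoopFileIO so reads go through the `s3a://` connector and its retry logic. A hedged sketch of the relevant conf; the catalog name `my_catalog` and the numeric values are illustrative assumptions, not recommendations:

```python
# Illustrative Spark conf only. io-impl switches the Iceberg catalog
# named "my_catalog" (hypothetical) from the default S3FileIO to
# HadoopFileIO. The fs.s3a.* properties tune the S3A connector's
# retry behaviour for throttled and failed requests.
spark_conf = {
    "spark.sql.catalog.my_catalog.io-impl":
        "org.apache.iceberg.hadoop.HadoopFileIO",
    "spark.hadoop.fs.s3a.retry.limit": "7",
    "spark.hadoop.fs.s3a.attempts.maximum": "10",
}
```

Note that with HadoopFileIO the table paths are resolved through the Hadoop filesystem layer, so the S3A configuration (credentials, endpoint, retries) applies instead of the S3FileIO properties.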
I tried the retry options with S3FileIO but I don't see any improvement. Some days the job succeeds without issues, some days it needs 1 retry, and some days 5. So no config seems to work here.

I have also tried your suggestions from the previous comment: using Hadoop S3A, and setting `aws.retryMode` and increasing `aws.maxAttempts`, but that also didn't help.
I can try with a custom S3A client.
@SandeepSinghGahir I'm really surprised that you're hitting this issue so frequently. Is there something specific about this workload that you think might be triggering this issue?
I asked @bryanck to see how frequently he sees this happening, but I wouldn't expect it to be a common occurrence.
@danielcweeks We had a workload where this happened very frequently, and we solved it by using HadoopFileIO instead. Just sharing a data point.
The error for us is fairly infrequent, less than 1 per minute on a large busy cluster, though there are occasional spikes higher. This was enough for us to patch our version of Iceberg and add retries to the S3InputStream.
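The patch described above (adding retries to the S3InputStream) can be sketched generically: wrap the stream, and on a connection-type failure reopen it at the last successfully read offset instead of failing the whole task. This is a hypothetical Python illustration, not the actual Iceberg patch; `open_at` is an assumed caller-supplied factory that returns a file-like object positioned at a given byte offset:

```python
import time


class RetryingReader:
    """Minimal sketch of a read-with-retry wrapper.

    open_at(pos) is a caller-supplied factory (an assumption of this
    sketch) returning a file-like object positioned at byte offset pos.
    On a connection-type error the wrapper reopens the stream at the
    last offset that was read successfully, rather than propagating
    the failure.
    """

    def __init__(self, open_at, max_attempts=3, backoff_s=0.0):
        self._open_at = open_at
        self._max_attempts = max_attempts
        self._backoff_s = backoff_s
        self._pos = 0
        self._stream = open_at(0)

    def read(self, n):
        for attempt in range(1, self._max_attempts + 1):
            try:
                data = self._stream.read(n)
                self._pos += len(data)
                return data
            except (ConnectionError, OSError):
                if attempt == self._max_attempts:
                    raise
                # Simple linear backoff, then resume at the last
                # successfully read offset on a fresh connection.
                time.sleep(self._backoff_s * attempt)
                self._stream = self._open_at(self._pos)
```

In the real Java patch the reopen would translate to issuing a new ranged S3 GET from the current position, which also avoids handing the broken connection back to the HTTP pool.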
> @SandeepSinghGahir I'm really surprised that you're hitting this issue so frequently. Is there something specific about this workload that you think might be triggering this issue?
> I asked @bryanck to see how frequently he sees this happening, but I wouldn't expect it to be a common occurrence.
In our workloads, we process data for 20 marketplaces/countries in separate runs. One observation is that larger data sizes increase the likelihood of encountering this exception. We never see this issue with marketplaces that have fewer records, and we encounter it less frequently with those that have a medium number of records.
Our workloads utilize Glue-Spark, and the transformation process involves joining 4-5 tables, with the driving table containing 25 billion rows. After applying proper filtering for the targeted marketplace, we process output data ranging from a few million to 8 billion records (depending on the marketplace).
Even after increasing the number of workers, we continue to face the same issue. If a job takes 2 hours to complete, the exception may be thrown at 30 minutes, or sometimes around an hour. In contrast, when processing data using Hive tables, we do not encounter this issue, although the runtime is longer.
We are transitioning our workloads to open table formats like Iceberg to reduce processing costs. However, with multiple retries, we are incurring higher costs than the savings we initially anticipated.
Apache Iceberg version: 1.3.1
Query engine: Spark