Also seeing SSLException when accessing pre-signed URLs.
Caused by: javax.net.ssl.SSLException: Connection reset
at sun.security.ssl.Alert.createSSLException(Alert.java:127)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:355)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:298)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:293)
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:142)
at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1430)
at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1395)
at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73)
at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:982)
at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
at org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135)
at io.delta.sharing.client.RandomAccessHttpInputStream.read(RandomAccessHttpInputStream.scala:128)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102)
at org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
at org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1872)
at org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:1020)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:969)
at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:1083)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:134)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:235)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:41)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator$$anon$1.hasNext(RecordReaderIterator.scala:83)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:609)
... 40 more
Suppressed: java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:386)
... 66 more
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:476)
at sun.security.ssl.SSLSocketInputRecord.readFully(SSLSocketInputRecord.java:459)
at sun.security.ssl.SSLSocketInputRecord.decodeInputRecord(SSLSocketInputRecord.java:243)
at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:181)
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:110)
... 63 more
Driver stacktrace:
Do we have any solution to this issue? I'm hitting it while reading Iceberg tables in Glue.
Hi, this issue has been open for a while now. Do we know when we can expect a fix, or is there any workaround?
Background: I'm joining multiple Iceberg tables in Glue that have had 3 merges applied to them. Whenever I run a transform that joins these tables and writes the result to a non-Iceberg Glue table, I get an SSL connection reset exception. On further checking the executor logs, I see a BaseReader exception while reading delete files or data files.
Error:
24/08/12 04:07:15 ERROR BaseReader: Error reading file(s): s3://some-bucket/iceberg_catalog/iceberg_db.db/d_table/data/0yWGCw/region_id=1/marketplace_id=7/asin_bucket=7044/00598-112719-90dfe711-47dc-43e7-af6c-3c5395c527b6-00024.parquet, s3:// some-bucket/iceberg_catalog/iceberg_db.db/d_table/data/0yWGCw/region_id=1/marketplace_id=7/asin_bucket=7044/01086-113207-90dfe711-47dc-43e7-af6c-3c5395c527b6-00025-deletes.parquet, s3:// some-bucket/iceberg_catalog/iceberg_db.db/d_table/data/0yWGCw/region_id=1/marketplace_id=7/asin_bucket=7044/01086-113214-45a89e31-efe0-4110-bdb3-e467a520b1b3-00025-deletes.parquet
org.apache.iceberg.exceptions.RuntimeIOException: javax.net.ssl.SSLException: Connection reset
at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:165) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:141) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:136) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at scala.Option.exists(Option.scala:376) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) ~[?:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source) ~[?:?]
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:968) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:183) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.scheduler.Task.run(Task.scala:138) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.0-amzn-1.jar:?]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.0-amzn-1.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_412]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_412]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_412]
Caused by: javax.net.ssl.SSLException: Connection reset
at sun.security.ssl.Alert.createSSLException(Alert.java:127) ~[?:1.8.0_412]
at sun.security.ssl.TransportContext.fatal(TransportContext.java:331) ~[?:1.8.0_412]
at sun.security.ssl.TransportContext.fatal(TransportContext.java:274) ~[?:1.8.0_412]
at sun.security.ssl.TransportContext.fatal(TransportContext.java:269) ~[?:1.8.0_412]
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:138) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1404) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1372) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:966) ~[?:1.8.0_412]
at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.shaded.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135) ~[iceberg-aws-bundle-1.5.0.jar:?]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
at software.amazon.awssdk.services.s3.internal.checksums.S3ChecksumValidatingInputStream.read(S3ChecksumValidatingInputStream.java:112) ~[iceberg-aws-bundle-1.5.0.jar:?]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
at software.amazon.awssdk.core.internal.metrics.BytesReadTrackingInputStream.read(BytesReadTrackingInputStream.java:49) ~[iceberg-aws-bundle-1.5.0.jar:?]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.s3.S3InputStream.read(S3InputStream.java:109) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1850) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:990) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
... 27 more
Suppressed: java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.8.0_412]
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) ~[?:1.8.0_412]
at java.net.SocketOutputStream.write(SocketOutputStream.java:155) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81) ~[?:1.8.0_412]
at sun.security.ssl.TransportContext.fatal(TransportContext.java:362) ~[?:1.8.0_412]
at sun.security.ssl.TransportContext.fatal(TransportContext.java:274) ~[?:1.8.0_412]
at sun.security.ssl.TransportContext.fatal(TransportContext.java:269) ~[?:1.8.0_412]
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:138) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1404) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1372) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:966) ~[?:1.8.0_412]
at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.shaded.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135) ~[iceberg-aws-bundle-1.5.0.jar:?]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
at software.amazon.awssdk.services.s3.internal.checksums.S3ChecksumValidatingInputStream.read(S3ChecksumValidatingInputStream.java:112) ~[iceberg-aws-bundle-1.5.0.jar:?]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
at software.amazon.awssdk.core.internal.metrics.BytesReadTrackingInputStream.read(BytesReadTrackingInputStream.java:49) ~[iceberg-aws-bundle-1.5.0.jar:?]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.s3.S3InputStream.read(S3InputStream.java:109) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1850) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:990) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:141) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:136) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at scala.Option.exists(Option.scala:376) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) ~[?:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source) ~[?:?]
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:968) ~[spark-sql_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:183) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.scheduler.Task.run(Task.scala:138) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.0-amzn-1.jar:?]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.0-amzn-1.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_412]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_412]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_412]
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210) ~[?:1.8.0_412]
at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:464) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketInputRecord.decodeInputRecord(SSLSocketInputRecord.java:237) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:190) ~[?:1.8.0_412]
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:109) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1404) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1372) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73) ~[?:1.8.0_412]
at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:966) ~[?:1.8.0_412]
at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.shaded.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135) ~[iceberg-aws-bundle-1.5.0.jar:?]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
at software.amazon.awssdk.services.s3.internal.checksums.S3ChecksumValidatingInputStream.read(S3ChecksumValidatingInputStream.java:112) ~[iceberg-aws-bundle-1.5.0.jar:?]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
at software.amazon.awssdk.core.internal.metrics.BytesReadTrackingInputStream.read(BytesReadTrackingInputStream.java:49) ~[iceberg-aws-bundle-1.5.0.jar:?]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:1.8.0_412]
at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66) ~[iceberg-aws-bundle-1.5.0.jar:?]
at org.apache.iceberg.aws.s3.S3InputStream.read(S3InputStream.java:109) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1850) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:990) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163) ~[iceberg-spark-runtime-3.3_2.12-1.5.0.jar:?]
... 27 more
I have tried an updated version of Iceberg (1.6.0) as well, but I'm getting the same error.
One of those stack traces is from delta.io (Delta Sharing), so it has nothing to do with Iceberg.
Both of them are caused by the AWS SDK itself not retrying, or retrying but not enough times for the problem to recover. There's also HTTP connection pooling at play here: there's no point in the library repeating the request if it keeps returning the failed stream to the pool to be picked up again.
Some suggestions: look at the AWS SDK settings aws.retryMode and aws.maxAttempts and see if setting things there helps.
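A minimal sketch of how those two settings might be supplied, assuming the AWS SDK for Java v2 system properties aws.retryMode and aws.maxAttempts; the class name, the "adaptive" mode, and the attempt count are illustrative choices, not values from this thread:

```java
// Sketch only: illustrative values, not recommendations from this thread.
public class AwsRetrySettings {
  public static void main(String[] args) {
    // AWS SDK for Java v2 system properties; must be set before the first S3 client is built.
    System.setProperty("aws.retryMode", "adaptive"); // legacy | standard | adaptive
    System.setProperty("aws.maxAttempts", "10");     // total attempts, including the first

    // Build the SparkSession / run the Iceberg read after this point. For executors the same
    // values normally have to be passed at submission time, e.g. via
    //   spark.executor.extraJavaOptions = -Daws.retryMode=adaptive -Daws.maxAttempts=10
  }
}
```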
Thanks for the suggestion, I will try them out. However, there is a pull request open already. Also, @danielcweeks mentioned a neat implementation for this issue here -> https://github.com/apache/iceberg/pull/4912. Are there any plans on the Iceberg side to handle it? I'm asking because it's a very common issue, asked about multiple times on various platforms without a solution.
I can't speak for the S3FileIO developers; S3AFS is where I code, and while there's a lot of work there for recovery here and elsewhere, we are all still finding obscure recovery failures one by one, such as how the AWS SDK doesn't recover properly if a multipart upload part fails with a 500.
1. If you want to use the S3FileIO: try those options.
2. If you want an S3 client which has fixes for all the failures we've hit: S3A is your friend (a tuning sketch follows below).
3. Or take up the PR, do your own Iceberg release with it, and let everyone know whether it works. Real-world pre-release testing is the way to do this.
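For option 2, a hedged sketch of S3A retry tuning, assuming the job reads through s3a:// paths; the property names are standard Hadoop S3A settings, but the numbers are only illustrative starting points:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.SparkSession;

public class S3ARetryTuning {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("s3a-retry-example").getOrCreate();
    Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();

    // Standard S3A settings; the numbers below are illustrative starting points only.
    hadoopConf.set("fs.s3a.retry.limit", "10");         // retries inside the S3A connector
    hadoopConf.set("fs.s3a.attempts.maximum", "10");    // retries inside the AWS SDK used by S3A
    hadoopConf.set("fs.s3a.connection.maximum", "200"); // larger pool for wide parallel reads

    spark.stop();
  }
}
```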
I tried the retry options with S3FileIO but I don't see any improvement. Some days the job succeeds without issues, some days it needs 1 retry, and some days 5. So no config seems to work here.
I have also tried your suggestions from the previous comment, using Hadoop s3a and raising the aws.retryMode / aws.maxAttempts retry settings, but that also didn't help.
I can try with a custom S3A client.
@SandeepSinghGahir I'm really surprised that you're hitting this issue so frequently. Is there something specific about this workload that you think might be triggering this issue?
I asked @bryanck to see how frequently he sees this happening, but I wouldn't expect it to be a common occurrence.
@danielcweeks We had a workload where this happened very frequently, and we solved it by using HadoopFileIO instead. Just sharing a data point.
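For reference, a sketch of what that workaround might look like in Spark, assuming a Glue catalog; the catalog name, warehouse path, and table are placeholders based on names appearing earlier in this thread, so adapt them to your own setup:

```java
import org.apache.spark.sql.SparkSession;

public class HadoopFileIoCatalog {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-hadoopfileio-example")
        .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
        .config("spark.sql.catalog.glue_catalog.warehouse", "s3a://some-bucket/iceberg_catalog/")
        // The key change: replace the default S3FileIO with HadoopFileIO so reads go
        // through the Hadoop FileSystem layer (S3A) instead of the AWS SDK S3 client.
        .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.hadoop.HadoopFileIO")
        // Existing data paths in the table metadata use the s3:// scheme, so map that
        // scheme to S3A as well (assumption: no other FileSystem claims s3:// here).
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .getOrCreate();

    spark.sql("SELECT count(*) FROM glue_catalog.iceberg_db.d_table").show();
    spark.stop();
  }
}
```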
The error for us is fairly infrequent, less than 1 per minute on a large busy cluster, though there are occasional spikes higher. This was enough for us to patch our version of Iceberg and add retries to the S3InputStream.
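The patch itself isn't shown in this thread, but the general idea is roughly the following sketch: when the connection is reset, reopen the object at the current offset and retry instead of failing the task. RetryingRangeReader and openAt are hypothetical names used only for illustration; this is not the actual Iceberg change.

```java
import java.io.IOException;
import java.io.InputStream;

// Illustration of the idea only; not the actual Iceberg patch. `openAt` is a
// hypothetical helper standing in for a ranged S3 GET positioned at `offset`.
abstract class RetryingRangeReader {
  private static final int MAX_ATTEMPTS = 3;

  protected abstract InputStream openAt(long offset) throws IOException;

  public int readRetrying(long offset, byte[] buf, int off, int len) throws IOException {
    IOException last = null;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      try (InputStream in = openAt(offset)) {
        return in.read(buf, off, len);
      } catch (javax.net.ssl.SSLException | java.net.SocketException e) {
        last = e; // connection reset / broken pipe: reopen the stream and try again
      }
    }
    throw last;
  }
}
```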
@SandeepSinghGahir I'm really surprised that you're hitting this issue so frequently. Is there something specific about this workload that you think might be triggering this issue?
I asked @bryanck to see how frequently he sees this happening, but I wouldn't expect it to be a common occurrence.
In our workloads, we process data for 20 marketplaces/countries in separate runs. One observation is that larger data sizes increase the likelihood of encountering this exception. We never see this issue with marketplaces that have fewer records, and we encounter it less frequently with those that have a medium number of records.
Our workloads utilize Glue-Spark, and the transformation process involves joining 4-5 tables, with the driving table containing 25 billion rows. After applying proper filtering for the targeted marketplace, we process output data ranging from a few million to 8 billion records (depending on the marketplace).
Even after increasing the number of workers, we continue to face the same issue. If a job takes 2 hours to complete, the exception may be thrown at 30 minutes, or sometimes around an hour. In contrast, when processing data using Hive tables, we do not encounter this issue, although the runtime is longer.
We are transitioning our workloads to use open table formats like Iceberg to reduce processing costs. However, with multiple retries, we are incurring higher costs than we initially anticipated in savings.
@SandeepSinghGahir Thanks for the additional context (it really helps to have specifics like this). I think we're close to having a solution for this and @amogh-jahagirdar will likely have it for the 1.7 release.
we solved it by using HadoopFileIO instead.
We put a lot of effort into making S3 stack traces go away, usually adding more handling one support call at a time. Special mention for OpenSSL there. You aren't using it underneath, are you?
Now that you are using the S3A connector: if you can adopt Parquet 1.14.1 and Hadoop 3.4.0, you can switch Parquet to using Hadoop's vectored IO for a significant speedup in Parquet reads.
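If it helps, a hedged sketch of enabling that: to my understanding the switch is the flag below as of Parquet 1.14.x, but the exact property name should be verified against the Parquet release you deploy.

```java
import org.apache.spark.sql.SparkSession;

public class VectoredIoExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("parquet-vectored-io").getOrCreate();
    // Assumed flag name for Parquet 1.14.x; confirm it against the Parquet docs/release notes.
    spark.sparkContext().hadoopConfiguration()
        .setBoolean("parquet.hadoop.vectored.io.enabled", true);
    spark.stop();
  }
}
```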
@danielcweeks thanks a lot for the update and prioritizing the fix. Looking forward to the 1.7 release. @amogh-jahagirdar thanks for all the hard work 🙌
Should we add this issue to the v1.7 milestone? https://github.com/apache/iceberg/milestone/47
@SandeepSinghGahir The PR that @amogh-jahagirdar implemented will be included in v1.7, so there's no need to add it to the milestone. We generally use the milestone to track larger items that we want to target for the release, but in this case, I think we're already good.
Hi, I just found out in the milestones that v1.7 will no longer support Java 8. However, AWS Glue 4.0 only supports Java 8, so we won't be able to use v1.7. I also read in the mail archives that 1.6.x, which has Java 8 support, will continue to be supported. Is there a plan to release 1.6.2 to include this bug fix while still supporting Java 8?
I encountered the same error while attempting to load a large Iceberg table in AWS Glue. Can we prioritize the bug fix for version 1.6.2, since AWS Glue does not support Java 11?
Apache Iceberg version: 1.3.1
Query engine: Spark
Please describe the bug 🐞