We have empty json files in our blob storage containers. We have a spark application written in scala that uses the azure-storage connector to pull the files from blob storage and pushes to Nosql database. While testing our code, we found that the connector is throwing exception when it tries to retrieve empty files from the Blob storage. Below is the detail of the exception:
java.io.EOFException: Unexpected end of input stream
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:144)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:184)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:166)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:52)
at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Here is the code snippet where the exception is occuring:
val jsonDF = spark.readStream .format("text") .option("path", "wasbs://history@"+ BlobStorageSink.getAccountName + ".blob.core.windows.net/" + args.assetPath) .option("latestFirst", true) .load()
I see that the failure was while trying to read a byte order marker. You could try replacing the empty files with just the BOM characters and see if the failure still occurs.
We have empty json files in our blob storage containers. We have a spark application written in scala that uses the azure-storage connector to pull the files from blob storage and pushes to Nosql database. While testing our code, we found that the connector is throwing exception when it tries to retrieve empty files from the Blob storage. Below is the detail of the exception:
Here is the code snippet where the exception is occuring:
val jsonDF = spark.readStream .format("text") .option("path", "wasbs://history@"+ BlobStorageSink.getAccountName + ".blob.core.windows.net/" + args.assetPath) .option("latestFirst", true) .load()
Appreciate for help/inputs. Thanks
Best Gnan