AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
Apache License 2.0

Error reading variable length ASCII file #433

Closed · pritdb closed this issue 2 years ago

pritdb commented 3 years ago

Describe the bug

I'm running into an issue when trying to read a variable-length, newline-separated ASCII file using Cobrix. Please see the stack trace below:

at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:239)
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:210)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:757)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:81)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:81)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)
    at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.Task.run(Task.scala:91)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:813)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1643)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:816)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:672)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArithmeticException: BigInteger would overflow supported range
    at java.math.BigInteger.reportOverflow(BigInteger.java:1084)
    at java.math.BigInteger.pow(BigInteger.java:2391)
    at java.math.BigDecimal.bigTenToThe(BigDecimal.java:3574)
    at java.math.BigDecimal.bigMultiplyPowerTen(BigDecimal.java:3707)
    at java.math.BigDecimal.setScale(BigDecimal.java:2448)
    at java.math.BigDecimal.setScale(BigDecimal.java:2515)
    at scala.math.BigDecimal.setScale(BigDecimal.scala:646)
    at org.apache.spark.sql.types.Decimal.set(Decimal.scala:147)
    at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:559)
    at org.apache.spark.sql.types.Decimal$.fromDecimal(Decimal.scala:582)
    at org.apache.spark.sql.types.Decimal.fromDecimal(Decimal.scala)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_6.StaticInvoke_1338$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_6.createNamedStruct_36_4$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_6.createNamedStruct_36$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_7.createNamedStruct_52_2$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_7.createNamedStruct_52$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_7.createNamedStruct_53_5$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_7.createNamedStruct_53$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_8.writeFields_35_15$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_8.writeFields_35$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:235)
    ... 30 more

To Reproduce

Steps to reproduce the behaviour OR commands run:

  1. Read a variable-length ASCII file of substantial size (at least 100 GB)
  2. Use the following options to read
    .option("encoding", "ascii")
    .option("is_text", "true")
  3. See error

Expected behaviour

The file should parse correctly.

Screenshots

Please see the stack trace provided above.

yruslan commented 3 years ago

Hi, thanks for the bug report. At first glance, it looks like a value was encountered that cannot fit into a decimal type. It shouldn't be related to the size of the file.

What is the copybook for this file?
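
For context on what the exception itself indicates: the bottom of the stack trace is java.math.BigDecimal.setScale failing because rescaling would multiply by a power of ten too large for BigInteger to represent. A standalone sketch (not Cobrix code, just the same failure mode, assuming a value parsed with a runaway exponent) throws the identical exception:

// Not Cobrix code: a value with an absurdly large exponent makes setScale
// multiply by 10^1000000010, which exceeds BigInteger's supported range.
val x = scala.math.BigDecimal("1E+1000000000")
x.setScale(10) // java.lang.ArithmeticException: BigInteger would overflow supported range

So this is consistent with some numeric field being parsed into a decimal with a far larger scale or exponent than the copybook declares.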

pritdb commented 3 years ago

Thanks, @yruslan. I am attaching the copybook: Sample.txt.

yruslan commented 3 years ago

Thanks, I will take a look and try to reproduce the issue.

yruslan commented 3 years ago

Hi, I have been trying to find the condition that could cause the overflow, but haven't found it so far. A few more questions:

  1. Which Spark and Scala/Python version are you using?
  2. What is your reader statement (the one starting with spark.read...) and the action you are performing (df.write...)?

The reason for these questions is that this error usually means there is a data type incompatibility between schemas. Unfortunately, the Spark error does not tell us which column is affected.
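
One way to narrow it down (a hypothetical debugging sketch, not something from this thread; whether it isolates the field depends on how much column pruning the source applies) is to materialize each decimal column separately and see which one throws:

// Hypothetical sketch: materialize each top-level DecimalType column on
// its own to find the field whose values overflow. Nested structs would
// need the same treatment recursively; consider running on a sample first.
import org.apache.spark.sql.types.DecimalType

for (f <- df.schema.fields if f.dataType.isInstanceOf[DecimalType]) {
  try df.select(f.name).collect()
  catch { case e: Exception => println(s"Column ${f.name} failed: $e") }
}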

yruslan commented 3 years ago

In addition, are you using SaveMode.Overwrite or SaveMode.Append when writing to the output folder?

pritdb commented 3 years ago

Hi @yruslan ,

Here are the versions: Apache Spark 3.1.2, Scala 2.12.10

And the read & write code:

val df = spark
  .read
  .format("cobol")
  .option("copybook", "<path-to-copybook>")
  .option("encoding", "ascii")
  .option("is_text", "true")
  .option("schema_retention_policy", "collapse_root")
  .option("drop_value_fillers", "false")
  .load(inputFile)

// Causes the error
df.count()

// Causes the same error
df
  .write
  .partitionBy("field-1")
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", s"field-1 = '$field1_value' ")
  .save(outputPath)

yruslan commented 3 years ago

Thanks for the info, it's very helpful. I will try testing more using Spark 3.1.2 / Scala 2.12.

yruslan commented 3 years ago

Hi, I still wasn't able to reproduce the issue. I tried various ways that input data can overflow numeric data types. A new version (2.4.4) has been released in which ASCII number parsing is stricter. Please check if the issue persists.

If the issue persists, I'll try testing it on a big ASCII file (1 GB).
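
Picking up the new version is a one-line dependency change, assuming the standard spark-cobol coordinates, e.g. in sbt:

// sbt: pull in the 2.4.4 release with stricter ASCII number parsing
libraryDependencies += "za.co.absa.cobrix" %% "spark-cobol" % "2.4.4"

(or the equivalent --packages za.co.absa.cobrix:spark-cobol_2.12:2.4.4 for spark-shell / spark-submit).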

yruslan commented 2 years ago

Hi, there is progress on fixing the issue. In the meantime, there is a workaround:

 .option("enable_indexes", "false")

and we are working on the fix.
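
Applied to the reader statement above, the workaround is just one extra option (paths as in the original snippet):

val df = spark
  .read
  .format("cobol")
  .option("copybook", "<path-to-copybook>")
  .option("encoding", "ascii")
  .option("is_text", "true")
  .option("schema_retention_policy", "collapse_root")
  .option("drop_value_fillers", "false")
  .option("enable_indexes", "false") // workaround: read files sequentially instead of splitting them via indexes
  .load(inputFile)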

pritdb commented 2 years ago

Great, thanks for the update, @yruslan. I hope the fix will be part of the next release.

yruslan commented 2 years ago

We were unable to reproduce this exact issue, but we found another issue that happens on big ASCII files. If disabling indexes helps with reading your file, there is a good chance that the fix will help as well.

yruslan commented 2 years ago

A new version (2.4.5) has been released. Please let me know if it fixes the issue.

pritdb commented 2 years ago

Thanks a lot, @yruslan, for all the updates. I don't currently have access to the environment where this occurred, but I have asked the folks who do to test it out. I will keep you posted when I hear from them.