databricks / spark-xml

XML data source for Spark SQL and DataFrames

DecimalType parsing fails on some values #622

Closed: agolovenko closed this issue 1 year ago

agolovenko commented 1 year ago

I'm using the from_xml function to parse messages on Spark 3.3.0, spark-xml 0.15.0, and Scala 2.12.

Oftentimes I see errors while parsing DecimalType values. Attached below is a spec that fails with the following exception:

Decimal scale (4) cannot be greater than precision (1).
org.apache.spark.sql.AnalysisException: Decimal scale (4) cannot be greater than precision (1).
    at org.apache.spark.sql.errors.QueryCompilationErrors$.decimalCannotGreaterThanPrecisionError(QueryCompilationErrors.scala:1690)
    at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:49)

import com.databricks.spark.xml.functions.from_xml
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}
import org.scalatest.matchers.should.Matchers
import org.scalatest.wordspec.AnyWordSpec

class SparkXmlSpec extends AnyWordSpec with Matchers {
  implicit val spark: SparkSession = SparkSession
    .builder()
    .master("local[*]")
    .appName("UnitTest session")
    .getOrCreate()

  import spark.implicits._

  private val schema = StructType(
    Seq(
      StructField("Number", DecimalType(7, 4), nullable = false)
    )
  )

  private val failingNumbers = Seq("0.0000", "0.01")

  "reads xml" in {
    val xmlOptions = Map(
      "rowTag" -> "Row"
    )

    val outputDF = failingNumbers
      .map { n =>
        s"""<?xml version="1.0" encoding="UTF-8"?>
        |<Row> <Number>$n</Number> </Row>
        |""".stripMargin
      }
      .toDF("xml")
      .withColumn("parsed", from_xml(col("xml"), schema, xmlOptions))
      .select("parsed.Number")

    outputDF.show(false)
  }

  "reads json" in {
    val outputDF = failingNumbers
      .map { n => s"""{ "Number": $n }""" }
      .toDF("json")
      .withColumn("parsed", from_json(col("json"), schema))
      .select("parsed.Number")

    outputDF.show(false)
  }
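
  // Workaround sketch (my own addition, not part of the original report): parse Number
  // as a string and cast it to the target DecimalType afterwards, sidestepping the
  // BigDecimal precision/scale mismatch until a fix lands.
  "reads xml via string cast" in {
    val stringSchema = StructType(
      Seq(StructField("Number", org.apache.spark.sql.types.StringType, nullable = false))
    )

    val outputDF = failingNumbers
      .map { n => s"<Row> <Number>$n</Number> </Row>" }
      .toDF("xml")
      .withColumn("parsed", from_xml(col("xml"), stringSchema, Map("rowTag" -> "Row")))
      .select(col("parsed.Number").cast(DecimalType(7, 4)).as("Number"))

    outputDF.show(false)
  }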
}
srowen commented 1 year ago

OK, yeah, I see the problem. Java allows a BigDecimal whose scale is greater than its precision, but Spark doesn't accept a decimal type like that. It ultimately accepts the value fine if I return a Decimal rather than a BigDecimal internally. I will submit a PR shortly.
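
For illustration, a minimal standalone sketch of the mismatch (my own example, not the spark-xml code itself): java.math.BigDecimal happily represents "0.0000" with precision 1 and scale 4, which is exactly the combination DecimalType rejects, while Spark's Decimal can be constructed with the schema's precision and scale.

import org.apache.spark.sql.types.{Decimal, DecimalType}

object DecimalPrecisionSketch extends App {
  // "0.0000" has unscaled value 0, so BigDecimal reports precision 1 but scale 4.
  val bd = new java.math.BigDecimal("0.0000")
  println(s"precision=${bd.precision}, scale=${bd.scale}") // precision=1, scale=4

  // Building a DecimalType from those numbers is what blows up:
  // DecimalType(bd.precision, bd.scale)
  //   => AnalysisException: Decimal scale (4) cannot be greater than precision (1).

  // Wrapping the value in Spark's Decimal with the schema's precision/scale is accepted:
  val d = Decimal(BigDecimal("0.0000"), 7, 4)
  println(d) // 0.0000
}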

agolovenko commented 1 year ago

Thanks for the prompt fix, @srowen!