Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

Implement Spark-compatible CAST from String to Decimal #325

Open andygrove opened 2 months ago

andygrove commented 2 months ago

What is the problem the feature request solves?

We currently delegate to DataFusion when casting from string to decimal, and there are some differences in behavior compared to Spark.

Describe the potential solution

No response

Additional context

I used the following test in CometCastSuite to explore this.

  test("cast string to decimal") {
    val values = generateStrings(numericPattern, 5).toDF("a")
    castTest(values, DataTypes.createDecimalType(10, 2))
    castTest(values, DataTypes.createDecimalType(10, 0))
    withSQLConf((SQLConf.LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED.key, "true")) {
      castTest(values, DataTypes.createDecimalType(10, -2))
    }
  }
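
On the negative-scale case exercised above: with LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED, a decimal(10, -2) stores an unscaled value that counts hundreds, and Spark rounds HALF_UP when rescaling. A minimal Rust sketch of that rescaling for non-negative inputs (an illustration under those assumptions, not Comet's actual code):

```rust
// Illustration only (not Comet's implementation): rescale a non-negative
// integer to a negative decimal scale with HALF_UP rounding, as Spark does.
// With scale = -2 the stored unscaled value counts hundreds, so the decimal
// represents unscaled * 10^2.
fn rescale_to_negative_scale(value: u64, scale: i32) -> u64 {
    assert!(scale < 0);
    let div = 10u64.pow((-scale) as u32); // e.g. 100 for scale -2
    let (q, r) = (value / div, value % div);
    if r * 2 >= div { q + 1 } else { q } // round half up
}

fn main() {
    // 12345 at decimal(10, -2): unscaled 123, i.e. the value 12300
    println!("{}", rescale_to_negative_scale(12345, -2));
    // 12350 rounds up: unscaled 124, i.e. 12400
    println!("{}", rescale_to_negative_scale(12350, -2));
}
```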

kevinmingtarja commented 2 months ago

Hi, I'd like to contribute to this!

viirya commented 2 months ago

Thanks @kevinmingtarja.

You can take a look at @andygrove's PR as a reference https://github.com/apache/datafusion-comet/pull/307

kevinmingtarja commented 2 months ago

Current state for reference:

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val inputs = Seq("", "0", "1", "+1.0", ".34", "-10.0", "4e7").toDF("n")
inputs: org.apache.spark.sql.DataFrame = [n: string]

scala> inputs.write.mode("overwrite").parquet("test.parquet")
24/04/28 11:46:51 INFO src/lib.rs: Comet native library initialized

scala> val df = spark.read.parquet("test.parquet")
df: org.apache.spark.sql.DataFrame = [n: string]

scala> val df2 = df.withColumn("converted", col("n").cast(DataTypes.createDecimalType(10, 2)))
df2: org.apache.spark.sql.DataFrame = [n: string, converted: decimal(10,2)]

scala> df2.show
+-----+---------+
|    n|converted|
+-----+---------+
|-10.0|   -10.00|
| +1.0|     1.00|
|  .34|     0.34|
|  4e7|     null|
|    1|     1.00|
|    0|     0.00|
|     |     0.00|
+-----+---------+

scala> spark.conf.set("spark.comet.enabled", false)

scala> df2.show
+-----+-----------+
|    n|  converted|
+-----+-----------+
|-10.0|     -10.00|
| +1.0|       1.00|
|  .34|       0.34|
|  4e7|40000000.00|
|    1|       1.00|
|    0|       0.00|
|     |       null|
+-----+-----------+

Note: I encountered a java.lang.AssertionError: assertion failed: Decimal$DecimalIsFractional on the second df2.show, but it seems like a known issue and can safely be ignored: https://kb.databricks.com/scala/decimal-is-fractional-error
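
The transcript above shows two divergences: with Comet enabled, the empty string casts to 0.00 where Spark returns null, and "4e7" casts to null where Spark accepts scientific notation. As a rough sketch of the Spark-side semantics (a hypothetical Rust parser, neither Comet's nor Spark's actual code), the following returns the unscaled value for a given precision/scale, rounds HALF_UP, and yields None for invalid input or precision overflow:

```rust
// Hypothetical sketch of Spark-compatible string-to-decimal parsing.
// Returns the unscaled value (value * 10^scale) as i128, or None (null)
// for invalid input or overflow of `precision` digits.
fn cast_string_to_decimal(s: &str, precision: u32, scale: i32) -> Option<i128> {
    let t = s.trim(); // Spark trims whitespace first
    if t.is_empty() {
        return None; // Spark: empty string casts to null
    }
    let bytes = t.as_bytes();
    let mut i = 0;
    let mut negative = false;
    if bytes[i] == b'+' || bytes[i] == b'-' {
        negative = bytes[i] == b'-';
        i += 1;
    }
    let mut digits: i128 = 0;  // mantissa digits; sign applied at the end
    let mut frac_len: i32 = 0; // digits seen after the '.'
    let mut seen_digit = false;
    let mut in_frac = false;
    while i < bytes.len() {
        match bytes[i] {
            b'0'..=b'9' => {
                digits = digits
                    .checked_mul(10)?
                    .checked_add((bytes[i] - b'0') as i128)?;
                if in_frac { frac_len += 1; }
                seen_digit = true;
            }
            b'.' if !in_frac => in_frac = true,
            b'e' | b'E' => {
                // scientific notation: the rest must be a signed integer
                let exp: i32 = t[i + 1..].parse().ok()?;
                frac_len -= exp;
                break;
            }
            _ => return None,
        }
        i += 1;
    }
    if !seen_digit { return None; }
    // Rescale the mantissa to the target scale, rounding HALF_UP.
    let shift = scale - frac_len;
    let mut unscaled = digits;
    if shift >= 0 {
        for _ in 0..shift { unscaled = unscaled.checked_mul(10)?; }
    } else {
        let mut div: i128 = 1;
        for _ in 0..(-shift) { div = div.checked_mul(10)?; }
        let (q, r) = (unscaled / div, unscaled % div);
        unscaled = q + if r * 2 >= div { 1 } else { 0 };
    }
    // Overflow check: the unscaled value must fit in `precision` digits.
    let mut max: i128 = 1;
    for _ in 0..precision { max = max.checked_mul(10)?; }
    if unscaled >= max { return None; }
    Some(if negative { -unscaled } else { unscaled })
}

fn main() {
    for s in ["", "0", "1", "+1.0", ".34", "-10.0", "4e7"] {
        println!("{:?} -> {:?}", s, cast_string_to_decimal(s, 10, 2));
    }
}
```

Against the inputs above this reproduces Spark's column: "" maps to None, "4e7" to the unscaled value 4000000000 (40000000.00 at scale 2), and "+1.0", ".34", "-10.0" to 100, 34, and -1000 respectively.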

sujithjay commented 3 weeks ago

Hi @kevinmingtarja, are you working on this issue? If not, I would like to work on it. Thank you.

kevinmingtarja commented 3 weeks ago

> Hi @kevinmingtarja, are you working on this issue? If not, I would like to work on it. Thank you.

Hey, I don't think I have the bandwidth right now to complete this, so please feel free to work on it. I have made some progress on a branch in my fork, so feel free to take inspiration from there as well if needed!