Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

Implement Spark-compatible CAST from String to Decimal #325

Open andygrove opened 2 months ago

andygrove commented 2 months ago

What is the problem the feature request solves?

We currently delegate to DataFusion when casting from string to decimal, and there are some differences in behavior compared to Spark.

Describe the potential solution

No response

Additional context

I used the following test in CometCastSuite to explore this.

  test("cast string to decimal") {
    val values = generateStrings(numericPattern, 5).toDF("a")
    castTest(values, DataTypes.createDecimalType(10, 2))
    castTest(values, DataTypes.createDecimalType(10, 0))
    withSQLConf((SQLConf.LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED.key, "true")) {
      castTest(values, DataTypes.createDecimalType(10, -2))
    }
  }
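
On the negative-scale case exercised above: with LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED, a decimal(10, -2) stores an unscaled value that counts hundreds, and Spark rounds HALF_UP when rescaling. A minimal Rust sketch of that rescaling for non-negative inputs (an illustration under those assumptions, not Comet's actual code):

```rust
// Illustration only (not Comet's implementation): rescale a non-negative
// integer to a negative decimal scale with HALF_UP rounding, as Spark does.
// With scale = -2 the stored unscaled value counts hundreds, so the decimal
// represents unscaled * 10^2.
fn rescale_to_negative_scale(value: u64, scale: i32) -> u64 {
    assert!(scale < 0);
    let div = 10u64.pow((-scale) as u32); // e.g. 100 for scale -2
    let (q, r) = (value / div, value % div);
    if r * 2 >= div { q + 1 } else { q } // round half up
}

fn main() {
    // 12345 at decimal(10, -2): unscaled 123, i.e. the value 12300
    println!("{}", rescale_to_negative_scale(12345, -2));
    // 12350 rounds up: unscaled 124, i.e. 12400
    println!("{}", rescale_to_negative_scale(12350, -2));
}
```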

kevinmingtarja commented 2 months ago

Hi, I'd like to contribute to this!

viirya commented 2 months ago

Thanks @kevinmingtarja.

You can take a look at @andygrove's PR as a reference https://github.com/apache/datafusion-comet/pull/307

kevinmingtarja commented 2 months ago

Current state for reference:

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val inputs = Seq("", "0", "1", "+1.0", ".34", "-10.0", "4e7").toDF("n")
inputs: org.apache.spark.sql.DataFrame = [n: string]

scala> inputs.write.mode("overwrite").parquet("test.parquet")
24/04/28 11:46:51 INFO src/lib.rs: Comet native library initialized

scala> val df = spark.read.parquet("test.parquet")
df: org.apache.spark.sql.DataFrame = [n: string]

scala> val df2 = df.withColumn("converted", col("n").cast(DataTypes.createDecimalType(10, 2)))
df2: org.apache.spark.sql.DataFrame = [n: string, converted: decimal(10,2)]

scala> df2.show
+-----+---------+
|    n|converted|
+-----+---------+
|-10.0|   -10.00|
| +1.0|     1.00|
|  .34|     0.34|
|  4e7|     null|
|    1|     1.00|
|    0|     0.00|
|     |     0.00|
+-----+---------+

scala> spark.conf.set("spark.comet.enabled", false)

scala> df2.show
+-----+-----------+
|    n|  converted|
+-----+-----------+
|-10.0|     -10.00|
| +1.0|       1.00|
|  .34|       0.34|
|  4e7|40000000.00|
|    1|       1.00|
|    0|       0.00|
|     |       null|
+-----+-----------+

Note: I encountered a java.lang.AssertionError: assertion failed: Decimal$DecimalIsFractional on the second df2.show, but it seems like a known issue and can safely be ignored: https://kb.databricks.com/scala/decimal-is-fractional-error
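
The transcript above shows two divergences: with Comet enabled, the empty string casts to 0.00 where Spark returns null, and "4e7" casts to null where Spark accepts scientific notation. As a rough sketch of the Spark-side semantics (a hypothetical Rust parser, neither Comet's nor Spark's actual code), the following returns the unscaled value for a given precision/scale, rounds HALF_UP, and yields None for invalid input or precision overflow:

```rust
// Hypothetical sketch of Spark-compatible string-to-decimal parsing.
// Returns the unscaled value (value * 10^scale) as i128, or None (null)
// for invalid input or overflow of `precision` digits.
fn cast_string_to_decimal(s: &str, precision: u32, scale: i32) -> Option<i128> {
    let t = s.trim(); // Spark trims whitespace first
    if t.is_empty() {
        return None; // Spark: empty string casts to null
    }
    let bytes = t.as_bytes();
    let mut i = 0;
    let mut negative = false;
    if bytes[i] == b'+' || bytes[i] == b'-' {
        negative = bytes[i] == b'-';
        i += 1;
    }
    let mut digits: i128 = 0;  // mantissa digits; sign applied at the end
    let mut frac_len: i32 = 0; // digits seen after the '.'
    let mut seen_digit = false;
    let mut in_frac = false;
    while i < bytes.len() {
        match bytes[i] {
            b'0'..=b'9' => {
                digits = digits
                    .checked_mul(10)?
                    .checked_add((bytes[i] - b'0') as i128)?;
                if in_frac { frac_len += 1; }
                seen_digit = true;
            }
            b'.' if !in_frac => in_frac = true,
            b'e' | b'E' => {
                // scientific notation: the rest must be a signed integer
                let exp: i32 = t[i + 1..].parse().ok()?;
                frac_len -= exp;
                break;
            }
            _ => return None,
        }
        i += 1;
    }
    if !seen_digit { return None; }
    // Rescale the mantissa to the target scale, rounding HALF_UP.
    let shift = scale - frac_len;
    let mut unscaled = digits;
    if shift >= 0 {
        for _ in 0..shift { unscaled = unscaled.checked_mul(10)?; }
    } else {
        let mut div: i128 = 1;
        for _ in 0..(-shift) { div = div.checked_mul(10)?; }
        let (q, r) = (unscaled / div, unscaled % div);
        unscaled = q + if r * 2 >= div { 1 } else { 0 };
    }
    // Overflow check: the unscaled value must fit in `precision` digits.
    let mut max: i128 = 1;
    for _ in 0..precision { max = max.checked_mul(10)?; }
    if unscaled >= max { return None; }
    Some(if negative { -unscaled } else { unscaled })
}

fn main() {
    for s in ["", "0", "1", "+1.0", ".34", "-10.0", "4e7"] {
        println!("{:?} -> {:?}", s, cast_string_to_decimal(s, 10, 2));
    }
}
```

Against the inputs above this reproduces Spark's column: "" maps to None, "4e7" to the unscaled value 4000000000 (40000000.00 at scale 2), and "+1.0", ".34", "-10.0" to 100, 34, and -1000 respectively.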

sujithjay commented 3 weeks ago

Hi @kevinmingtarja, are you working on this issue? If not, I would like to work on it. Thank you.

kevinmingtarja commented 3 weeks ago

> Hi @kevinmingtarja, are you working on this issue? If not, I would like to work on it. Thank you.

Hey, I don't think I have the bandwidth right now to complete this, so please feel free to work on it. I have made some progress on a branch in my fork, so feel free to take inspiration from there as well if needed!