Open andygrove opened 2 months ago
Hi, I'd like to contribute to this!
Thanks @kevinmingtarja.
You can take a look at @andygrove's PR as a reference https://github.com/apache/datafusion-comet/pull/307
Current state for reference:
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val inputs = Seq("", "0", "1", "+1.0", ".34", "-10.0", "4e7").toDF("n")
inputs: org.apache.spark.sql.DataFrame = [n: string]
scala> inputs.write.mode("overwrite").parquet("test.parquet")
24/04/28 11:46:51 INFO src/lib.rs: Comet native library initialized
scala> val df = spark.read.parquet("test.parquet")
df: org.apache.spark.sql.DataFrame = [n: string]
scala> val df2 = df.withColumn("converted", col("n").cast(DataTypes.createDecimalType(10, 2)))
df2: org.apache.spark.sql.DataFrame = [n: string, converted: decimal(10,2)]
scala> df2.show
+-----+---------+
| n|converted|
+-----+---------+
|-10.0| -10.00|
| +1.0| 1.00|
| .34| 0.34|
| 4e7| null|
| 1| 1.00|
| 0| 0.00|
| | 0.00|
+-----+---------+
scala> spark.conf.set("spark.comet.enabled", false)
scala> df2.show
+-----+-----------+
| n| converted|
+-----+-----------+
|-10.0| -10.00|
| +1.0| 1.00|
| .34| 0.34|
| 4e7|40000000.00|
| 1| 1.00|
| 0| 0.00|
| | null|
+-----+-----------+
Note: I encountered a java.lang.AssertionError: assertion failed: Decimal$DecimalIsFractional
on the second df2.show
, but it seems like a known issue and can safely be ignored: https://kb.databricks.com/scala/decimal-is-fractional-error
Hi @kevinmingtarja, are you working on this issue? If not, I would like to work on it. Thank you.
Hi @kevinmingtarja, are you working on this issue? If not, I would like to work on it. Thank you.
Hey, i don't think i have the bandwidth rn to complete this, so please feel free to work on it. I have made some progress here on a branch in my fork, so feel free to take inspirations from there as well if needed!
What is the problem the feature request solves?
What is the problem the feature request solves?
We currently delegate to DataFusion when casting from string to decimal and there are some differences in behavior compared to Spark.
4e7
produces40000000.00
in Spark, andnull
in DataFusion.
,-
,+
and empty string producenull
in Spark, and0.0
in DataFusion0
produces0
in Spark, andnull
in DataFusionCannot cast string to decimal with negative scale
). We could choose to fallback to Spark for this use case (or ifSQLConf.LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED
is enabled)Describe the potential solution
No response
Additional context
I used the following test in
CometCastSuite
to explore this.Describe the potential solution
No response
Additional context
No response