NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] Incorrect cast of string columns containing various infinity notations with trailing spaces #10794

Closed (gerashegalov closed 1 week ago)

gerashegalov commented 1 month ago

The unit test (UT) caught this with an infinity value, but the root cause is the trailing space: the GPU cast returns null where the CPU cast returns Infinity.

Repro:

$ ~/dist/spark-3.3.0-bin-hadoop3/bin/spark-shell \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.test.enabled=true \
  --conf spark.rapids.sql.explain=ALL \
  --jars dist/target/rapids-4-spark_2.12-24.06.0-SNAPSHOT-cuda11.jar 

GPU:

scala> Seq("infinity ", "Infinity ", "+Inf ").toDF.coalesce(1).selectExpr("cast(value as float)").collect()
24/05/10 12:02:57 WARN GpuOverrides: 
*Exec <ProjectExec> will run on GPU
  *Expression <Alias> cast(value#1 as float) AS value#5 will run on GPU
    *Expression <Cast> cast(value#1 as float) will run on GPU
  *Exec <CoalesceExec> will run on GPU
    ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
      @Expression <AttributeReference> value#1 could run on GPU

res0: Array[org.apache.spark.sql.Row] = Array([null], [null], [null])

CPU:

scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> Seq("infinity ", "Infinity ", "+Inf ").toDF.coalesce(1).selectExpr("cast(value as float)").collect()
res2: Array[org.apache.spark.sql.Row] = Array([Infinity], [Infinity], [Infinity])

Feng-Jiang28 commented 3 weeks ago

Fails on a corner case: check_for_inf in cast_string_to_float.cu does not account for trailing whitespace.
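
For illustration only, the expected handling can be modeled in plain Scala. This is a minimal sketch, not the code in cast_string_to_float.cu and not Spark's own implementation: the object InfLiteralSketch and the function parseSpecialFloat are hypothetical names, and the accepted spellings are an assumption based on the CPU results in the repro above. The point it shows is that surrounding whitespace is stripped before the infinity literals are compared.

```scala
import java.util.Locale

// Hypothetical model of a whitespace-tolerant special-literal check.
// It is NOT the plugin's CUDA kernel; names and accepted spellings are assumptions.
object InfLiteralSketch {

  // Returns Some(value) when the input is an infinity or NaN literal,
  // ignoring leading/trailing whitespace and letter case; None otherwise.
  def parseSpecialFloat(s: String): Option[Float] =
    s.trim.toLowerCase(Locale.ROOT) match {
      case "inf" | "+inf" | "infinity" | "+infinity" => Some(Float.PositiveInfinity)
      case "-inf" | "-infinity"                      => Some(Float.NegativeInfinity)
      case "nan"                                     => Some(Float.NaN)
      case _                                         => None
    }

  def main(args: Array[String]): Unit = {
    // The three inputs from the repro, each with a trailing space.
    Seq("infinity ", "Infinity ", "+Inf ").foreach { v =>
      println(s"'$v' -> ${parseSpecialFloat(v)}")
    }
  }
}
```

With the trim in place, all three repro inputs resolve to positive infinity in this model. Dropping the trim makes them fall through to the default case, which mirrors the null results seen on the GPU.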

thirtiseven commented 1 week ago

Fixed in https://github.com/NVIDIA/spark-rapids-jni/pull/2063