NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

Fix test failures in string_test.py #11030

Closed · razajafri closed this 2 months ago

razajafri commented 4 months ago
FAILED ../../../../integration_tests/src/main/python/string_test.py::test_endswith
FAILED ../../../../integration_tests/src/main/python/string_test.py::test_unsupported_fallback_substring_index
mythrocks commented 4 months ago

test_unsupported_fallback_substring_index fails with a legitimate cause:

E               pyspark.errors.exceptions.captured.NumberFormatException: For input string: "rdd_value_2"

The other tests all pass with ANSI mode disabled.
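For reference, a minimal sketch of toggling this locally, assuming a plain PySpark session (the integration tests manage this setting per-test):

from pyspark.sql import SparkSession

# Disable ANSI mode via the standard Spark SQL config key.
spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "false")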

mythrocks commented 3 months ago

This is odd. I can't seem to repro this failure now.

mythrocks commented 3 months ago

I have double-checked my work. These tests don't fail.

I'm closing this. We can reopen this if we see failures in the future.

mythrocks commented 3 months ago

Yep, I think I spoke too soon. Reopening.

mythrocks commented 3 months ago

The problem with .endswith is proving elusive. While it can be reproduced in the test, it occurs only occasionally from the REPL. For a brief while, it could be reproduced simply by adding the plugin jar to the class path (i.e. without even enabling the plugin), which suggested some sort of shading error.

I'm still investigating, but this is proving a time sink.

mythrocks commented 2 months ago

Yep, this is still baffling. Here is the exception:

py4j.protocol.Py4JJavaError: An error occurred while calling o206.endsWith.
: java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.Column.expr()" because "x$1" is null
      at org.apache.spark.sql.Column$.$anonfun$fn$2(Column.scala:77)
      at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:75)
      at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:35)
      at org.apache.spark.sql.Column$.$anonfun$fn$1(Column.scala:77)
      at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84)
      at org.apache.spark.sql.package$.withOrigin(package.scala:111)
      at org.apache.spark.sql.Column$.fn(Column.scala:76)
      at org.apache.spark.sql.Column$.fn(Column.scala:64)
      at org.apache.spark.sql.Column.fn(Column.scala:169)
      at org.apache.spark.sql.Column.endsWith(Column.scala:1078)
      at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)

The trace points into new code in Spark 4.0:

Column {
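  // inputs.map(_.expr) is what throws: endsWith receives a null Column,
  // and calling .expr on it raises the NullPointerException above.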
  UnresolvedFunction(Seq(name), inputs.map(_.expr), isDistinct, ignoreNulls = ignoreNulls)
}

The complaint is that .expr cannot be invoked on the null passed into .endsWith(). (Note that the code sees this as a null Column, not a literal.)

I'm unable to repro this from the command line, and the code also runs through cleanly with a debugger attached.

This is occasionally reproducible from the pyspark shell. The exception is thrown from the Spark CPU path and should not need the plugin to reproduce.

I'm fairly confident that this is a bug in Spark 4 that routes a Python None through as a null Column instead of converting it to a literal.
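A minimal PySpark sketch of the suspected repro (hedged: on Spark 4.0 the failure is only intermittent, and no plugin is required):

from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("abc",)], ["a"])

# On Spark 3.x the Python None is lifted to lit(None) and this simply
# evaluates to NULL; on Spark 4.0 it can reach Column.fn as a null
# Column, producing the NullPointerException above.
df.select(f.col("a").endswith(None)).show()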

mythrocks commented 2 months ago

As for the problem highlighted in test_unsupported_fallback_substring_index, I'm fairly certain this is a bug in code-gen in Spark 4.0. Here's the stack trace:

scala> sql("select SUBSTRING_INDEX('a', '_', num) from mytable ").show(false)
java.lang.NumberFormatException: For input string: "columnartorow_value_0"
  at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
  at java.base/java.lang.Integer.parseInt(Integer.java:668)
  at org.apache.spark.sql.catalyst.expressions.SubstringIndex.$anonfun$doGenCode$29(stringExpressions.scala:1449)
  at org.apache.spark.sql.catalyst.expressions.TernaryExpression.$anonfun$defineCodeGen$3(Expression.scala:869)
  at org.apache.spark.sql.catalyst.expressions.TernaryExpression.nullSafeCodeGen(Expression.scala:888)
  at org.apache.spark.sql.catalyst.expressions.TernaryExpression.defineCodeGen(Expression.scala:868)
  at org.apache.spark.sql.catalyst.expressions.SubstringIndex.doGenCode(stringExpressions.scala:1448)
  at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)

Edit: I have filed https://issues.apache.org/jira/browse/SPARK-48989 against Spark 4.x to track the WholeStageCodeGen NumberFormatException problem. This happens on the CPU, without the plugin's involvement.
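For reference, a PySpark equivalent of the Scala repro above (a sketch, assuming Spark 4.0 on the CPU with whole-stage codegen enabled; mytable and num follow the example above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([(2,)], ["num"]).createOrReplaceTempView("mytable")

# The third argument to SUBSTRING_INDEX is a column rather than a
# literal; codegen appears to splice the generated variable name in
# where an integer literal is expected, hence the NumberFormatException.
spark.sql("select SUBSTRING_INDEX('a', '_', num) from mytable").show(truncate=False)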