NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] Create a nightly CI test pipeline for force.caller.classloader=false #5703

Open gerashegalov opened 2 years ago

gerashegalov commented 2 years ago

Describe the bug
On a handful of occasions we needed to resort to setting spark.rapids.force.caller.classloader to the non-default value false as a workaround for bugs. However, we only have a smoke test enabled for this configuration, and without a full test pipeline run against this config we have incurred a few regressions over time.

Steps/Code to reproduce bug
Manually run pytest-xdist with a pseudo-distributed standalone local-cluster via:

TEST_PARALLEL=2 \
  PYSP_TEST_spark_rapids_force_caller_classloader=false \
  NUM_LOCAL_EXECS=2 \
  ./integration_tests/run_pyspark_from_build.sh -k test_explain_set_config

and observe failures like:

py4j.protocol.Py4JJavaError: An error occurred while calling z:com.nvidia.spark.rapids.ExplainPlan.explainPotentialGpuPlan.
  : java.lang.NoClassDefFoundError: com/nvidia/spark/rapids/GpuOverrides$
Full stacktrace:
E                       at com.nvidia.spark.rapids.ExplainPlanImpl.explainPotentialGpuPlan(GpuOverrides.scala:4196)
E                       at com.nvidia.spark.rapids.ExplainPlan$.explainPotentialGpuPlan(ExplainPlan.scala:65)
E                       at com.nvidia.spark.rapids.ExplainPlan.explainPotentialGpuPlan(ExplainPlan.scala)
E                       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
E                       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
E                       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E                       at java.lang.reflect.Method.invoke(Method.java:498)
E                       at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
E                       at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
E                       at py4j.Gateway.invoke(Gateway.java:282)
E                       at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
E                       at py4j.commands.CallCommand.execute(CallCommand.java:79)
E                       at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
E                       at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
E                       at java.lang.Thread.run(Thread.java:748)
E                   Caused by: java.lang.ClassNotFoundException: com.nvidia.spark.rapids.GpuOverrides$
E                       at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
E                       at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
E                       at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
E                       at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
E                       ... 15 more
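For context, a minimal sketch of how that env var takes effect, assuming the usual PYSP_TEST_* convention of run_pyspark_from_build.sh (underscores after the prefix map to dots in a Spark conf key); the master string and memory figure below are illustrative:

# Hedged sketch: the env var above should amount to launching the test
# application with the conf set explicitly, roughly:
$SPARK_HOME/bin/pyspark \
  --master 'local-cluster[2,1,1024]' \
  --conf spark.rapids.force.caller.classloader=false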

Expected behavior
We should not hit exceptions with supported options. Unfortunately, this is one of the options that cannot be covered via test parametrization, because it needs to be applied before the pytest Spark application is launched. More generally, it may take an epic to identify more pre-pytest-launch settings like this.
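To make the pre-launch constraint concrete, a minimal sketch (assuming the PYSP_TEST_* convention above): the variable must be in the environment before run_pyspark_from_build.sh builds and launches the Spark application, so it cannot be flipped per-test from inside pytest the way an ordinary parameter can:

# Works: set before the harness launches the Spark app
export PYSP_TEST_spark_rapids_force_caller_classloader=false
./integration_tests/run_pyspark_from_build.sh
# Does not work: by the time a pytest parametrization runs, the app and its
# classloader setup already exist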

Environment details
local-cluster, or a Standalone cluster anywhere

Additional context

#5646

GaryShen2008 commented 2 years ago

@gerashegalov Does this issue request running all the integration test cases with the config PYSP_TEST_spark_rapids_force_caller_classloader=false?

pxLi commented 2 years ago

Sorry, I am confused. Is this a bug report or a feature request? If the latter, can you elaborate on the ENVs? For example, pytest-xdist + local-cluster, or vanilla IT against a standalone cluster? Can you share the expected commands for the different scenarios? And do we need to run all IT cases, or just some specific cases, across multiple Spark shims?

Also, we need some detailed ENV combinations to do the resource planning, thanks.

gerashegalov commented 2 years ago

> @gerashegalov Does this issue request running all the integration test cases with the config PYSP_TEST_spark_rapids_force_caller_classloader=false?

Yes, it should be parametrized at the Jenkins level because we cannot do it at the pytest level. We should try to make this support as generic as possible because there can be more settings like this.
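As an illustration only, a hypothetical wrapper for such Jenkins-level parametrization (the PRE_LAUNCH_MATRIX name and the script itself are made up for this sketch, not an existing job):

#!/usr/bin/env bash
set -ex
# Hypothetical nightly matrix over pre-launch settings; each entry is a set
# of env assignments injected before the full suite is launched.
PRE_LAUNCH_MATRIX=(
  ""                                                       # default config
  "PYSP_TEST_spark_rapids_force_caller_classloader=false"  # workaround config
)
for entry in "${PRE_LAUNCH_MATRIX[@]}"; do
  # $entry is intentionally unquoted so an empty entry expands to nothing
  env $entry TEST_PARALLEL=2 NUM_LOCAL_EXECS=2 \
      ./integration_tests/run_pyspark_from_build.sh
done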

gerashegalov commented 2 years ago

> Sorry, I am confused. Is this a bug report or a feature request?

It is a bug in the sense that we left this feature without continuous testing and it was broken by later PRs.

> If the latter, can you elaborate on the ENVs? For example, pytest-xdist + local-cluster, or vanilla IT against a standalone cluster? Can you share the expected commands for the different scenarios? And do we need to run all IT cases, or just some specific cases, across multiple Spark shims?
>
> Also, we need some detailed ENV combinations to do the resource planning, thanks.

Ideally, I would like another instance of all the tests we have, with the ENV PYSP_TEST_spark_rapids_force_caller_classloader=false injected, run at whatever cadence capacity allows, but no less than weekly.
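For the cadence floor, a hypothetical crontab entry (the script path is illustrative) that would run the extra suite weekly, Sundays at 03:00:

# Hypothetical crontab line: minute hour day-of-month month day-of-week
0 3 * * 0 /opt/ci/run_force_caller_classloader_suite.sh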