NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS

[BUG] Qualification does not catch unsupported func #1046

Closed: nvliyuan closed this issue 3 months ago

nvliyuan commented 4 months ago

I ran the script below and expected the Qualification tool to catch the unsupported function min_by, but it did not.

from pyspark.sql.functions import min_by

df = spark.createDataFrame([
    ("Java", 2012, 20000), ("dotNET", 2012, 5000),
    ("dotNET", 2013, 48000), ("Java", 2013, 30000)],
    schema=("course", "year", "earnings"))
df.groupby("course").agg(min_by("year", "earnings")).show()
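
The Qualification tool works off Spark event logs, so the script above has to run with event logging enabled. A minimal sketch of the session setup, assuming a local event-log directory (the /tmp/spark-events path is only an example):

# Sketch: enable event logging so the Qualification tool has input to analyze.
# The event-log directory is an assumed example path; point it at your own location.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("min_by-repro")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "/tmp/spark-events")
    .getOrCreate()
)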

Run the tool:

SPARK_HOME=/Users/yuali/Documents/sparks/spark-3.3.1-bin-hadoop3
LOG=/Users/xx/Documents/customer/xx/20240525/testlog
TOOL_JAR=/Users/xx/Documents/sparks/qualification_tool/rapids-4-spark-tools_2.12-24.04.0.jar

java -cp "$TOOL_JAR:$SPARK_HOME/jars/*" \
    com.nvidia.spark.rapids.tool.qualification.QualificationMain -p $LOG

(Quoting the classpath keeps the * literal so the JVM expands the wildcard itself rather than the shell.)

The generated rapids_4_spark_qualification_output_unsupportedOperators.csv is empty.

nartal1 commented 3 months ago

@nvliyuan - I couldn't reproduce this issue. I tried Spark 3.2, 3.3, and 3.4, with tools versions 24.02.05-SNAPSHOT, 24.04.0, and the latest 24.04.1-SNAPSHOT; in all cases min_by is classified as unsupported. Could you please check again?

Output of rapids_4_spark_qualification_output_unsupportedOperators.csv based on the query mentioned in the description:

App ID,SQL ID,Stage ID,ExecId,Unsupported Type,Unsupported Operator,Details,Stage Duration,App Duration,Action
"local-1717711113162",0,2,"0","Exec","HashAggregate","Contains unsupported expr",90,410483,"Triage"
"local-1717711113162",0,2,"0","Expr","min_by","Unsupported",90,410483,"Triage"
"local-1717711113162",0,0,"1","Exec","HashAggregate","Contains unsupported expr",569,410483,"Triage"
"local-1717711113162",0,0,"1","Expr","min_by","Unsupported",569,410483,"Triage"
"local-1717711113162",0,0,"2","Exec","LocalTableScan","Unsupported",569,410483,"IgnorePerf"
"local-1717711113162",0,-1,"3","Exec","AdaptiveSparkPlan","Unsupported",0,410483,"IgnoreNoPerf"

Output of explain in spark-shell:

df.explain
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[course#10], functions=[min_by(year#11, earnings#12)])
   +- Exchange hashpartitioning(course#10, 200), ENSURE_REQUIREMENTS, [plan_id=42]
      +- HashAggregate(keys=[course#10], functions=[partial_min_by(year#11, earnings#12)])
         +- LocalTableScan [course#10, year#11, earnings#12]
nvliyuan commented 3 months ago

Hi @nartal1, I figured out what happened: I ran the query with the RAPIDS plugin enabled. Although the event log looks much like the one from a CPU run, the tool cannot generate the unsupported-operators output from it. Since the Qualification tool is meant for CPU event logs, this issue is invalid; closing. Thanks for investigating.
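
One quick way to tell whether an event log came from a GPU run is to look for the RAPIDS plugin in the logged Spark properties. A rough sketch (the string match is a heuristic rather than an official API, it assumes an uncompressed single-file event log, and the path is only an example):

# Sketch: heuristically detect a RAPIDS (GPU) run from a Spark event log.
# Event logs are newline-delimited JSON; the path below is an assumed example.
import json

def is_gpu_run(event_log_path: str) -> bool:
    with open(event_log_path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("Event") == "SparkListenerEnvironmentUpdate":
                props = event.get("Spark Properties", {})
                return "com.nvidia.spark.SQLPlugin" in props.get("spark.plugins", "")
    return False

print(is_gpu_run("/tmp/spark-events/local-1717711113162"))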