NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[FEA] Add full table name in data_source_information table #1172

Closed wjxiz1992 closed 2 weeks ago

wjxiz1992 commented 4 months ago

Is your feature request related to a problem? Please describe. We are generating data according to the information provided in data_source_information. The current table doesn't contain a column showing the full table name of the data source, so we have to go to the web UI to check the full table name.

Requiring functionality from: Profiling Tool

Reproduce step: run the Profiling Tool against one event log file:

```shell
java -cp <PATH_TO>/rapids-4-spark-tools_2.12-24.06.2-SNAPSHOT.jar:$SPARK_HOME/jars/* com.nvidia.spark.rapids.tool.profiling.ProfileMain --csv <PATH_TO_EVENTLOG>
```

Why this is high priority: (not sure if I should state this here, let me know if I should delete it) We got 30 queries from a customer, and we need to run them locally to test our software (spark-rapids). Currently we go to the web UI or the event log to get the table names. It would help if they were shown directly in the Profiling Tool's output CSV.

Describe the solution you'd like Add a column that shows the full table name.

Describe alternatives you've considered None

Additional context None

wjxiz1992 commented 4 months ago

cc @winningsix for visibility. This is helpful for our high priority task to add more query benchmarks.

amahussein commented 4 months ago

Investigated this on one of the internal tickets. The tools do not truncate the schema; it is truncated by Spark's internal classes. AQE updates the PlanInfo, replacing the old planInfo (which contains the full metadata.schema) with one whose metadata/schema field values are empty or truncated.
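To make the behavior described above concrete, here is a minimal sketch (not the tools' actual implementation) of how an event-log replay could preserve the full schema: record the metadata.schema from the first SparkPlanInfo seen for a SQL ID, so later AQE plan updates carrying an empty or truncated schema don't overwrite it. The event shapes below are simplified stand-ins for the real Spark listener events.

```python
# Simplified sketch: keep the first non-empty schema per sqlID while
# replaying plan-update events; AQE updates with "" must not clobber it.

def replay(events):
    """Return sqlID -> first non-empty schema seen for that SQL plan."""
    schemas = {}
    for ev in events:
        sql_id = ev["sqlID"]
        schema = ev["planInfo"]["metadata"].get("schema", "")
        # Only record a schema the first time a non-empty value appears.
        if schema and sql_id not in schemas:
            schemas[sql_id] = schema
    return schemas

events = [
    # Initial plan: full schema is present.
    {"sqlID": 0, "planInfo": {"metadata": {"schema": "a:int,b:string"}}},
    # AQE re-plan: Spark emitted an empty/truncated schema.
    {"sqlID": 0, "planInfo": {"metadata": {"schema": ""}}},
]

print(replay(events))  # {0: 'a:int,b:string'}
```

The key design point is ordering: the pre-AQE plan arrives first in the event log, so a first-write-wins policy retains the untruncated metadata.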

tgravescs commented 4 months ago

@wjxiz1992 in the future please be sure to add information about what tool you are requesting functionality from (profiling/qualification/other). If possible, add details about how you are running the tool and a reproduce case. Also please add why this is high priority - i.e., what it's going to be used for.

wjxiz1992 commented 4 months ago

> @wjxiz1992 in the future please be sure to add information about what tool you are requesting functionality from (profiling/qualification/other). If possible, add details about how you are running the tool and a reproduce case. Also please add why this is high priority - i.e., what it's going to be used for.

Sure thanks for the suggestion. Updated the issue description.

amahussein commented 1 month ago

@wjxiz1992
This problem seems to be more complicated than initially thought. Since Spark truncates the metadata in the new AdaptivePlan, the full schema will be missing for those SQLPlans. It is even more difficult considering that nodeNames might change and that sqlMetrics need to be mapped correctly.
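One way to picture the mapping problem described above: since node names may change after AQE, nodes could instead be matched between the original and updated plans by the accumulator IDs of their sqlMetrics, which are stable, and the full schema carried over from the matched original node. This is a hypothetical sketch under simplified assumptions about the node/metric shapes, not the tools' real data model.

```python
# Hypothetical sketch: match plan nodes by their sqlMetrics accumulator
# IDs (stable across AQE re-plans), then patch missing schemas in the
# updated plan from the original plan.

def carry_over_schema(original_nodes, updated_nodes):
    # Index original nodes by the set of their metric accumulator IDs.
    by_accums = {
        frozenset(m["accumulatorId"] for m in node["metrics"]): node
        for node in original_nodes
    }
    merged = []
    for node in updated_nodes:
        key = frozenset(m["accumulatorId"] for m in node["metrics"])
        src = by_accums.get(key)
        patched = dict(node)
        # Fill in the schema only when the updated node lost it.
        if src and not patched.get("schema"):
            patched["schema"] = src.get("schema", "")
        merged.append(patched)
    return merged

orig = [{"name": "Scan parquet db.table1",
         "schema": "a:int,b:string",
         "metrics": [{"accumulatorId": 7}]}]
# AQE renamed the node and dropped the schema, but metrics are stable.
upd = [{"name": "Scan parquet",
        "schema": "",
        "metrics": [{"accumulatorId": 7}]}]

print(carry_over_schema(orig, upd)[0]["schema"])  # a:int,b:string
```

Matching on metric accumulator IDs sidesteps the nodeName instability the comment mentions, though it assumes each node's metric set is unique within a plan.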

I am working on it.