NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

Qualification tool - Handle cancelled jobs and stages better and don't skip the app #1033

Closed tgravescs closed 1 month ago

tgravescs commented 1 month ago

Fixes https://github.com/NVIDIA/spark-rapids-tools/issues/1032

I ran an event log through the qualification tool and it was labelled as not applicable because it had failed stages. Those failed stages, though, were cancelled by AQE runs rather than failing on their own.

We should take this into account in the qual tool.

The failure reasons in the tasks show up as: Stage cancelled...
The stage failure reason shows: Job 243 cancelled

Tool output:
24/05/23 10:00:26 WARN QualificationEventProcessor: SQL execution id 47 had failures, skipping
24/05/23 10:00:26 WARN QualificationEventProcessor: SQL execution id 125 had failures, skipping

This PR fixes that by looking for "cancelled" in the failure messages and no longer counting those cases as failures.
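As a rough illustration of the idea only (not the actual code in this PR), the check can be sketched as below. The names `StageResult`, `is_cancellation`, and `has_real_failures` are hypothetical, and the sketch assumes the failure reason is available as a plain string like the two examples quoted above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StageResult:
    """Hypothetical stand-in for a stage parsed from an event log."""
    stage_id: int
    failure_reason: Optional[str]  # None when the stage completed successfully


def is_cancellation(reason: Optional[str]) -> bool:
    """Return True when a failure reason contains the word 'cancelled'.

    Assumption: a case-insensitive substring match is enough, based on the
    two messages seen in the log ("Stage cancelled...", "Job 243 cancelled").
    """
    return reason is not None and "cancelled" in reason.lower()


def has_real_failures(stages: Iterable[StageResult]) -> bool:
    """Return True only if some stage failed for a reason other than cancellation."""
    return any(
        s.failure_reason is not None and not is_cancellation(s.failure_reason)
        for s in stages
    )
```

With a check like this, a SQL execution whose only "failed" stages were cancelled would no longer be skipped, while one with a genuine failure still would be.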

I tested this on a customer event log and it is working. We still need to put that event log into our integration tests.

tgravescs commented 1 month ago

They should be separate. When I looked briefly at the profiling tool, I saw that it's outputting failed jobs to files. We still want to do that, as that is how Spark shows them. I didn't look at all the rollups, though, to see where they are affected. Again, that's a separate issue, which I don't think is as important.