NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[BUG] Error messages in failed_job.csv and failed_stages.csv are not fully displayed #1132

Closed: wjxiz1992 closed this 1 week ago

wjxiz1992 commented 1 week ago

Describe the bug
As the title says, some error messages are cut off and only partially displayed in failed_jobs.csv and failed_stages.csv:

appIndex,jobID,jobResult,failureReason
1,7,"JobFailed","java.lang.Exception: Job aborted due to stage failure: Task 462 in stage 10.0 failed 5 times, most r"
1,8,"JobFailed","java.lang.Exception: Job 8 cancelled "
1,9,"JobFailed","java.lang.Exception: Job 9 cancelled "
1,10,"JobFailed","java.lang.Exception: Job 10 cancelled "
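
To confirm the truncation independently, here is a minimal standalone sketch (not part of the tools; it assumes the layout shown above, where failureReason is the last quoted field on each row) that prints the length of each failure reason:

import scala.io.Source

// Naive check of the failureReason column width in failed_jobs.csv.
// Assumes failureReason is the last quoted field, as in the rows above.
object CheckReasonLength {
  def main(args: Array[String]): Unit = {
    val source = Source.fromFile("failed_jobs.csv")
    try {
      source.getLines().drop(1).foreach { line =>
        val reason = line.split(",\"", -1).last.stripSuffix("\"")
        println(s"${reason.length} chars: $reason")
      }
    } finally {
      source.close()
    }
  }
}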

But for jobID=7, the full error message shown in the Spark web UI is much longer (the original report attached a screenshot of the full failure reason).

There seems to be a string size limit when the Profiling Tool captures the error message; from my inspection it is 100 characters. The longest strings I can see in the failed_* files are:

java.lang.RuntimeException: Native split: shuffle writer split failed - Cannot shrink partition buff
Job aborted due to stage failure: Task 462 in stage 10.0 failed 5 times, most recent failure: Lost t

They are each exactly 100 characters long.
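
That uniform length strongly suggests a hard substring cap in the event-processing path. Purely as an illustration (the constant and method names below are assumptions, not the tool's actual code), a cap like this would produce exactly the output above:

// Hypothetical sketch of a 100-character cap; names are illustrative only.
object ReasonTruncation {
  val MaxStringLength = 100

  // A substring cap silently drops the rest of the failure reason.
  def truncated(reason: String): String =
    reason.substring(0, math.min(reason.length, MaxStringLength))

  // The expected behavior: keep the full reason and let the CSV writer
  // handle quoting/escaping instead of capping the length.
  def full(reason: String): String = reason
}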

Steps/Code to reproduce bug
Run the Profiling Tool against the event log of any Spark app that contains failed stages/jobs.

Expected behavior
Full error messages are captured in the CSV files.

winningsix commented 1 week ago

cc @tgravescs @mattahrens, any chance to prioritize this? We want to summarize the reasons for job failures on the customer side.

amahussein commented 1 week ago

Filed a PR to implement the feature request: https://github.com/NVIDIA/spark-rapids-tools/pull/1135