NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[BUG] Improve log message when qual tool could not access an eventlog #1280

Closed: kuhushukla closed this issue 2 months ago

kuhushukla commented 2 months ago

Describe the bug: If a file (say, on an HDFS file system) is not accessible by the tool, the verbose output shows the warning below but does not propagate the error message from the underlying exception. The same applies to other exceptions as well, since the catch is for the generic Exception.

Steps/Code to reproduce bug: Run the qual tool on an eventlog that is not accessible (for example, one the user has no permissions for).

spark_rapids qualification --eventlogs= hdfs:/nn:8020/my-app-eventlog
 WARN UnknownAppResult: File: hdfs:/nn:8020/my-app-eventlog, Message: AccessControlException: Got unexpected exception processing file: hdfs://nn:8020/my-app-eventlog

Expected behavior: Add more info to the verbose log line. If this seems excessive we can skip it, but I found the output useful while debugging, and it also brings more attention to the warning.

parthosa commented 2 months ago

#1187 and #1235 resolved the case where the tools would not show any output. Now we show the number of apps that were provided, the number successfully processed, and the number that are top candidates. Additionally, the status CSV was updated to store the exact cause of the error (if any) for each app/file provided.
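For anyone triaging from the status CSV rather than the console, the failure reasons can be pulled out with a few lines of Python. This is only a minimal sketch: the header names (`Event Log`, `Status`, `AppID`, `Description`) are assumed from the status file shown further down in this comment, the sample row mirrors that same file, and `failed_entries` is a hypothetical helper, not part of the tools.

```python
import csv
import io

# Sample content mirroring the status file in this thread.
# The exact header names in the real CSV are an assumption.
SAMPLE = (
    "Event Log,Status,AppID,Description\n"
    'hdfs:/nn:8020/my-app-eventlog,FAILURE,N/A,'
    '"Incomplete HDFS URI, no host: hdfs:/nn:8020/my-app-eventlog"\n'
)


def failed_entries(csv_file):
    """Return (event log, description) pairs for rows marked FAILURE."""
    reader = csv.DictReader(csv_file)
    return [
        (row["Event Log"], row["Description"])
        for row in reader
        if row["Status"].strip().upper() == "FAILURE"
    ]


if __name__ == "__main__":
    for event_log, reason in failed_entries(io.StringIO(SAMPLE)):
        print(f"{event_log}: {reason}")
```

To run it against a real report, open the status file with `open(path, newline="")` and pass the file object instead of the in-memory sample.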

CMD:

spark_rapids qualification --platform onprem --eventlogs hdfs:/nn:8020/my-app-eventlog --tools_jar $SPARK_RAPIDS_TOOLS_JAR

Output

Tools Version: latest dev

Console

    - Application status report: /Users/psarthi/Work/tools-run/qual_20240812213733_3EaE3EdF/rapids_4_spark_qualification_output/rapids_4_spark_qualification_output_status.csv

Qualification tool found no successful applications to process.

Report Summary:
----------------------  -
Total applications      1
Processed applications  0
Top candidates          0
----------------------  -

Status File

File: qual_2024xxx/rapids_4_spark_qualification_output/rapids_4_spark_qualification_output_status.csv

|-------------------------------|---------|-------|-------------------------------------------------------------|
| Event Log                     | Status  | AppID | Description                                                 |
|-------------------------------|---------|-------|-------------------------------------------------------------|
| hdfs:/nn:8020/my-app-eventlog | FAILURE | N/A   | Incomplete HDFS URI, no host: hdfs:/nn:8020/my-app-eventlog |
|-------------------------------|---------|-------|-------------------------------------------------------------|
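The "no host" description follows from the single slash after `hdfs:` in the path passed on the command line. A rough illustration, using Python's standard `urllib.parse` rather than the Hadoop path parsing the tool itself relies on (which may differ in details); `has_host` is a hypothetical helper:

```python
from urllib.parse import urlparse


def has_host(uri: str) -> bool:
    """True when the URI carries an authority part with a host name."""
    return bool(urlparse(uri).hostname)


single_slash = "hdfs:/nn:8020/my-app-eventlog"   # as passed in the command above
double_slash = "hdfs://nn:8020/my-app-eventlog"  # scheme://host:port/path form

if __name__ == "__main__":
    for uri in (single_slash, double_slash):
        parts = urlparse(uri)
        print(f"{uri} -> host={parts.hostname!r}, path={parts.path!r}")
```

With the single slash, `nn:8020` ends up as part of the path and the host is empty; with `hdfs://`, it is parsed as host `nn` and port `8020`.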

Could you test on the latest dev branch and let us know if the issue still persists?

kuhushukla commented 2 months ago

@parthosa my changes are against the dev branch. I'm not sure my change corresponds to the status file you mentioned. When we print verbose output and call out the exception, we should give info on how to triage it by simply including the stack trace.
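To make the request concrete, here is the general pattern sketched in Python. It is not the tool's actual code (the WARN line above comes from the tool itself, which this does not reproduce); the logger name, `process_eventlog`, and the failing reader are all hypothetical. The point is only that a broad catch can still log the exception type, its message, and the stack trace.

```python
import logging

logger = logging.getLogger("qual_tool_sketch")  # hypothetical logger name


def process_eventlog(path, reader):
    """Run reader(path); on any failure, log type, message, and stack trace."""
    try:
        return reader(path)
    except Exception as exc:  # broad catch, as described in this issue
        logger.warning(
            "File: %s, Message: %s: %s",
            path,
            type(exc).__name__,
            exc,
            exc_info=True,  # attaches the full traceback to the log record
        )
        return None


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)

    def deny(path):
        # Stand-in for an inaccessible file; not a real HDFS call.
        raise PermissionError(f"Permission denied: {path}")

    process_eventlog("hdfs://nn:8020/my-app-eventlog", deny)
```

Whether the full stack belongs at WARN level or only in verbose mode is a judgment call; the sketch just shows that including it costs one argument.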