Open nartal1 opened 5 days ago
@amahussein @tgravescs - It would be good to know whether the information in the output file is okay or whether any columns need to be added or removed. Thanks!
This PR adds a new output file (gpu_metrics_for_application.csv) which has the information of GPU accumulables (if any) in the eventlog.
Do we want to have a separate CSV file for GPU metrics? This is a stage-level accumulator view. If we give it a more generic name, it can be used to dump all other accumulators as well. CC: @tgravescs Do you have any preferences?
I discussed with Tom offline. Let's remove the filter and make the file generic for all accumulators, both CPU and GPU.
I triggered the qualification tool. The new file is not generated under the raw_metrics subdirectory.
Thanks @amahussein! I forgot that we have another path that generates the output files in raw_metrics. Updated the code to generate the CSV file under the raw_metrics directory. PTAL.
Should we drop "zero" rows to reduce the size of the output (rows that have 0-value in all the columns)?
I have updated the code to remove rows where the metrics are not updated at all, i.e., the value of every column is zero. I will revert it if we want those rows as well.
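For illustration, the all-zero row filter discussed here could look roughly like the sketch below. It is written in Python for brevity and is not the tool's actual code; the function name, the row shape (a list of dicts), and the column names are all hypothetical.

```python
def drop_all_zero_rows(rows, metric_columns):
    """Keep only rows where at least one metric column is non-zero.

    `rows` is a list of dicts and `metric_columns` names the numeric
    columns to inspect. Both shapes are assumptions for this sketch.
    """
    return [
        row for row in rows
        if any(row[col] != 0 for col in metric_columns)
    ]


# Hypothetical stage-level rows; column names are illustrative only.
rows = [
    {"stageId": 1, "accumulatorId": 10, "min": 0, "median": 0, "max": 0, "total": 0},
    {"stageId": 1, "accumulatorId": 11, "min": 0, "median": 0, "max": 0, "total": 42},
]
kept = drop_all_zero_rows(rows, ["min", "median", "max", "total"])
```

Only the identifier columns are excluded from the check, so a row survives as soon as any one metric column carries a value.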
Do we want to list internal.metrics.* in this file? One concern is redundancy: having the same information in multiple files might be confusing.
I was not certain whether we wanted to remove it, since the issue was to dump all the metrics to an output file. @tgravescs - Could you please let us know if we need to keep the internal.* ones? Also, is it okay to remove rows from the output file if all the values are zeros?
Thanks @nartal1. I discussed this offline with Tom.
For this PR, we can dump everything without filtering any rows. This means:
Later, I can investigate with Bilal the performance impact of re-aggregating the internal Spark metrics as part of identifying low-hanging fruit for https://github.com/NVIDIA/spark-rapids-tools/issues/367
Also, let's change the name from stage_level_all_metrics_for_application.csv to stage_level_all_metrics.csv
This fixes https://github.com/NVIDIA/spark-rapids-tools/issues/928 and fixes https://github.com/NVIDIA/spark-rapids-tools/issues/1081
This PR adds a new output file (stage_level_all_metrics.csv) which contains all accumulables in the eventlog aggregated at the stage level, including GPU accumulables (if any).
In some cases, there is just one entry in the eventlog for a given accumulator. In such cases, the min, median, and max values are assigned a default value of 0, and the actual value is in the total column. This is similar to how we assign default values for sql_metrics_for_application.csv. We drop those accumulators from the output file if no metrics are generated.
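The aggregation rule described above can be sketched as follows. This is a minimal Python illustration of the stated behavior, not the tool's implementation; the function name and the input shape (a plain list of update values for one accumulator in one stage) are assumptions.

```python
from statistics import median


def aggregate_stage_accumulator(updates):
    """Summarize one accumulator's update values for a single stage.

    `updates` is a list of numeric values (hypothetical input shape).
    Returns None when there are no updates, so the caller can drop
    the accumulator from the output.
    """
    if not updates:
        return None
    total = sum(updates)
    if len(updates) == 1:
        # Single entry: min/median/max default to 0 and the actual
        # value is reported only in the total column.
        return {"min": 0, "median": 0, "max": 0, "total": total}
    return {
        "min": min(updates),
        "median": median(updates),
        "max": max(updates),
        "total": total,
    }


single = aggregate_stage_accumulator([5])
several = aggregate_stage_accumulator([1, 2, 6])
empty = aggregate_stage_accumulator([])
```

Returning None for an empty list keeps the "drop if no metrics are generated" decision with the caller, which can simply skip such accumulators when writing rows.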
The output format is as below:
Added a unit test that simulates an eventlog containing GPU metrics and verifies that they are captured.