Add all stage metrics to tools output

nartal1 commented 5 days ago

This fixes https://github.com/NVIDIA/spark-rapids-tools/issues/928 and fixes https://github.com/NVIDIA/spark-rapids-tools/issues/1081

This PR adds a new output file (stage_level_all_metrics.csv) that contains all accumulables in the eventlog aggregated at the stage level, including GPU accumulables (if any).

In some cases, there is just one entry in the eventlog for a given accumulator. In such cases, the min, median and max values are assigned the default value of 0 and the actual value is in the total column. This is similar to how we assign default values for sql_metrics_for_application.csv.

We drop accumulators from the output file if no metrics are generated for them.
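The aggregation described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `summarize_updates` and `AccumStats` are hypothetical names, the input is assumed to be the per-task updates of one accumulator within one stage, and `median_high` is used only because it reproduces the "duration" row in the sample output; how the tool actually picks the median is an assumption.

```python
from statistics import median_high
from typing import NamedTuple, Sequence


class AccumStats(NamedTuple):
    """One output row's value columns (hypothetical helper type)."""
    min: int
    median: int
    max: int
    total: int


def summarize_updates(updates: Sequence[int]) -> AccumStats:
    """Aggregate the (assumed) per-task updates of one accumulator in one stage."""
    total = sum(updates)
    if len(updates) <= 1:
        # With a single entry there is no distribution to describe, so
        # min/median/max default to 0 and only the total carries the value.
        return AccumStats(0, 0, 0, total)
    return AccumStats(min(updates), median_high(updates), max(updates), total)


two_tasks = summarize_updates([2694, 4051])
single_entry = summarize_updates([6])
```

With these inputs, `two_tasks` matches the value columns of the "duration" row and `single_entry` matches the "number of input batches" row in the sample output.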

The output format is as below:

appIndex,stageId,accumulatorId,name,min,median,max,total
1,"0",46,"duration",2694,4051,4051,6745
1,"0",48,"number of input batches",0,0,0,6
1,"10",1010,"gpuSemaphoreWait",0,1196,68449,11682104
1,"10",1013,"gpuSpillToDiskTime",229,2252,17319,1682175
1,"10",1015,"gpuReadSpillFromDiskTime",121,1790,9318,2034767
1,"10",1016,"gpuSplitAndRetryCount",0,0,0,1
1,"10",1018,"gpuSpillToHostTime",11,2087,15533,1783900
1,"10",1039,"output rows",177,476370,1376040,1015737409
1,"10",1040,"output columnar batches",177,61421,143354,118334405
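For anyone consuming the file, here is a minimal sketch of reading it with Python's standard `csv` module. The rows are copied from the sample above; `load_stage_metrics` is a hypothetical helper, and keeping `stageId` as a string is a choice made here because the column is quoted in the output.

```python
import csv
import io

# A few rows copied from the sample output above.
SAMPLE = '''appIndex,stageId,accumulatorId,name,min,median,max,total
1,"0",46,"duration",2694,4051,4051,6745
1,"10",1010,"gpuSemaphoreWait",0,1196,68449,11682104
1,"10",1016,"gpuSplitAndRetryCount",0,0,0,1
'''

INT_COLUMNS = ("appIndex", "accumulatorId", "min", "median", "max", "total")


def load_stage_metrics(handle):
    """Parse the CSV into dicts, converting the numeric columns to int."""
    rows = []
    for rec in csv.DictReader(handle):
        for col in INT_COLUMNS:
            rec[col] = int(rec[col])
        rows.append(rec)
    return rows


rows = load_stage_metrics(io.StringIO(SAMPLE))
# stageId is left as a string because it is quoted in the file.
stage_10 = [r for r in rows if r["stageId"] == "10"]
```

To read a real file, pass an open file handle instead of the `io.StringIO` wrapper.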

Added a unit test that simulates an eventlog containing GPU metrics and verifies that they are captured.

nartal1 commented 5 days ago

@amahussein @tgravescs - It would be good to know whether the information in the output file is okay or whether any columns need to be added/removed. Thanks!

amahussein commented 4 days ago

This PR adds a new output file(gpu_metrics_for_application.csv) which has the information of Gpu accumulables(if any) in the eventlog.

Do we want to have a separate CSV file for GPU metrics? This is a stageLevel accumulator view. If we give it a more generic name, then it can be used to dump all other accumulators. CC: @tgravescs Do you have any preferences?

I discussed with Tom offline. Let's remove the filter and make the file generic for all the Accumulators CPU/GPU.

One main fix to make in this PR is to have the Qualification tool generate the view as well.
This PR will also close issue 1081.

nartal1 commented 1 day ago

I triggered the qualification tool. The new file does not get generated under the raw_metrics subdirectory.

Thanks @amahussein! I forgot that we have another path to generate the output files in raw_metrics. Updated the code to generate the CSV file under the raw_metrics directory. PTAL.

Should we drop "zero" rows to reduce the size of the output (rows that have 0-value in all the columns)?

I have updated the code to remove the rows where metrics are not updated at all, i.e., the value of all columns is zero. Will revert it if we want those rows as well.
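For reference, the all-zero check discussed here could look like the sketch below. It is not the code in this PR; `is_all_zero` and `drop_zero_rows` are hypothetical names, and rows are assumed to be dicts keyed by the CSV column names.

```python
VALUE_COLUMNS = ("min", "median", "max", "total")


def is_all_zero(row):
    """True only when every value column of the row is 0."""
    return all(row[col] == 0 for col in VALUE_COLUMNS)


def drop_zero_rows(rows):
    """Keep the rows that carry at least one non-zero value."""
    return [row for row in rows if not is_all_zero(row)]


rows = [
    {"name": "number of input batches", "min": 0, "median": 0, "max": 0, "total": 6},
    {"name": "never updated (made-up row)", "min": 0, "median": 0, "max": 0, "total": 0},
]
kept = drop_zero_rows(rows)
```

Note that a single-entry accumulator such as "number of input batches" survives the filter, because its total is non-zero even though min, median and max default to 0.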

Do we want to list internal.metrics.* in this file? A concern is redundancy: having the same information in multiple files might be confusing.

I was not certain whether we wanted to remove them, since the issue was to dump all the metrics to an output file. @tgravescs - Could you please let us know if we need to keep the internal.* ones? Also, is it okay to remove rows from the output file if all the values are "zeros"?

amahussein commented 16 hours ago

Thanks @nartal1. I discussed this offline with Tom.

For this PR, we can dump everything without filtering any rows. This means:

the rows with all zero-values should be dumped out as well.
the internal metrics should show up too.

Later, I can investigate with Bilal the performance impact of re-aggregating the internal Spark metrics as part of identifying low-hanging fruit for https://github.com/NVIDIA/spark-rapids-tools/issues/367

amahussein commented 16 hours ago

Also, let's change the name from stage_level_all_metrics_for_application.csv to stage_level_all_metrics.csv

NVIDIA / spark-rapids-tools

Add all stage metrics to tools output #1151