NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

Indicate what type of reader (multi-threaded cloud, coalescing, perfile) was used #5899

Open tgravescs opened 2 years ago

tgravescs commented 2 years ago

Is your feature request related to a problem? Please describe.
It would be nice to be able to tell which type of reader was used with Parquet, ORC, etc. If the user specifies one we can see the config, but otherwise we auto-configure based on the inputs. The reason it matters is that the metrics differ slightly depending on which parts are multi-threaded. For instance, with the coalescing reader the buffer time isn't multi-threaded but the actual filesystem read time is, while with the multi-threaded cloud reader the buffer time is multi-threaded.
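
For reference, a minimal sketch of what is visible today (assuming the documented `spark.rapids.sql.format.parquet.reader.type` config key): when the user forces a reader the config tells you which one, but when it is left at the default AUTO the reader actually chosen is not surfaced anywhere.

```scala
// Sketch only: shows what a user can inspect today via config. The key name is
// taken from the plugin docs; values other than AUTO force a specific reader.
import org.apache.spark.sql.SparkSession

object ShowRequestedReaderType {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("reader-type-check").getOrCreate()
    // Defaults to AUTO, in which case the plugin picks a reader from the inputs
    // and the final choice is currently not visible in the UI or profiling tool.
    val requested =
      spark.conf.get("spark.rapids.sql.format.parquet.reader.type", "AUTO")
    println(s"Requested Parquet reader type: $requested")
    spark.stop()
  }
}
```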

The times are technically correct, but depending on what you are comparing against they can be confusing. For instance, with the cloud reader your buffer time can end up higher than your overall task time.

We could also revisit the metrics to try to make them more consistent between readers, but some metric will likely still differ.

It would be nice to see this in the Spark UI and/or in the profiling tool.
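
One possible way to do that, sketched below under the assumption that the scan node's SQL metrics are the right place to surface it (the metric names here are illustrative, not existing plugin metrics): record one 0/1 metric per reader type, so whichever metric is non-zero tells you which reader ran.

```scala
// Sketch only: illustrative metric names, built on Spark's standard SQLMetrics API.
import org.apache.spark.SparkContext
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

object ReaderTypeMetrics {
  // One metric per reader type, registered alongside the scan node's other metrics.
  def create(sc: SparkContext): Map[String, SQLMetric] = Map(
    "usedPerFileReader"       -> SQLMetrics.createMetric(sc, "per-file reader used"),
    "usedCoalescingReader"    -> SQLMetrics.createMetric(sc, "coalescing reader used"),
    "usedMultiThreadedReader" -> SQLMetrics.createMetric(sc, "multi-threaded reader used"))

  // Call once the reader has been selected (either from the user config or from
  // the auto-configuration logic) so the choice shows up next to the scan node.
  def record(metrics: Map[String, SQLMetric], readerType: String): Unit =
    readerType.toUpperCase match {
      case "PERFILE"       => metrics("usedPerFileReader").add(1)
      case "COALESCING"    => metrics("usedCoalescingReader").add(1)
      case "MULTITHREADED" => metrics("usedMultiThreadedReader").add(1)
      case _               => // unknown type: leave every metric at zero
    }
}
```

Encoding it as metrics would also let the profiling tool read the same information from the event log, which matches the "Spark UI and/or profiling tool" ask above.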

mattahrens commented 2 years ago

Confirmed with @tgravescs that this would be a general enhancement (not specific to the tools) to surface which reader was used. The tools may need a separate enhancement to pick up the update, but that can be determined after this implementation is complete.

abellina commented 2 years ago

I believe we should try to normalize the metrics so that they mean the same thing across readers. We can probably add a note in our documentation on how we did it, but when I look at bufferTime I really expect to see the time the task spent buffering, or, in the case of a backgrounded process, I think a compromise may be the amount of time blocked on buffering (not unlike fetch wait time for shuffle).
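
For what it's worth, a minimal sketch of that compromise (the class and names are illustrative, not the plugin's actual code): charge the task only for the time it is blocked waiting on a buffered result, not for the time the background threads spend reading in parallel.

```scala
// Sketch only: measures "time blocked on buffering" in the task thread,
// analogous to shuffle fetch wait time, rather than total background read time.
import java.util.concurrent.LinkedBlockingQueue

class BlockedOnBufferTimer {
  private val bufferedBatches = new LinkedBlockingQueue[Array[Byte]]()
  // Only ever updated from the task thread, so a plain accumulator is enough here.
  private var blockedOnBufferNs: Long = 0L

  def blockedTimeNs: Long = blockedOnBufferNs

  // Called from the task thread: only the wait for the next ready buffer is
  // charged, so the metric can never exceed the task's own wall-clock time.
  def nextBuffer(): Array[Byte] = {
    val start = System.nanoTime()
    val buf = bufferedBatches.take()
    blockedOnBufferNs += System.nanoTime() - start
    buf
  }

  // Called from the background reader threads once a buffer has been filled.
  def offer(buf: Array[Byte]): Unit = bufferedBatches.put(buf)
}
```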

abellina commented 2 years ago

We are not sure all the metrics are clear now, so let's reopen this.