NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

Add unspill hit rate as a task level metric. #7670

Open abellina opened 1 year ago

abellina commented 1 year ago

Upon insertion into the spill framework, a buffer provides a spill priority and a spill metrics callback. The priority dictates the order in which buffers are spilled. When a spill occurs, the callback is invoked to update metrics.
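To make the mechanism concrete, here is a minimal sketch of "priority plus metrics callback on insertion". It is a hypothetical model, not the spark-rapids API: `SpillFrameworkSketch`, `TrackedBuffer`, `add` and `spill` are illustrative names. It also assumes that a lower priority value means the buffer is spilled earlier, and that the callback receives the number of bytes spilled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.LongConsumer;

// Hypothetical model of the spill framework described in the issue.
// Class and method names are illustrative only, not the spark-rapids API.
public class SpillFrameworkSketch {

    /** A registered buffer: its size, its spill priority, and a metrics callback. */
    static final class TrackedBuffer {
        final String id;
        final long sizeBytes;
        final long spillPriority;   // assumption: lower value = spilled earlier
        final LongConsumer onSpill; // assumption: called with the bytes spilled

        TrackedBuffer(String id, long sizeBytes, long spillPriority, LongConsumer onSpill) {
            this.id = id;
            this.sizeBytes = sizeBytes;
            this.spillPriority = spillPriority;
            this.onSpill = onSpill;
        }
    }

    // Spillable buffers ordered by priority (smallest priority value at the head).
    private final PriorityQueue<TrackedBuffer> spillable =
        new PriorityQueue<>((a, b) -> Long.compare(a.spillPriority, b.spillPriority));

    /** Insertion: the caller supplies the priority and the metrics callback. */
    void add(TrackedBuffer buffer) {
        spillable.add(buffer);
    }

    /** Spill buffers in priority order until at least targetBytes are freed. */
    List<String> spill(long targetBytes) {
        List<String> spilled = new ArrayList<>();
        long freed = 0;
        while (freed < targetBytes && !spillable.isEmpty()) {
            TrackedBuffer victim = spillable.poll();
            freed += victim.sizeBytes;
            victim.onSpill.accept(victim.sizeBytes); // let the owner update its metrics
            spilled.add(victim.id);
        }
        return spilled;
    }

    public static void main(String[] args) {
        long[] spilledBytes = {0};
        LongConsumer metric = bytes -> spilledBytes[0] += bytes;
        SpillFrameworkSketch fw = new SpillFrameworkSketch();
        fw.add(new TrackedBuffer("a", 100, 5, metric));
        fw.add(new TrackedBuffer("b", 200, 1, metric));
        fw.add(new TrackedBuffer("c", 300, 3, metric));
        // Spills "b" then "c" (lowest priorities first) and prints: [b, c] 500
        System.out.println(fw.spill(250) + " " + spilledBytes[0]);
    }
}
```

The callback decouples the framework from any particular metric. The framework only decides *what* to spill, and each owner decides *what to record*.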

This task is to define and implement the metrics we want, and to decide at what granularity to collect them.

revans2 commented 1 year ago

Most of the metrics were done as part of https://github.com/NVIDIA/spark-rapids/pull/7935. All that is left from this issue is "unspill hit rate", but that probably depends on us having unspill enabled by default.

abellina commented 7 months ago

I think we should consider counting not just "unspill" but also simply materializing a disk/host buffer as a +1. This would allow us to find operators that are stuck re-reading spilled buffers. I can think of join doing this; I'm not sure about others.
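One possible shape for such a per-task metric is sketched below. The thread does not define "hit rate" precisely, so this sketch makes two assumptions. A *hit* is an access that finds the buffer already on the device. A *miss* is any access that has to materialize the buffer from host or disk, and each miss counts as the suggested "+1". It also tracks misses per buffer so that buffers materialized more than once, such as a join's build side being re-read, stand out. `UnspillMetricSketch`, `Tier`, `recordAccess`, `hitRate` and `repeatedlyMaterialized` are hypothetical names, not spark-rapids APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-task metric. Names and the hit/miss definition are assumptions.
public class UnspillMetricSketch {

    enum Tier { DEVICE, HOST, DISK }

    private long deviceHits = 0;        // accesses served directly from device memory
    private long materializations = 0;  // accesses that had to read from HOST or DISK
    private final Map<String, Integer> materializedCount = new HashMap<>();

    /** Record one access to a buffer together with the tier it was found in. */
    void recordAccess(String bufferId, Tier tier) {
        if (tier == Tier.DEVICE) {
            deviceHits++;
        } else {
            materializations++; // the "+1" for materializing a disk/host buffer
            materializedCount.merge(bufferId, 1, Integer::sum);
        }
    }

    /** Fraction of accesses that did not need to materialize a spilled copy (0.0 if none). */
    double hitRate() {
        long total = deviceHits + materializations;
        return total == 0 ? 0.0 : (double) deviceHits / total;
    }

    /** Buffers materialized from host/disk more than once: possible "stuck re-reading" cases. */
    Map<String, Integer> repeatedlyMaterialized() {
        Map<String, Integer> result = new HashMap<>();
        materializedCount.forEach((id, n) -> {
            if (n > 1) {
                result.put(id, n);
            }
        });
        return result;
    }

    public static void main(String[] args) {
        UnspillMetricSketch m = new UnspillMetricSketch();
        m.recordAccess("build-side", Tier.DISK);
        m.recordAccess("build-side", Tier.HOST);
        m.recordAccess("stream-batch", Tier.DEVICE);
        m.recordAccess("build-side", Tier.DISK);
        // 1 hit out of 4 accesses; "build-side" was materialized 3 times.
        // Prints: 0.25 {build-side=3}
        System.out.println(m.hitRate() + " " + m.repeatedlyMaterialized());
    }
}
```

A single ratio hides *which* operator is re-reading, which is why this sketch keeps per-buffer counts alongside the aggregate.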

abellina commented 2 months ago

I'd like to piggyback on the spill work I am doing now in https://github.com/NVIDIA/spark-rapids/issues/7709. This metric will be nice to add, but I think it should follow the spill work.