abellina opened 1 year ago
Most of the metrics were done as part of https://github.com/NVIDIA/spark-rapids/pull/7935. All that is left from this is "unspill hit rate", but that probably depends on us having unspill enabled by default.
I think we should count not just "unspill" but also simply materializing a disk/host buffer as a +1. This would let us find operators that are stuck re-reading spilled buffers. I can think of join doing this; I'm not sure about others.
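As a rough sketch of what I mean (the names here are made up for illustration, not actual spark-rapids types), the metric could be a per-operator counter bumped each time a spilled buffer is materialized back from host memory or disk:

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical sketch of the "unspill hit" idea: count every time a spilled
// buffer is brought back from host memory or disk, so operators that keep
// re-reading spilled data (e.g. a join) show up with a high count.
class UnspillMetrics {
  private val hostReads = new AtomicLong(0)
  private val diskReads = new AtomicLong(0)

  // Called whenever a spilled buffer is materialized back on the GPU.
  def recordMaterialize(fromDisk: Boolean): Unit = {
    val counter = if (fromDisk) diskReads else hostReads
    counter.incrementAndGet()
  }

  def totalHits: Long = hostReads.get() + diskReads.get()
}
```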
I'd like to piggyback on the spill work I am doing now with https://github.com/NVIDIA/spark-rapids/issues/7709. This metric would be nice to add, but I think it should follow the spill work.
Upon insertion into the spill framework, a buffer provides a spill priority and a spill metrics callback. The priority dictates the order in which buffers are spilled; when a spill occurs, the callback is invoked to update metrics.
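A minimal sketch of that registration/callback flow, with hypothetical names (`SpillableBuffer`, `SpillMetricsCallback`, and `SpillFramework` are illustrative, not the actual spark-rapids API) and assuming lower priority values spill first:

```scala
import scala.collection.mutable

// Hypothetical callback invoked by the framework when a buffer is spilled.
trait SpillMetricsCallback {
  def onSpill(bytesSpilled: Long, spillTimeNs: Long): Unit
}

// Illustrative buffer handle: carries a spill priority and its metrics callback.
case class SpillableBuffer(
    id: Long,
    sizeBytes: Long,
    spillPriority: Long,
    onSpillMetrics: SpillMetricsCallback)

class SpillFramework {
  // Dequeue the lowest-priority buffer first (assumed spill order).
  private val buffers = mutable.PriorityQueue.empty[SpillableBuffer](
    Ordering.by[SpillableBuffer, Long](b => -b.spillPriority))

  def register(buffer: SpillableBuffer): Unit = synchronized {
    buffers.enqueue(buffer)
  }

  /** Spill buffers until `targetBytes` are freed, invoking each buffer's callback. */
  def spill(targetBytes: Long): Long = synchronized {
    var freed = 0L
    while (freed < targetBytes && buffers.nonEmpty) {
      val buf = buffers.dequeue()
      val start = System.nanoTime()
      // ... copy the buffer to host memory or disk here ...
      freed += buf.sizeBytes
      buf.onSpillMetrics.onSpill(buf.sizeBytes, System.nanoTime() - start)
    }
    freed
  }
}
```

The callback is the natural hook for whatever metrics we settle on (bytes spilled, spill time, per-operator attribution), since the framework already invokes it per buffer at spill time.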
This task is to define and implement the metrics we want, and decide at what granularity they should be reported.