abellina opened 1 year ago
Most of the metrics were done as part of https://github.com/NVIDIA/spark-rapids/pull/7935. All that is left from this is "unspill hit rate", but that probably depends on us having unspill enabled by default.
I think we should count not just "unspill" but also simply materializing a disk/host buffer as a +1. This would let us find operators that are stuck re-reading spilled buffers. I can think of join doing this; I'm not sure about others.
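As a rough sketch of what I mean (the names here are made up for illustration, not actual spark-rapids types), the metric could be a per-operator counter bumped each time a spilled buffer is materialized back from host memory or disk:

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical sketch of the "unspill hit" idea: count every time a spilled
// buffer is brought back from host memory or disk, so operators that keep
// re-reading spilled data (e.g. a join) show up with a high count.
class UnspillMetrics {
  private val hostReads = new AtomicLong(0)
  private val diskReads = new AtomicLong(0)

  // Called whenever a spilled buffer is materialized back on the GPU.
  def recordMaterialize(fromDisk: Boolean): Unit = {
    val counter = if (fromDisk) diskReads else hostReads
    counter.incrementAndGet()
  }

  def totalHits: Long = hostReads.get() + diskReads.get()
}
```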
I'd like to piggyback on the spill work I am doing now with https://github.com/NVIDIA/spark-rapids/issues/7709. This metric would be nice to add, but I think it should follow the spill work.
Upon insertion into the spill framework, a buffer provides a spill priority and a spill metrics callback. The priority dictates the order in which buffers are spilled; when a spill occurs, the callback is invoked to update metrics.
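A minimal sketch of that registration/callback flow, with hypothetical names (`SpillableBuffer`, `SpillMetricsCallback`, and `SpillFramework` are illustrative, not the actual spark-rapids API) and assuming lower priority values spill first:

```scala
import scala.collection.mutable

// Hypothetical callback invoked by the framework when a buffer is spilled.
trait SpillMetricsCallback {
  def onSpill(bytesSpilled: Long, spillTimeNs: Long): Unit
}

// Illustrative buffer handle: carries a spill priority and its metrics callback.
case class SpillableBuffer(
    id: Long,
    sizeBytes: Long,
    spillPriority: Long,
    onSpillMetrics: SpillMetricsCallback)

class SpillFramework {
  // Dequeue the lowest-priority buffer first (assumed spill order).
  private val buffers = mutable.PriorityQueue.empty[SpillableBuffer](
    Ordering.by[SpillableBuffer, Long](b => -b.spillPriority))

  def register(buffer: SpillableBuffer): Unit = synchronized {
    buffers.enqueue(buffer)
  }

  /** Spill buffers until `targetBytes` are freed, invoking each buffer's callback. */
  def spill(targetBytes: Long): Long = synchronized {
    var freed = 0L
    while (freed < targetBytes && buffers.nonEmpty) {
      val buf = buffers.dequeue()
      val start = System.nanoTime()
      // ... copy the buffer to host memory or disk here ...
      freed += buf.sizeBytes
      buf.onSpillMetrics.onSpill(buf.sizeBytes, System.nanoTime() - start)
    }
    freed
  }
}
```

The callback is the natural hook for whatever metrics we settle on (bytes spilled, spill time, per-operator attribution), since the framework already invokes it per buffer at spill time.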
This task is to define and implement the metrics we want, and decide at what granularity they should be reported.