NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

host watermark metric #11725

Open zpuller opened 6 days ago

zpuller commented 6 days ago

Adds a watermark metric to track the maximum amount of memory allocated on the host per task.

Tested manually and verified the values look sane in the UI and in logs.
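A minimal, hypothetical sketch of that kind of per-task bookkeeping (invented names, not the plugin's actual classes): each allocation bumps a running total for the task, each free decrements it, and the reported metric is the peak total observed.

```scala
// Hypothetical per-task host watermark: a running total plus its peak.
class TaskHostWatermark {
  private var current: Long = 0L
  private var max: Long = 0L

  def onAllocated(bytes: Long): Unit = synchronized {
    current += bytes
    if (current > max) max = current
  }

  def onFreed(bytes: Long): Unit = synchronized {
    current -= bytes
  }

  // Peak host memory observed for this task, surfaced as the metric.
  def watermark: Long = synchronized { max }
}
```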

zpuller commented 3 days ago

build

zpuller commented 2 days ago

> overall lgtm, could we do a UT for this?

It's possible, and normally I would always be in favor of doing so, but in this case I opted not to for the following reasons:

If you or others feel strongly I can still find a way to add tests regardless.

zpuller commented 2 days ago

> I'd like for us to investigate the case where we can go negative on deallocation, that seems odd.

FYI, I didn't actually observe this here; it was a carryover from having observed it in the disk spill metric impl. In any case, I can still double-check if/why that may happen.

jihoonson commented 1 day ago

Thanks @zpuller for addressing the comments. The code change looks good to me. As for the metric being negative, I agree that it would be better to understand when and why it happens. But since that issue doesn't seem to be introduced by this change, I would be fine with merging this PR and continuing the investigation if it takes long.

abellina commented 1 day ago

I don't think the way the accumulators here are getting combined (max) works in this case, unless I am missing something. Because a pinned allocation (or a disk or device allocation) could be allocated in thread 1 and freed in thread 2, we could end up with a negative value in thread 2 and overcount thread 1's max.
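To make the failure mode concrete, here is a hypothetical example (invented numbers and variable names) of per-task counters that only see their own thread's alloc/free deltas:

```scala
// Hypothetical illustration of the cross-thread alloc/free problem.
object CrossThreadExample extends App {
  var task1Current = 0L
  var task1Max     = 0L
  var task2Current = 0L

  // A pinned buffer is allocated while task 1's thread is running ...
  task1Current += (1L << 30)                   // +1 GiB on task 1
  task1Max = math.max(task1Max, task1Current)  // task 1 watermark: 1 GiB

  // ... but the same buffer is freed later on task 2's thread.
  task2Current -= (1L << 30)                   // task 2 drops to -1 GiB

  println(s"task1 watermark = $task1Max, task2 current = $task2Current")
  // Task 1 keeps the freed buffer in its max (overcount), task 2 goes negative.
}
```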

I thought the idea behind the metric was to get the "max used memory", and so that means to me that really the task should sample the tracker (device/host/disk) for the max used memory during the lifecycle of the task. Yes, that means the task is accounting for other tasks' memory, but the max is what we are after. So I think when we allocate/free, instead of sending the delta to the task-specific counter to inc/dec, we should set the watermark and recompute the max.

revans2 commented 14 hours ago

> I thought the idea behind the metric was to get the "max used memory", and so that means to me that really the task should sample the tracker (device/host/disk) for the max used memory during the lifecycle of the task. Yes, that means the task is accounting for other tasks' memory, but the max is what we are after. So I think when we allocate/free, instead of sending the delta to the task-specific counter to inc/dec, we should set the watermark and recompute the max.

Thanks @abellina, I missed that. Yes, the max and the allocated-memory values need to be stored in a singleton/static location and protected with some kind of lock/atomic. I think that would be enough to make this all work.
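A hypothetical sketch of that suggestion (names invented here, not the plugin's actual API): keep the running total and the watermark in one process-wide tracker, updated atomically on every alloc/free, and let each task sample the watermark over its lifetime instead of keeping its own signed delta counter.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical process-wide host memory tracker shared by all tasks.
object HostMemoryTracker {
  private val current   = new AtomicLong(0L)
  private val watermark = new AtomicLong(0L)

  def onAllocated(bytes: Long): Unit = {
    val now = current.addAndGet(bytes)
    // CAS loop: only ever raise the watermark, never lower it.
    var seen = watermark.get()
    while (now > seen && !watermark.compareAndSet(seen, now)) {
      seen = watermark.get()
    }
  }

  def onFreed(bytes: Long): Unit = {
    current.addAndGet(-bytes)
  }

  // What a task would sample during its lifecycle; it may include memory
  // held by other tasks, but it can never go negative.
  def currentWatermark: Long = watermark.get()
}
```

With this shape, a task records the largest `currentWatermark` it observes between task start and completion, which matches the "max used memory" reading described above at the cost of attributing shared usage to every concurrent task.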

zpuller commented 13 hours ago

build