abellina opened this issue 2 years ago
Thought about this issue a bit more. What I think we want is a version of the `tracking_resource_adaptor` that, rather than keeping a single map for all threads, tracks the maximum outstanding GPU footprint per thread. The main motivation here is to figure out whether our estimate of the memory usage for some GPU code is higher than anticipated, to help us debug waste or inform heuristics that control which tasks we allow on the GPU.
This should allow us to do the following:
```scala
val maxOutstandingUsage = withMemoryTracking {
  val gpuData = ... // materialize data on the GPU
  val result = withResource(gpuData) { _.callCudfFunction }
  result.close()
  // at this point our maximum outstanding should be:
  // gpuData + max(allocated) inside of `callCudfFunction`
}
```
In this scenario, when we enter the `withMemoryTracking` block we would ask a per-thread tracking resource to start tracking this thread before we materialize the data. The materialization of `gpuData` incurs calls to RMM to get memory, so that adds to the outstanding amount. The call into the cuDF code can then produce allocations that are kept around (outstanding) for a while, allocations and frees that happen within the C++ code before the kernel, or results returned from that code. So we can keep track of how much is outstanding at any given time by adding to a thread-local counter how many bytes have been requested, subtracting when we call free, and recording the high-water mark as we go.
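Roughly, the bookkeeping could look like the sketch below. This is a minimal sketch only: `ThreadMemoryTracker`, `onAllocated`, and `onFreed` are names I am making up here, not an existing spark-rapids or RMM API, and the real hooks would live in the `tracking_resource_adaptor` / `RmmJni` layer.

```scala
object ThreadMemoryTracker {
  private class State {
    var outstanding: Long = 0L     // bytes currently allocated by this thread
    var maxOutstanding: Long = 0L  // high-water mark while tracking was enabled
  }

  // No initial value: a null State means tracking is off for this thread.
  private val state = new ThreadLocal[State]()

  def withMemoryTracking[T](body: => T): Long = {
    state.set(new State)
    try {
      body
      state.get.maxOutstanding
    } finally {
      state.remove()
    }
  }

  // Called from the allocation path (e.g. the RMM allocate callback).
  def onAllocated(bytes: Long): Unit = {
    val s = state.get
    if (s != null) {
      s.outstanding += bytes
      s.maxOutstanding = math.max(s.maxOutstanding, s.outstanding)
    }
  }

  // Called from the free path; ignored when tracking is off.
  def onFreed(bytes: Long): Unit = {
    val s = state.get
    if (s != null) {
      s.outstanding -= bytes
    }
  }
}
```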
If one of our allocations fails and we handle it via a spill, that shouldn't matter. The spill code should be careful to disable tracking around those spills (e.g. via a `withoutMemoryTracking` call). This way we wouldn't discount frees in this thread that really belong to some other thread's allocations and are irrelevant to the code being tracked.
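For example, `withoutMemoryTracking` could be as simple as suspending the thread-local state around the spill. This is again a sketch against the hypothetical `ThreadMemoryTracker` above, not an existing API:

```scala
// Added to the hypothetical ThreadMemoryTracker sketch above: temporarily
// suspend tracking so the spill path's allocations and frees are not charged
// to the code being measured.
def withoutMemoryTracking[T](body: => T): T = {
  val saved = state.get  // null if tracking is not enabled on this thread
  state.remove()         // onAllocated/onFreed see no tracker and do nothing
  try {
    body
  } finally {
    if (saved != null) state.set(saved)
  }
}
```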
I hope/believe this could be a pretty low-overhead system. Note that I don't think it helps with tracking when an expensive kernel is loaded; as far as I understand, that can be a one-time penalty when we open the shared library. We have seen this with some of the regular expression kernels in the past. Pinging @jlowe for comments on this overall.
I think one approach here is to have a stack of simple memory-tracking info in RmmJni. When a `withMemoryTracking` block is entered we push one of these objects onto the stack. The `tracking_resource_adaptor` can then check this stack for the current thread and, if it is non-empty, use the top tracker to track allocations for now.
When `withMemoryTracking` finishes, it calls a function in the RMM JNI bits to pop this element from the stack. If it was the last element, we have turned the feature off. If it was not the last element, we get the amount tracked in this scope and add the maximum outstanding we just popped to the next element in the stack (the calling scope also saw that maximum outstanding), and we continue to track with the remaining trackers on the stack.
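A sketch of that nesting behavior, assuming the same made-up names as in the earlier sketch; the real push/pop would go through the RMM JNI so the `tracking_resource_adaptor` can consult the top of the stack:

```scala
import scala.collection.mutable

object NestedMemoryTracker {
  private class TrackerScope {
    var outstanding: Long = 0L
    var maxOutstanding: Long = 0L
  }

  private val scopes = new ThreadLocal[mutable.Stack[TrackerScope]] {
    override def initialValue(): mutable.Stack[TrackerScope] = mutable.Stack.empty
  }

  def withMemoryTracking[T](body: => T): Long = {
    val stack = scopes.get
    stack.push(new TrackerScope)
    try {
      body
      stack.top.maxOutstanding
    } finally {
      val finished = stack.pop()
      if (stack.nonEmpty) {
        // The calling scope also saw this peak on top of what it already had
        // outstanding, so fold the popped maximum into the parent tracker.
        val parent = stack.top
        parent.maxOutstanding = math.max(
          parent.maxOutstanding, parent.outstanding + finished.maxOutstanding)
        // Anything the inner scope left allocated stays outstanding in the caller.
        parent.outstanding += finished.outstanding
      }
      // An empty stack means the feature is now off for this thread.
    }
  }

  // The alloc/free hooks update only the top tracker, mirroring the earlier sketch.
  def onAllocated(bytes: Long): Unit = {
    val stack = scopes.get
    if (stack.nonEmpty) {
      val top = stack.top
      top.outstanding += bytes
      top.maxOutstanding = math.max(top.maxOutstanding, top.outstanding)
    }
  }

  def onFreed(bytes: Long): Unit = {
    val stack = scopes.get
    if (stack.nonEmpty) stack.top.outstanding -= bytes
  }
}
```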
Unfortunately, we also need to keep a set of the addresses allocated by this thread. Given spill, the current thread may need to spill to satisfy an allocation, so we should ignore frees for addresses we didn't allocate while tracking. The hope is that these `withMemoryTracking` blocks stay as close as possible to a cuDF call.
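A sketch of that address set, again with made-up names; the alloc/free hooks above would consult it before updating any counters:

```scala
import scala.collection.mutable

object TrackedAddresses {
  // Addresses allocated by this thread while tracking was enabled.
  private val addresses = new ThreadLocal[mutable.Set[Long]] {
    override def initialValue(): mutable.Set[Long] = mutable.Set.empty[Long]
  }

  def recordAllocation(address: Long): Unit =
    addresses.get += address

  // True only if this thread allocated the address while tracking; frees of
  // anything else (e.g. buffers freed on behalf of another thread) are ignored.
  def shouldCountFree(address: Long): Boolean =
    addresses.get.remove(address)
}
```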
Nsys has recently added memory tracking capabilities, and we believe we can use the correlationId plus NVTX ranges to accomplish this as a post-processing step over a given NVTX range. We should investigate whether this solution does what we need.
Hi @abellina, I am trying to profile the GPU memory usage during a query run. I used nsys to profile, but didn't find metrics like peak memory usage. I was using NVIDIA Nsight Systems version 2022.2.1.31-5fe97ab installed in our internal cluster.
I saw a post about it, https://forums.developer.nvidia.com/t/nsys-measure-memory/118394, which was posted in 2021 but does show the memory usage part in the graph...
Update: the memory usage metrics are disabled by default; they can be turned on with an extra nsys argument, `--cuda-memory-usage=true`. Then we can see the memory utilization part in the graph.
I haven't used this feature. The main question I'd have is whether it works with a pool, especially the async pools. It most definitely does not work with ARENA, because that is all CPU-managed, but I'd hope cudaAsync shows it.
The profile result above is from a run with ASYNC pool.
The maximum amount of GPU memory each task uses is a very helpful metric to know if an application is getting close to needing to spill or not.
Tracking the memory currently on the GPU, or spilled to host memory, etc., is also really interesting.
The problem is how to gather this metric in an efficient way. The Retry framework could keep track of the amount of memory that is allocated on a given thread, and the amount that is also deallocated/freed by that thread. It would not take into account memory that is then freed by other threads (as in the case of spill, or UCX shuffle). Instead, we would almost want to associate each allocation with a given thread, but that can be very memory intensive on the host, especially because we are likely to see thousands of active buffers.
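To make the cost concern concrete, associating every live allocation with its owning thread would look something like the sketch below, i.e. one host-side map entry per live buffer. These are hypothetical names, not the Retry framework's actual API:

```scala
import scala.collection.mutable

object PerThreadAllocationTracker {
  private case class Owner(threadId: Long, bytes: Long)

  // One entry per live buffer: this is the host-memory cost in question,
  // which grows with the thousands of buffers we expect to be active.
  private val liveAllocations = mutable.HashMap.empty[Long, Owner]     // address -> owner
  private val outstandingByThread = mutable.HashMap.empty[Long, Long]  // thread -> bytes

  def onAllocated(address: Long, bytes: Long): Unit = synchronized {
    val tid = Thread.currentThread().getId
    liveAllocations(address) = Owner(tid, bytes)
    outstandingByThread(tid) = outstandingByThread.getOrElse(tid, 0L) + bytes
  }

  // The free is attributed back to the allocating thread, even when another
  // thread (spill, UCX shuffle) is the one actually calling free.
  def onFreed(address: Long): Unit = synchronized {
    liveAllocations.remove(address).foreach { owner =>
      outstandingByThread(owner.threadId) =
        outstandingByThread.getOrElse(owner.threadId, 0L) - owner.bytes
    }
  }
}
```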
We should experiment to see how expensive this is in practice and, if it is not too bad, implement it.