StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Realm GPU Profiling Is Not Precise #1732

Open · lightsighter opened this issue 1 month ago

lightsighter commented 1 month ago

Realm provides a profiling response for measuring upper and lower bounds on when the kernels launched by a GPU task execute. The current implementation, though, is imprecise on the start time: it marks the start time as the point where the task starts running on the host, not when the first kernel is actually launched on the GPU. This is true both with and without the CUDA runtime hijack, and it leads to profiles that mistakenly suggest the GPU is busy running kernels when it is not.

@jjwilke for visibility.
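
For context, a minimal sketch of how an application requests this measurement through Realm's profiling interface (task/processor IDs are placeholders, and the exact `get_measurement` overload and field names follow the discussion in this thread rather than a verified API listing):

```cpp
#include "realm.h"
#include "realm/profiling.h"

using namespace Realm;
using namespace Realm::ProfilingMeasurements;

enum { GPU_TASK_ID = 100, PROF_RESPONSE_TASK_ID = 101 }; // placeholders

// Attach a request for the GPU timeline measurement to a GPU task launch.
Event launch_profiled(Processor gpu_proc, Processor prof_proc)
{
  ProfilingRequestSet prs;
  prs.add_request(prof_proc, PROF_RESPONSE_TASK_ID)
     .add_measurement<OperationTimelineGPU>();
  return gpu_proc.spawn(GPU_TASK_ID, /*args=*/nullptr, /*arglen=*/0, prs);
}

// The profiling-response task unpacks the measurement.
void prof_response_task(const void *args, size_t arglen,
                        const void *userdata, size_t userlen, Processor p)
{
  ProfilingResponse resp(args, arglen);
  OperationTimelineGPU timeline;
  if (resp.get_measurement(timeline)) {
    // timeline.start_time / timeline.finish_time bound when the GPU was
    // (reportedly) executing work for this task -- the bounds whose
    // precision is at issue here.
  }
}
```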

muraj commented 1 month ago

cperry/cupti should be nearly production-ready to deal with this. I'm using CUPTI to provide tight bounds on the timestamps. Unfortunately, due to some limitations with CUPTI, we have to enable most of the activity recording fairly early on and leave it enabled for the duration of the application, which can cause performance issues. So we'll need to expose a flag or something to enable this, and fall back to the old way when we can't.

lightsighter commented 1 month ago

> Unfortunately, due to some limitations with CUPTI, we have to enable most of the activity recording fairly early on and leave it enabled for the duration of the application ...

For Legion applications this should be fine, since we turn profiling on for the whole run; it's either off or on.

How much overhead would CUPTI cause if, say, I only added it to the first task launch and then didn't use it for the rest of the run?

elliottslaughter commented 1 month ago

And how much overhead are we actually talking about here (in ballpark terms)?

If it's really excessive, then I think a tradeoff of precision vs. overhead would be appreciated. If it's small enough, then we may live with just having it on all the time.

muraj commented 1 month ago

@elliottslaughter Actually, on further research, I was able to limit the activity recording pretty much to just when the task requests it, and only for the context in question. The GPU time recording will incur launch overhead at each call, but compared to what we do with the cudart hijack, this should be minimal. I don't have a proper measurement of the overhead at the moment; I'll have to ask someone with a proper test to confirm, as we really only have rudimentary tests for this right now. Ballparking (as requested), launch latency should add perhaps another 0.25us of GPU time to each launch (in order to record the start and stop of each kernel, memcpy, memset, etc.), which is introduced not by us but by CUPTI.

As for CPU time: since retrieving the buffer is done in another thread, and we wake up and wait for that thread to flush the profiling records from CUPTI, the start-of-task overhead is not significant, but the end-of-task overhead may be noticeable depending on how many CUDA operations are done in the same GPU context. I don't really have a number for this, and I've tried narrowing the scope of it as much as I could with CUPTI.

The biggest issue is that I have to enable the activity records for each driver and runtime API call, which can quickly fill these buffers. All we really need are the kernel, memcpy, and memset records with their corresponding external correlation records. Unfortunately, CUPTI won't add the external correlation records without enabling all the API records as well.
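
(For anyone following along, a rough sketch of the CUPTI setup this describes; it illustrates the activity kinds involved, not the actual code in the branch. Buffer-callback registration and error checking are omitted.)

```cpp
#include <cupti.h>

// Sketch only: enable just the records we want for one context, plus the
// process-wide API activity kinds CUPTI requires before it will emit
// external correlation records. Assumes activity buffer callbacks were
// already registered via cuptiActivityRegisterCallbacks.
static void enable_gpu_timeline_recording(CUcontext ctx)
{
  // The records we actually need: kernels, memcpys, memsets.
  cuptiActivityEnableContext(ctx, CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
  cuptiActivityEnableContext(ctx, CUPTI_ACTIVITY_KIND_MEMCPY);
  cuptiActivityEnableContext(ctx, CUPTI_ACTIVITY_KIND_MEMSET);

  // External correlation ties those records back to the Realm task...
  cuptiActivityEnable(CUPTI_ACTIVITY_KIND_EXTERNAL_CORRELATION);

  // ...but CUPTI only emits correlation records if the (voluminous)
  // driver/runtime API records are enabled too -- the buffer-pressure
  // issue described above.
  cuptiActivityEnable(CUPTI_ACTIVITY_KIND_DRIVER);
  cuptiActivityEnable(CUPTI_ACTIVITY_KIND_RUNTIME);
}
```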

> How much overhead would CUPTI cause if, say, I only added it to the first task launch and then didn't use it for the rest of the run?

@lightsighter In the current iteration of the branch, there shouldn't be any overhead if you don't explicitly request it. I do have a push/pop of the external correlation id at the beginning and end of each task and at each task switch, but those are thread-local operations that are super fast, and I don't really have a good way to enable/disable them as needed.
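
(The push/pop in question looks roughly like this; `realm_op_id` is a hypothetical stand-in for whatever id Realm uses to identify the operation, and the correlation kind is illustrative.)

```cpp
#include <cupti.h>
#include <cstdint>

// On task begin: every kernel/memcpy/memset launched by this thread
// until the matching pop gets tagged with realm_op_id. Thread-local
// and cheap, per the discussion above.
void on_task_begin(uint64_t realm_op_id)
{
  cuptiActivityPushExternalCorrelationId(
      CUPTI_EXTERNAL_CORRELATION_KIND_CUSTOM0, realm_op_id);
}

// On task end (or task switch): pop the id back off.
void on_task_end_or_switch()
{
  uint64_t last_id = 0;
  cuptiActivityPopExternalCorrelationId(
      CUPTI_EXTERNAL_CORRELATION_KIND_CUSTOM0, &last_id);
}
```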

lightsighter commented 1 month ago

Some more details that @muraj and I discussed, which I'll document here for everyone:

In the case that Realm does not see any GPU activity during the task, it is going to return 0 for both the start_time and finish_time of the OperationTimelineGPU. We'll need to modify the profiler to understand this and not draw a box when no GPU activity was detected in the GPU task.
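
(On the profiler side, the interpretation would be something like this sketch; `maybe_draw_gpu_box` and `draw_box` are illustrative names, not profiler code, and the sentinel value itself is debated further down the thread.)

```cpp
#include "realm/profiling.h" // for OperationTimelineGPU, as above

// Illustrative only: skip drawing a GPU interval when Realm reports
// that no GPU activity was detected for the task.
void maybe_draw_gpu_box(const Realm::ProfilingMeasurements::OperationTimelineGPU &tl)
{
  if (tl.start_time == 0 && tl.finish_time == 0)
    return; // no kernels/memcpys recorded for this task: draw nothing
  draw_box(tl.start_time, tl.finish_time); // hypothetical drawing call
}
```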

Currently Realm is only going to detect kernel launches and memcpy operations done on the GPU. There are lots of other kinds of CUDA activity we could detect using CUPTI, but doing so adds a bunch of complexity and code that is probably not going to be exercised very often. Consequently, I've given my blessing to capturing just kernels and memcpy operations for now, since that should handle >95% of the things we want Realm to profile on the GPU anyway. If there's something else we should profile, people should speak up and say what it is.

> launch latency should add perhaps another 0.25us of GPU time to each launch

I think we can live with that.

muraj commented 1 month ago

@lightsighter By default, the start and end timestamps were originally set to INVALID_TIMESTAMP, which is -INT_MAX. Do we want to keep that instead of zero, or should we change INVALID_TIMESTAMP to zero?

lightsighter commented 1 month ago

I'd be fine with keeping it as is with the -INT_MAX. Although just to be clear, it should be -LONG_LONG_MAX, right, since the timestamps are long long?

elliottslaughter commented 1 month ago

Any sentinel value should definitely not be something that can occur in real data. I think we looked at this a while ago and concluded that while a 0 timestamp is unlikely, it is not impossible.

Having said that, -LLONG_MAX is not the same as LLONG_MIN. Presumably that's what you meant? https://cplusplus.com/reference/climits/
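
(For reference, in two's complement LLONG_MIN is -LLONG_MAX - 1, so the two candidate sentinels differ by one; a standalone check, nothing Realm-specific:)

```cpp
#include <climits>
#include <cstdio>

int main()
{
  // LLONG_MIN and -LLONG_MAX are distinct values.
  std::printf("LLONG_MIN  = %lld\n", LLONG_MIN);  // -9223372036854775808
  std::printf("-LLONG_MAX = %lld\n", -LLONG_MAX); // -9223372036854775807
  return 0;
}
```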

muraj commented 1 month ago

Ah, you're right, LLONG_MIN is what we're using.

muraj commented 1 month ago

https://gitlab.com/StanfordLegion/legion/-/merge_requests/1403 is ready for review; please let me know if this is acceptable. Thanks!