StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Regent: support turning off hijack with cuda 10+ #1059

Open syamajala opened 3 years ago

syamajala commented 3 years ago

https://github.com/StanfordLegion/legion/blob/stable/language/src/regent/cudahelper.t#L136-L144

It does not look like it is possible to turn off the CUDA hijack with CUDA 10+, as cudaConfigureCall, cudaLaunch, and cudaSetupArgument have been removed.
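
For reference, the old and new launch paths look roughly like this (a sketch, not Regent's actual generated code; my_kernel_stub is a hypothetical host-side stub):

#include <cuda_runtime.h>

extern void my_kernel_stub(void); /* hypothetical host-side kernel stub */

void launch_example(int arg0, float arg1, cudaStream_t stream) {
  dim3 grid(64), block(256);

  /* Pre-CUDA-10 launch ABI (now removed; shown for reference only):
     cudaConfigureCall(grid, block, 0, stream);
     cudaSetupArgument(&arg0, sizeof(arg0), 0);
     cudaSetupArgument(&arg1, sizeof(arg1), sizeof(arg0));
     cudaLaunch((const void *)my_kernel_stub);                        */

  /* CUDA 10+ replacement: one call taking an array of argument pointers. */
  void *args[] = { &arg0, &arg1 };
  cudaLaunchKernel((const void *)my_kernel_stub, grid, block, args, 0, stream);
}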

elliottslaughter commented 1 year ago

I think this is what https://gitlab.com/StanfordLegion/legion/-/tree/hijack_registration_hack is for, but we haven't kept that branch up to date.

syamajala commented 1 year ago

Are the changes only in Regent? If so, I can take a crack at updating the branch. This issue is a blocker for testing UCX #1396.

elliottslaughter commented 1 year ago

I squashed the branch and pushed it to: https://gitlab.com/StanfordLegion/legion/-/commits/hijack_registration_hack_squash

It definitely won't rebase cleanly. But you can actually see the changes now, so if you want to recreate the branch (from scratch, probably, because I don't think any automated approach will work), go ahead. Hopefully we can minimize the changes this time so that it's easier to keep up-to-date.

magnatelee commented 1 year ago

I'm actually confused. The issue was about Regent using old CUDA runtime calls. As far as I know, that issue should be fixed in the latest master branch, as you can see here: https://github.com/StanfordLegion/legion/blob/master/language/src/regent/gpu/cuda.t#L573-L574

The codegen no longer relies on those deprecated APIs; it uses the new one (cudaLaunchKernel). What changes do we need to port over from hijack_registration_hack?

elliottslaughter commented 1 year ago

I believe @syamajala saw another error when the hijack was disabled. Maybe he can post it here?

magnatelee commented 1 year ago

I believe the issue hijack_registration_hack was trying to work around was the codegen passing fake host and device function pointers to hijackCudaRegisterFunction, which eventually calls into __cudaRegisterFunction. Normally the call would be hijacked by Realm and nothing bad would happen, but I vaguely recall there were cases where Realm failed to intercept the call, which would then go into the CUDA runtime and blow up. So the workaround was to call directly into Realm to register functions, as this line indicates:

https://gitlab.com/StanfordLegion/legion/-/blob/hijack_registration_hack/bindings/regent/regent_cudart_hijack.cc#L35-36
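
As a rough illustration of the interception being described (a simplified sketch, not Realm's actual code; realm_register_kernel is a hypothetical stand-in for Realm's registration path):

#include <cuda_runtime.h> /* for uint3/dim3 */

extern "C" void realm_register_kernel(const char *host_fun,
                                      const char *device_name); /* hypothetical */

/* Sketch of the hijack: the build provides its own definition of the
   runtime's internal registration hook, so generated code that calls it
   resolves here instead of into libcudart. */
extern "C" void __cudaRegisterFunction(void **fat_bin_handle,
                                       const char *host_fun, char *device_fun,
                                       const char *device_name, int thread_limit,
                                       uint3 *tid, uint3 *bid,
                                       dim3 *block_dim, dim3 *grid_dim,
                                       int *wsize) {
  /* Record the (host stub -> device kernel) mapping so a later launch
     against host_fun can be resolved to a real kernel. */
  realm_register_kernel(host_fun, device_name);
}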

I suspect this doesn't actually solve the problem that happens when the hijack is turned off.

magnatelee commented 1 year ago

> I believe @syamajala saw another error when the hijack was disabled. Maybe he can post it here?

That error message might be helpful for us to figure out the right arguments to pass to registration calls.

syamajala commented 1 year ago

The error I was seeing was this: assertion failed: kernel launch failed

magnatelee commented 1 year ago

The error message isn't that enlightening... can you run it with Compute Sanitizer and paste what you see? It should tell you exactly why the kernel launch failed. (I have my theory, but I want to confirm that's really the case.)

syamajala commented 1 year ago

I'm not sure if the sanitizer is giving me anything useful. I'm running a build with debug symbols. Here is what I got:

========= Program hit invalid device function (error 98) on CUDA API call to cudaLaunchKernel.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x424d26]
=========                in /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1
=========     Host Frame:cudaLaunchKernel [0x67ec8]
=========                in /cm/shared/apps/cuda11.4/toolkit/11.4.2/targets/x86_64-linux/lib/libcudart.so.11.0
=========     Host Frame: [0x28c3f]
=========                in /lustre/scratch/vsyamaj/legion_s3d_no_hijack//build/hept/libphysical_tasks.so
=========     Host Frame: [0x26449]
=========                in /lustre/scratch/vsyamaj/legion_s3d_no_hijack//build/hept/libphysical_tasks.so
=========     Host Frame: [0x938516]
=========                in /lustre/scratch/vsyamaj/legion_s3d_no_hijack//legion/language/build/lib/librealm.so.1
=========     Host Frame: [0x9a47d0]
=========                in /lustre/scratch/vsyamaj/legion_s3d_no_hijack//legion/language/build/lib/librealm.so.1
=========     Host Frame: [0x9a8568]
=========                in /lustre/scratch/vsyamaj/legion_s3d_no_hijack//legion/language/build/lib/librealm.so.1
=========     Host Frame: [0x9f68ae]
=========                in /lustre/scratch/vsyamaj/legion_s3d_no_hijack//legion/language/build/lib/librealm.so.1
=========     Host Frame: [0x9a73e7]
=========                in /lustre/scratch/vsyamaj/legion_s3d_no_hijack//legion/language/build/lib/librealm.so.1
=========     Host Frame: [0x9a7a0a]
=========                in /lustre/scratch/vsyamaj/legion_s3d_no_hijack//legion/language/build/lib/librealm.so.1
=========     Host Frame: [0x9aeb66]
=========                in /lustre/scratch/vsyamaj/legion_s3d_no_hijack//legion/language/build/lib/librealm.so.1
=========     Host Frame: [0x9baa29]
=========                in /lustre/scratch/vsyamaj/legion_s3d_no_hijack//legion/language/build/lib/librealm.so.1
=========     Host Frame: [0x82de]
=========                in /lib64/libpthread.so.0
=========     Host Frame:clone [0xfbe83]
=========                in /lib64/libc.so.6
========= 
assertion failed: kernel launch failed

Here is a stack trace:

#0  0x0000155552164238 in nanosleep () from /lib64/libc.so.6
#1  0x000015555216413e in sleep () from /lib64/libc.so.6
#2  0x00001555487dbff8 in Realm::realm_freeze (signal=6) at /lustre/scratch/vsyamaj/legion_s3d_no_hijack/legion/runtime/realm/runtime_impl.cc:183
#3  <signal handler called>
#4  0x00001555520d370f in raise () from /lib64/libc.so.6
#5  0x00001555520bdb25 in abort () from /lib64/libc.so.6
#6  0x000015554fd250a3 in $<CalcVolumeTask>.56 () from /lustre/scratch/vsyamaj/legion_s3d_no_hijack//build/hept/libphysical_tasks.so
#7  0x000015554fd246c9 in $__regent_task_CalcVolumeTask_cuda () from /lustre/scratch/vsyamaj/legion_s3d_no_hijack//build/hept/libphysical_tasks.so
#8  0x00001555487b7516 in Realm::LocalTaskProcessor::execute_task (this=0x4cc65e0, func_id=50, task_args=...) at /lustre/scratch/vsyamaj/legion_s3d_no_hijack/legion/runtime/realm/proc_impl.cc:1135
#9  0x00001555488237d0 in Realm::Task::execute_on_processor (this=0x154eac038ba0, p=...) at /lustre/scratch/vsyamaj/legion_s3d_no_hijack/legion/runtime/realm/tasks.cc:302
#10 0x0000155548827568 in Realm::KernelThreadTaskScheduler::execute_task (this=0x4cc6af0, task=0x154eac038ba0) at /lustre/scratch/vsyamaj/legion_s3d_no_hijack/legion/runtime/realm/tasks.cc:1366
#11 0x00001555488758ae in Realm::Cuda::GPUTaskScheduler<Realm::KernelThreadTaskScheduler>::execute_task (this=0x4cc6af0, task=0x154eac038ba0) at /lustre/scratch/vsyamaj/legion_s3d_no_hijack/legion/runtime/realm/cuda/cuda_module.cc:1568
#12 0x00001555488263e7 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x4cc6af0) at /lustre/scratch/vsyamaj/legion_s3d_no_hijack/legion/runtime/realm/tasks.cc:1105
#13 0x0000155548826a0a in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x4cc6af0) at /lustre/scratch/vsyamaj/legion_s3d_no_hijack/legion/runtime/realm/tasks.cc:1217
#14 0x000015554882db66 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x4cc6af0) at /lustre/scratch/vsyamaj/legion_s3d_no_hijack/legion/runtime/realm/threads.inl:97
#15 0x0000155548839a29 in Realm::KernelThread::pthread_entry (data=0x5d79730) at /lustre/scratch/vsyamaj/legion_s3d_no_hijack/legion/runtime/realm/threads.cc:781
#16 0x00001555524662de in start_thread () from /lib64/libpthread.so.0
#17 0x0000155552197e83 in clone () from /lib64/libc.so.6

magnatelee commented 1 year ago

I think this explains the cause:

> ========= Program hit invalid device function (error 98) on CUDA API call to cudaLaunchKernel.

This basically means the kernel handle given to the cudaLaunchKernel call is invalid, which indicates that Regent doesn't correctly refer to the kernels it generated. This is actually what I was expecting, and the right way to fix this is to replace this line with code that pulls out a proper host pointer to each kernel and uses it:

https://gitlab.com/StanfordLegion/legion/-/blob/master/language/src/regent/gpu/cuda.t#L406

We weren't previously doing this, as we didn't cache identities of kernels. But now I believe Regent is maintaining a list of kernels it generated (as you can see here), so we should be able to make this work with the CUDA ABI.
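
A sketch of the shape of that fix (hypothetical names; assumes the kernel table is populated at registration time):

#include <cuda_runtime.h>
#include <map>
#include <string>

/* Hypothetical: map each generated kernel to the real host-side handle
   CUDA registered for it, so cudaLaunchKernel() never sees a fake pointer. */
static std::map<std::string, const void *> kernel_table;

void record_kernel(const std::string &name, const void *host_stub) {
  kernel_table[name] = host_stub;
}

cudaError_t launch_by_name(const std::string &name, dim3 grid, dim3 block,
                           void **args, size_t shmem, cudaStream_t stream) {
  auto it = kernel_table.find(name);
  if (it == kernel_table.end())
    return cudaErrorInvalidDeviceFunction; /* the error 98 seen above */
  return cudaLaunchKernel(it->second, grid, block, args, shmem, stream);
}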

syamajala commented 11 months ago

@wilsonCernWq is running into this issue. I think for what he's trying to do with S3D we need to turn the CUDA hijack off? How much work is this to fix?

elliottslaughter commented 11 months ago

What exactly is the issue you're running into?

I thought we had already validated the no-hijack execution with Regent.

syamajala commented 11 months ago

The issue I saw before, which I was able to reproduce now as well, is assertion failed: kernel launch failed. According to all the debugging I did before, this is due to kernels not getting registered properly when we turn off the CUDA hijack.

syamajala commented 6 months ago

This issue is still there...

muraj commented 1 month ago

Any update on this issue, @magnatelee? The 24.09 release is this month, and I was hoping to mark the CUDART hijack as deprecated by now.

syamajala commented 4 weeks ago

This would be very useful to have on Perlmutter, and I believe it's in fact needed in order to use ADIOS with S3D there.

elliottslaughter commented 5 days ago

Could someone provide guidance on the correct APIs to call? For context, here's the current code:

https://github.com/StanfordLegion/legion/blob/c032dab254f423ccab36d05c47fed42b94f0b3f5/language/src/regent/gpu/cuda.t#L235-L261

Note there are two versions, one for PTX and one for cubin.

I see in the driver API there are at least three API options: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MODULE.html

It's not obvious to me what data format these require: do I pass the same fatbin type struct I've been passing to the hijack APIs?

Also it's not obvious to me how to get the definition of CUmodule. Please note that we have tried very hard to NOT include the CUDA driver header file into this code, as you'll see up above in the codebase:

https://github.com/StanfordLegion/legion/blob/c032dab254f423ccab36d05c47fed42b94f0b3f5/language/src/regent/gpu/cuda.t#L53-L85

So we need to somehow do something along these lines for any new APIs I need.

@muraj @magnatelee

muraj commented 5 days ago

cuModuleLoadDataEx is just cuModuleLoadData with some extra options. cuModuleLoadFatBinary is, I think, a very old API that no one ever uses any more; pretty much everyone just uses cuModuleLoadData{Ex}. There is also a new API, cuLibraryLoadData, which exposes CUkernels. These are "device agnostic" versions of cuModuleLoadData{Ex} that load a "library", which ends up being a collection of modules, one per device (more of a convenience factor, plus some GPU memory optimizations).

> Also it's not obvious to me how to get the definition of CUmodule. Please note that we have tried very hard to NOT include the CUDA driver header file into this code, as you'll see up above in the codebase:

It's an opaque pointer. Doing the following in C/C++ is good enough and is what we do in cuda_module.h for Realm:

struct CUmod_st;                    // This line multiple times is okay
typedef struct CUmod_st *CUmodule;  // Make sure to not include this line if you're also including cuda.h
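
(An incomplete type is enough here because CUmodule is only ever passed around as an opaque handle; nothing outside the driver needs its layout.)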

> It's not obvious to me what data format these require: do I pass the same fatbin type struct I've been passing to the hijack APIs?

This is fairly clearly defined in the CUDA documentation for cuModuleLoadDataEx and friends; see below. Let me know if there's still some confusion:

> The image may be a cubin or fatbin as output by nvcc, or a NULL-terminated PTX, either as output by nvcc or hand-written.

> So we need to somehow do something along these lines for any new APIs I need.

Do you really need to go through the complete linking step here (i.e. using cuLink*)? Do you have a fully self-contained PTX/cubin/fatbin, or do you need to link several of these images together at runtime? If not, then cuModuleLoadData is enough (the Ex variant if you want extra debug information like PTX JIT logs, want to control JIT optimizations, or need the other flags and knobs detailed in the documentation).

I don't know enough about the code paths here to really give good guidance on what exactly should be used. I can give examples of code using cuModuleLoadData and cuLink* APIs if that would help?
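
For instance, the self-contained case might look roughly like this (an untested sketch; grid/block sizes are placeholders):

#include <cuda.h>

/* Sketch: load a self-contained cubin/fatbin/PTX image and launch one
   kernel from it via the driver API (assumes a current CUcontext). */
CUresult load_and_launch(const void *image, const char *kernel_name,
                         void **args, CUstream stream) {
  CUmodule mod;
  CUfunction func;
  CUresult res;
  if ((res = cuModuleLoadData(&mod, image)) != CUDA_SUCCESS)
    return res; /* image may be cubin, fatbin, or NULL-terminated PTX */
  if ((res = cuModuleGetFunction(&func, mod, kernel_name)) != CUDA_SUCCESS)
    return res;
  return cuLaunchKernel(func, 64, 1, 1,   /* grid dims */
                        256, 1, 1,        /* block dims */
                        0, stream,        /* shared mem bytes, stream */
                        args, NULL);      /* kernel params, extra */
}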

elliottslaughter commented 1 day ago

I think I've got this mostly worked out. The main remaining question I have is about how to get the task stream.

In HIP we've got an API called hipGetTaskStream that returns the HIP task stream. I'd need something like this in CUDA to return the CUstream, I guess cuGetTaskStream to match Driver API conventions.

By the way, I've found that I have to code most/all of this against the Driver API. cuModule* and friends appear to be Driver API only. I just want to double check I'm not missing anything because most of the pre-existing code (aside from the linking bits) assumed the Runtime API.

muraj commented 1 day ago

@elliottslaughter Yes, cuModule* are Driver API only. I would not recommend linking directly against libcuda.so though. I would instead recommend building a loader with dlopen/dlsym or try using cuDriverGetProcAddress (after retrieving this symbol via dlsym of course).
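
A minimal sketch of that loader pattern (assuming the driver library is named libcuda.so.1):

#include <dlfcn.h>
#include <stdio.h>

/* Sketch: resolve driver entry points at runtime instead of linking
   against libcuda.so. The function-pointer type is hand-declared so no
   CUDA header is needed. */
typedef int (*cuInit_fn)(unsigned int flags);

void *open_cuda_driver(void) {
  void *lib = dlopen("libcuda.so.1", RTLD_NOW | RTLD_LOCAL);
  if (!lib) {
    fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return NULL;
  }
  cuInit_fn init = (cuInit_fn)dlsym(lib, "cuInit");
  if (!init || init(0) != 0 /* CUDA_SUCCESS */) {
    dlclose(lib);
    return NULL;
  }
  /* Further symbols (cuModuleLoadData, cuLaunchKernel, ...) can be pulled
     the same way, or via cuDriverGetProcAddress once it has been looked up. */
  return lib;
}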

As to cuGetTaskStream -- we have provided a Realm::CudaModule::get_task_stream() that does provide a CUstream; can you use this? I would rather not provide global functions that look like CUDA driver APIs. One reason is the question of what to do if a similar API is eventually provided by CUDA in the future; then we'd have a symbol conflict. Another reason is that if we keep it inside the module, the module can be looked up dynamically (via Runtime::get_module<>()) and proper error handling can happen at runtime rather than at compile time (say, if you tried to use this with REALM_USE_CUDA off, all callers would need an ifdef REALM_USE_CUDA, but if the shared-library version of Realm did have CUDA, the application wouldn't leverage it). I would like to get us toward querying and configuring applications based on what we find we can support rather than what we compile for.
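
Roughly like this, assuming the module is registered under the name "cuda" (an untested sketch):

#include "realm.h"
#include "realm/cuda/cuda_module.h"

/* Sketch: look the CUDA module up dynamically so a CUDA-less Realm can be
   handled at runtime instead of at compile time. */
CUstream current_task_stream(void) {
  Realm::Cuda::CudaModule *mod =
      Realm::Runtime::get_runtime().get_module<Realm::Cuda::CudaModule>("cuda");
  if (!mod)
    return NULL; /* this Realm runtime has no CUDA support */
  return mod->get_task_stream(); /* method name from memory */
}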

Let me know if you have further questions or concerns. Thanks!

elliottslaughter commented 1 day ago

Looks like it's called CudaModule::get_task_cuda_stream(), but I see it now.

I don't care what it's called, but I need to call this via pure C APIs. If you want to adapt the entire Realm module API to C just so I can call this, I suppose that's fine. Anyway, as a pure-C consumer of Realm, this is inaccessible at the moment, so I need to get to the functionality somehow; in HIP it's provided via a single (C API) wrapper function, which does make it somewhat easier.

muraj commented 1 day ago

> Anyway, as a pure-C consumer of Realm

I'm confused... how do you do things like spawn tasks and such, which currently is only possible with C++ APIs in Realm AFAIK? Do you have your own C layer here, and if so, can you not provide a wrapper for this yourself?

Future looking: I would love to move Realm to C APIs, but now's probably not the time :P

eddy16112 commented 1 day ago

@elliottslaughter Hip::get_task_hip_stream() is different from hipGetTaskStream: after calling Hip::get_task_hip_stream(), you have to call Hip::set_task_ctxsync_required(false) to disable the context synchronization explicitly. Here is an example: https://gitlab.com/StanfordLegion/legion/-/blob/master/test/realm/task_stream.cc?ref_type=heads#L55.
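
i.e., roughly (a sketch following that example; the module name "hip" is an assumption):

#include "realm.h"
#include "realm/hip/hip_module.h"

/* Sketch: grab this task's stream and tell Realm we'll synchronize it
   ourselves, skipping the implicit context synchronization. */
void use_task_stream(void) {
  Realm::Hip::HipModule *hip =
      Realm::Runtime::get_runtime().get_module<Realm::Hip::HipModule>("hip");
  hipStream_t stream = hip->get_task_hip_stream();
  hip->set_task_ctxsync_required(false); /* we sync the stream explicitly */
  /* ... launch work on stream and synchronize it before the task ends ... */
  (void)stream;
}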

lightsighter commented 1 day ago

> I'm confused... how do you do things like spawn tasks and such, which currently is only possible with C++ APIs in Realm AFAIK?

Regent mostly just calls Legion APIs and uses Legion's C API (the initial version of which @elliottslaughter built, and which I now maintain).

> can you not provide a wrapper for this yourself?

That's probably what we'll need to do. @elliottslaughter if you want to add a wrapper to the Legion C API for this then go ahead.

muraj commented 1 day ago

> Regent mostly just calls Legion APIs and uses Legion's C API

Ah, understood. I guess this is just another reason why Realm needs a C API...

> if you want to add a wrapper to the Legion C API for this then go ahead.

Agreed. Having Legion have a C API for this is probably for the best right now, at least until Realm properly supports a C API you can use. Yes, we have hipGetTaskStream, and no, that's not going away any time soon, though we might propose to deprecate it in the (hopefully near) future.

elliottslaughter commented 18 hours ago

There's an in-progress MR at https://gitlab.com/StanfordLegion/legion/-/merge_requests/1502

Please look at regent_gpu.cc and see if it looks the way you want. Note that the REALM_USE_* ifdefs are still required because otherwise the code will not link.

At this point it still seems to be getting mixed up because there are still some vestigial uses of CUDA Runtime APIs in the registration code. In particular I need equivalents for:

cudaGetDeviceCount
cudaSetDevice
cudaGetDevice

I'm sure the Driver API has equivalents, but a trivial Google search doesn't seem to be pulling them up....

muraj commented 15 hours ago

@elliottslaughter the driver API deals with contexts rather than devices, which is why you're probably not getting Google hits. Here are the equivalents you'll probably need:

cudaGetDeviceCount - cuDeviceGetCount
cudaSetDevice - cuCtxSetCurrent
cudaGetDevice - cuCtxGetCurrent or cuCtxGetDevice, depending on what you need.

Keep in mind that the cuCtx* APIs take a CUcontext. In order to get the same context that the runtime internally uses, you need to call cuDevicePrimaryCtxRetain, and make sure to release it at some point with cuDevicePrimaryCtxRelease. All threads in the application calling this function will get the same CUcontext for a device, assuming there is at least one reference to it (one outstanding retain without a release).
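
For example, a minimal sketch of that pattern (device 0 assumed, error handling abbreviated):

#include <cuda.h>

/* Sketch: adopt the same primary context the CUDA runtime uses internally.
   Modules loaded while this context is current are usable by any thread
   that makes the same primary context current. */
CUresult use_primary_context(void) {
  CUdevice dev;
  CUcontext ctx;
  CUresult res;
  if ((res = cuInit(0)) != CUDA_SUCCESS) return res;
  if ((res = cuDeviceGet(&dev, 0)) != CUDA_SUCCESS) return res;
  if ((res = cuDevicePrimaryCtxRetain(&ctx, dev)) != CUDA_SUCCESS) return res;
  if ((res = cuCtxSetCurrent(ctx)) != CUDA_SUCCESS) return res;
  /* ... cuModuleLoadData / cuLaunchKernel against this context ... */
  return cuDevicePrimaryCtxRelease(dev); /* pair every retain with a release */
}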

Hope this helps.

elliottslaughter commented 15 hours ago

Ok, I think I've got this working. I establish the context at:

https://gitlab.com/StanfordLegion/legion/-/blob/regent-no-cudart-hijack/language/src/regent/gpu/cuda.t#L353-357

One tricky thing is that I realized I cannot call cuDevicePrimaryCtxRelease, because if I do the CUmodule/CUfunction handles get invalidated. So I guess I just have to leak the context. Hopefully that's ok:

https://gitlab.com/StanfordLegion/legion/-/blob/regent-no-cudart-hijack/language/src/regent/gpu/cuda.t#L373-374

Local tests are passing so I'll run CI now.

muraj commented 15 hours ago

Leaking the context isn't ideal, but it should be fine.

elliottslaughter commented 15 hours ago

The MR here is passing in my local testing both with and without the Realm hijack:

https://gitlab.com/StanfordLegion/legion/-/merge_requests/1502