StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Regent picking up CUDA target when Legion is not built with CUDA #1685

Closed reazulhoque closed 6 months ago

reazulhoque commented 7 months ago

We have Regent picking up the CUDA target when the underlying Legion is not built with CUDA enabled (USE_CUDA=ON). This results in a segfault and needs to be addressed.

Stack Trace:

#0  0x0000000000000000 in ?? ()
#1  0x0000155522310067 in $main ()
#2  0x0000155522310032 in $wrapper ()
#3  0x000055555b738af3 in lj_vm_ffi_call ()
#4  0x000055555b6f4687 in lj_ccall_func ()
#5  0x000055555b6d8add in lj_cf_ffi_meta___call ()
#6  0x000055555b7366b6 in lj_BC_FUNCC ()
#7  0x000055555b6e4d09 in lua_pcall ()
#8  0x000055555687a242 in docall(lua_State*, int, int) ()
#9  0x000055555687982b in main ()
reazulhoque commented 7 months ago

@elliottslaughter @lightsighter please add information I might've missed. Thank you!

elliottslaughter commented 7 months ago

Please post the original stack trace so that users can find this via a search if they hit the same issue.

elliottslaughter commented 7 months ago

I think I understand some more about this failure now.

If you just build without CUDA (on a machine that has CUDA installed), e.g.:

$ module load cuda
$ CC=gcc CXX=g++ USE_GASNET=0 DEBUG=1 USE_CUDA=0 ./scripts/setup_env.py

Then running Regent will give you:

$ ./regent.py examples/circuit_sparse.rg -fcuda 1
GPU code generation failed since the cuInit function is missing (Regent might have been installed without CUDA support)

Regent has correctly identified that CUDA is not enabled, but as the message says, it does so by looking for the cuInit function via a dlsym lookup in the current global namespace. This means that if anything else loads CUDA into the process, Regent will assume CUDA is available and will attempt to build GPU kernels.

(The reason for -fcuda 1 here is to force the error; otherwise Regent will happily shut off CUDA codegen and you'll never know anything happened.)
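To make the mechanism concrete, here is a minimal C++ sketch of the kind of dlsym probe described above (illustration only; Regent's actual check lives in its own Terra/Lua code):

#include <dlfcn.h>
#include <cstdio>

int main() {
  // Probe the symbols already loaded into this process's global namespace.
  // If anything else loaded into the process provides cuInit (e.g. libcuda),
  // the probe succeeds even though Legion itself was built with USE_CUDA=0.
  void *sym = dlsym(RTLD_DEFAULT, "cuInit");
  if (sym)
    printf("cuInit found: CUDA appears to be available\n");
  else
    printf("cuInit not found: CUDA is not loaded in this process\n");
  return 0;
}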

It turns out that once you turn on GASNet, GASNet auto-detects CUDA and this causes CUDA to get linked into the process even when Legion has CUDA disabled.

$ CC=gcc CXX=g++ USE_GASNET=1 CONDUIT=ibv DEBUG=1 USE_CUDA=0 ./scripts/setup_env.py
$ ./regent.py examples/circuit_sparse.rg -fcuda 1
[(nil)]
/scratch/eslaught/legion-setup-env/language/terra/bin/terra(+0x61d291c) [0x56430b48191c]

You can see that Regent is no longer refusing to perform CUDA code generation. You can also clearly see that we have linked CUDA into Legion:

$ ldd ../bindings/regent/libregent.so | grep cuda
        libcuda.so.1 => /lib/x86_64-linux-gnu/libcuda.so.1 (0x00007ffbbc9e8000)
        libicudata.so.66 => /lib/x86_64-linux-gnu/libicudata.so.66 (0x00007ffbba1be000)

So, I suppose Regent should NOT treat the availability of CUDA in the process as a reliable indication of whether Legion was actually built with CUDA, and we should investigate another mechanism.

elliottslaughter commented 7 months ago

The other reason why this was really difficult to track down was that the backtraces we started with just pointed to somewhere in main:

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000155554e52066 in $main () at /home/scratch.mbauer_research/legion/language/src/regent/std.t:3847
#2  0x0000155554e52032 in $wrapper () at /home/scratch.mbauer_research/legion/language/src/regent/std.t:4715
#3  0x000055555b738af3 in lj_vm_ffi_call ()
#4  0x000055555b6f4687 in lj_ccall_func ()
#5  0x000055555b6d8add in lj_cf_ffi_meta___call ()
#6  0x000055555b7366b6 in lj_BC_FUNCC ()
#7  0x000055555b6e4d09 in lua_pcall ()
#8  0x000055555687a242 in docall(lua_State*, int, int) ()
#9  0x000055555687982b in main ()

And initial debugging from @lightsighter pointed to a failure so early in main that I didn't see a way for this to be anything except bad code generation or memory corruption:

Dump of assembler code for function $main:
   0x0000155554e52040 <+0>: push   %rbp
   0x0000155554e52041 <+1>: push   %r15
   0x0000155554e52043 <+3>: push   %r14
   0x0000155554e52045 <+5>: push   %r13
   0x0000155554e52047 <+7>: push   %r12
   0x0000155554e52049 <+9>: push   %rbx
   0x0000155554e5204a <+10>:    sub    $0x808,%rsp
   0x0000155554e52051 <+17>:    mov    %edi,0x34(%rsp)
   0x0000155554e52055 <+21>:    mov    %rsi,0x38(%rsp)
   0x0000155554e5205a <+26>:    movabs $0x0,%rax
   0x0000155554e52064 <+36>:    call   *%rax
=> 0x0000155554e52066 <+38>:    mov    %rax,%rbp
   0x0000155554e52069 <+41>:    movabs $0x900000000,%rax
...
(gdb) p $rax
$1 = 0
(gdb) p $rbp
$2 = (void *) 0x7fffffffc820

I still don't know exactly why this is the failure that results from doing CUDA code generation in a process where Legion has been built without CUDA. We went through a number of rabbit holes that didn't ultimately go anywhere until we stumbled on the fact that the issue does not reproduce without GASNet, which led to wondering what GASNet was actually changing.

elliottslaughter commented 7 months ago

The most immediate workaround for this issue is just passing -fcuda 0 when Legion does not have CUDA enabled.
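For example, with the same test as above:

$ ./regent.py examples/circuit_sparse.rg -fcuda 0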

In terms of a longer-term solution, I'm trying to figure out how to detect whether Legion has been built with CUDA or not. Unfortunately, Regent does not seem to be picking up any LEGION_USE_* variables:

$ cat test_legion_defines.t
import "regent"
for k, v in pairs(regentlib.c) do print(k, v) end
$ ./regent.py test_legion_defines.t | grep LEGION_USE
$ ./regent.py test_legion_defines.t | grep REALM_USE
REALM_USE_CACHING_ALLOCATOR     0
$ ./regent.py test_legion_defines.t | grep REGENT_USE
REGENT_USE_HIJACK       0

I believe the reason we only find these two symbols is that Terra only parses #defines with constant values. If you write #define VAR with no value, Terra won't parse it.
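For example (an illustrative header snippet, not the exact contents of the Legion/Realm headers):

/* Defined with no value: Terra's header parsing skips it, so it never
   shows up in regentlib.c. */
#define REALM_USE_CUDA

/* Defined with a constant value: visible as a field of regentlib.c. */
#define REGENT_USE_HIJACK 0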

I suppose that, like REGENT_USE_HIJACK, I could establish a new Regent-specific variable for CUDA:

#ifdef REALM_USE_CUDA
#define REGENT_USE_CUDA 1
#endif

Just to give me something to look up. Or we could define the Legion/Realm variable to have a 1 value so I can check it. @lightsighter do you have a preference?

lightsighter commented 7 months ago

> Just to give me something to look up. Or we could define the Legion/Realm variable to have a 1 value so I can check it. @lightsighter do you have a preference?

I said this in Zulip, but posting it here too. I think it would be better to use the new Realm machine configuration API to programmatically detect whether Realm has been built with CUDA support. You can do that by checking to see if there is a "cuda" module: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/runtime_impl.h?ref_type=heads#L370 If you pass in "cuda" there and get back a null pointer, then there's no CUDA support.
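A rough C++ sketch of that approach (assuming the public Realm::Runtime::get_module_config() entry point of the machine configuration API; the exact names may differ in the actual headers):

#include "realm.h"

// Returns true if this Realm build includes the "cuda" module.
// A null pointer from get_module_config() would mean the module was not
// compiled in, i.e. Realm has no CUDA support.
bool realm_has_cuda(Realm::Runtime &runtime) {
  Realm::ModuleConfig *config = runtime.get_module_config("cuda");
  return config != nullptr;
}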

elliottslaughter commented 7 months ago

Here's a potential fix. It does not use the machine configuration API, but establishes a define inside the regent.h header like we did for the hijack.

https://gitlab.com/StanfordLegion/legion/-/merge_requests/1222

If we move to a mode in which Realm has truly dynamic modules, we may need something fancier, but for now this seems fine.

elliottslaughter commented 7 months ago

@reazulhoque can you confirm the patch works for your case?

elliottslaughter commented 6 months ago

This has been merged to master. @reazulhoque, please confirm the fix.

reazulhoque commented 6 months ago

@elliottslaughter it works. Thank you for the quick fix!