Closed syamajala closed 5 months ago
@muraj Can you take a look at this?
I could really use some help with this, as I'm trying to get some Gordon Bell runs done on Perlmutter.
After talking with @syamajala, we figured out the issue: we build Realm with CUDART_HIJACK=ON, but the Cray wrappers link cudart automatically, so the hijack is not active at runtime and none of these kernels are registered with Realm.
Yeah, the Cray wrappers are broken and have no way to turn off linking against cudart. I opened a NERSC ticket about this a year ago. Their solution was for me to manually link things by hand and just remove -lcudart from the flags. They closed the ticket.
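For anyone else hitting this, here is a rough sketch of the "remove -lcudart by hand" workaround described above. It assumes you already have the full link command as a string, however you obtained it. The helper name `strip_cudart` and the example command line are hypothetical and not part of Realm, Legion, or the Cray tooling.

```python
import shlex


def strip_cudart(link_command: str) -> str:
    """Return the link command with the -lcudart flag removed.

    Hypothetical helper: it only drops the exact token "-lcudart".
    Other spellings (for example a full path to the library) would
    need to be handled separately.
    """
    # Split like a POSIX shell so quoted arguments stay intact.
    args = shlex.split(link_command)
    # Keep every argument except the one that pulls in the CUDA runtime.
    kept = [arg for arg in args if arg != "-lcudart"]
    # Re-join into a shell-safe command string.
    return shlex.join(kept)


if __name__ == "__main__":
    # Made-up link line, for illustration only.
    example = "g++ main.o -L/opt/cuda/lib64 -lcudart -lrealm -llegion -o app"
    print(strip_cudart(example))
```

The filtered command can then be run directly instead of going through the wrapper's default link step. Whether that is enough for the hijack to take effect depends on how the rest of the build is set up.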
I am able to do my runs now.
I'm hitting the following assertion in Realm:
This only seems to be happening on Perlmutter. I was able to run on blaze and sapling without any problems. I tried cuda 11.7, 12.0, and 12.2 on Perlmutter, but they all have the same issue.
Here is a stack trace: