celeritas-project / celeritas

Celeritas is a new Monte Carlo transport code designed to accelerate scientific discovery in high energy physics by improving detector simulation throughput and energy efficiency using GPUs.
https://celeritas-project.github.io/celeritas/user/index.html

Undefined references with CudaRDCUtils and vecgeom #1156

Open drbenmorgan opened 5 months ago

drbenmorgan commented 5 months ago

Whilst we've known for a while that linking VecGeom requires --no-as-needed to be explicitly passed to the linker on platforms that enable as-needed by default (e.g. Debian, Ubuntu), I think there's a more general issue/bug in the link/object structure created by CudaRDCUtils. AFAICT, the problem comes from as-needed making library link order significant.

If we build Celeritas with LDFLAGS=-Wl,--as-needed on Alma9, then we get errors like the following (stripped down to highlight the causes):

[795/970] Linking CXX executable test/celeritas/celeritas_user_Diagnostic
FAILED: test/celeritas/celeritas_user_Diagnostic 
: && /usr/bin/c++ -Wall -Wextra -pedantic -fdiagnostics-color=always -O3 -DNDEBUG -Wl,--as-needed   
... 
/.../.spack-env/view/lib64/libvecgeomcuda.so  
/.../.spack-env/view/lib64/libvecgeom.a 
...
/usr/bin/ld: /.../.spack-env/view/lib64/libvecgeom.a(UnplacedCone.cpp.o): undefined reference to symbol '_ZNK7vecgeom3cxx9DevicePtrINS_4cuda13SUnplacedConeINS2_9ConeTypes13UniversalConeEEEE9ConstructIJdddddddEEEvDpT_'
/usr/bin/ld: /.../.spack-env/view/lib64/libvecgeomcuda.so: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status

The problem is that libvecgeom.a (or the .so, were that used instead) needs symbols defined in libvecgeomcuda.so, but because libvecgeomcuda.so appears before libvecgeom.a on the link line, nothing references it at the point it is processed and the linker discards it. That this is the case can be shown by modifying the link line to:

[795/970] Linking CXX executable test/celeritas/celeritas_user_Diagnostic
FAILED: test/celeritas/celeritas_user_Diagnostic 
: && /usr/bin/c++ -Wall -Wextra -pedantic -fdiagnostics-color=always -O3 -DNDEBUG -Wl,--as-needed   
... 
/.../.spack-env/view/lib64/libvecgeomcuda.so  
/.../.spack-env/view/lib64/libvecgeom.a 
**/.../.spack-env/view/lib64/libvecgeomcuda.so**

...
/usr/bin/ld: /.../.spack-env/view/lib64/libvecgeom.a(UnplacedCone.cpp.o): undefined reference to symbol '_ZNK7vecgeom3cxx9DevicePtrINS_4cuda13SUnplacedConeINS2_9ConeTypes13UniversalConeEEEE9ConstructIJdddddddEEEvDpT_'
/usr/bin/ld: /.../.spack-env/view/lib64/libvecgeomcuda.so: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status

which will resolve the vecgeom error but create a whole new set:

/usr/bin/ld: lib64/libceleritas.so: undefined reference to `__cudaRegisterLinkedBinary_5cf4974c_31_AlongStepGeneralLinearAction_cu_f9b8e781'

In this case it's because the link line lists libceleritas_final.so before libceleritas.so, and the latter needs symbols from the former. Adding libceleritas_final.so again after libceleritas.so fixes the link error, illustrating that this is a general problem with the RDC link structure.
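To make the ordering argument concrete, here is a toy model of a single left-to-right pass over a link line with as-needed semantics. It is only a sketch under simplifying assumptions: each library is treated as one unit, a library is kept only if it defines a symbol that is already pending when it is reached, and the symbol names (`geom_symbol`, `cuda_symbol`) are hypothetical placeholders rather than real VecGeom symbols. It does not reproduce everything GNU ld does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkInput:
    """One entry on the link line (simplified: a library is a single unit)."""
    name: str
    kind: str  # "object", "archive", or "shared"
    defines: frozenset = frozenset()
    needs: frozenset = frozenset()


def unresolved_symbols(link_line):
    """Return the symbols still undefined after one left-to-right pass."""
    defined, undefined = set(), set()
    for item in link_line:
        # Objects are always linked. A library is kept only if it defines
        # something that is already pending at this point in the pass.
        if item.kind != "object" and not (item.defines & undefined):
            continue
        defined |= item.defines
        undefined |= item.needs
        undefined -= defined
    return undefined


# Hypothetical symbol names, chosen only to mirror the dependency direction
# described in the issue (libvecgeom.a needs something from libvecgeomcuda.so).
main = LinkInput("main.o", "object", needs=frozenset({"geom_symbol"}))
cuda = LinkInput("libvecgeomcuda.so", "shared",
                 defines=frozenset({"cuda_symbol"}))
geom = LinkInput("libvecgeom.a", "archive",
                 defines=frozenset({"geom_symbol"}),
                 needs=frozenset({"cuda_symbol"}))

# Provider listed before its consumer: it is dropped, so a symbol stays pending.
print(unresolved_symbols([main, cuda, geom]))
# Provider repeated after its consumer: everything resolves.
print(unresolved_symbols([main, cuda, geom, cuda]))
```

In this model the same pattern applies to libceleritas_final.so and libceleritas.so: whichever library provides the symbols has to appear (or reappear) after the library that consumes them.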

We should fix this in CudaRDCUtils, though I'm not exactly sure how right now, so comments and discussion are welcome.