Support VecGeom 1.2 series

sethrj commented 2 years ago

Update our CMake to support the newer version of VecGeom. Don't require it until #421 .

sethrj commented 2 years ago

I have an environment on celeritas with vecgeom 1.2.0, and it built, but unfortunately everything that even links against vecgeom ends up with a CUDA error when launching a kernel, e.g.:

unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidDeviceFunction: invalid device function" thrown in the test body.

mrguilima commented 2 years ago

I can build Celeritas with vecgeom 1.2.0 built with:

-DVECGEOM_GDML=ON  -DVECGEOM_ENABLE_CUDA=ON  -DCMAKE_CUDA_ARCHITECTURES=70

Moreover, most of the tests pass, but celeritas_global_Stepper and app/demo-loop fail. I am trying to find out how to fix these two tests.

drbenmorgan commented 2 years ago

Some additional info on trying this on a CentOS7 box - build setup is Spack environment with GCC 10, CUDA 11.4, CUDA_ARCHITECTURES is 61 (a Quadro P400) and C++17 throughout. With VecGeom 1.2.0, celeritas builds without warning/error, but there are two sets of failures that sound the same/similar to those @sethrj noted:

5 - corecel/Collection (Failed)

[ RUN      ] SimpleCollectionTest.algo_device
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidDeviceFunction: invalid device function" thrown in the   test body.
[  FAILED  ] SimpleCollectionTest.algo_device (0 ms)
[----------] 3 tests from SimpleCollectionTest (0 ms total)

10 - corecel/StackAllocator (Failed)
90 - celeritas/phys/Physics (Failed)
109 - app/demo-interactor (Failed)

then the second case (with some differences in the line number in VecgeomParams:

58 - celeritas/ext/Vecgeom (Failed)

src/celeritas/ext/VecgeomParams.cc:76: debug: Converting to CUDA geometry
src/celeritas/ext/VecgeomParams.cc:84: debug: Transferring geometry to GPU
CUDA reported error with message: "invalid device function"
CUDA reported error with message: "invalid device function"
unknown file: Failure
C++ exception with description   "src/celeritas/ext/VecgeomParams.cc:91: celeritas: cuda error: invalid device function cudaPeekAtLastError()" thrown in SetUp().

66 - celeritas/field/FieldPropagator (Failed)
68 - celeritas/field/UserParamField (Failed)
69 - celeritas/field/UserMapField (Failed)
106 - celeritas/track/TrackInit (Failed)
111 - app/demo-rasterizer (Failed)

@mrguilima, do you build VecGeom with both static and shared libs, or just one? The above definitely smell of the old issue of linking both the shared/static cudart library. I'll try and trace through the link commands of the above and input libs to see if this is happening.

mrguilima commented 2 years ago

For the libraries to be used by Celeritas, I have been building VecGeom with -DBUILD_SHARED_LIBS=OFF.

drbenmorgan commented 2 years ago

Thanks @mrguilima, I'll try the static build - that's certainly indicative of a cudart issue as static vecgeom should "just work" in all downstream cases.

drbenmorgan commented 2 years ago

For info, I've just tried builds with

Static VecGeom + default celeritas (so shared celeritas libs): has same errors as reported above
Static VecGeom + static celeritas: seems fine

That's odd, but at least means I've got working/failing builds to compare.

mrguilima commented 2 years ago

Another data point: in my local builds (gcc 9.2.0, cuda 11.4.1, celeritas master, vecgeom 1.2.0), and the demo-loop tests are running successfully now. However, the test/celeritas_global_Stepper runs ok in Debug mode, but it fails on RelWithDebInfo mode:

(cuda-gdb) where
#0  0x0000000001820b50 in vecgeom::cuda::BVH::CheckDaughterIntersections (this=<optimized out>, localpoint=...,
    localdir=..., step=<optimized out>, last=<optimized out>, hitcandidate=<optimized out>)
    at /work1/celeritas/lima/work/cele/install/include/VecGeom/base/BVH.h:94
#1  0x000000000181a160 in celeritas::detail::BVHNavigator::ComputeStepAndHit (localpoint=..., localdir=...,
    step_limit=8.7608759681590181e+22, in_state=..., out_state=..., hitcandidate=<error reading variable>)
    at /work1/celeritas/lima/work/cele/src/celeritas/ext/detail/BVHNavigator.hh:122
#2  0x00000000017d24d0 in celeritas::detail::BVHNavigator::ComputeStepAndNextVolume (globalpoint=..., globaldir=...,
    step_limit=8.7608759681590181e+22, in_state=..., out_state=..., push=0)
    at /work1/celeritas/lima/work/cele/src/celeritas/ext/detail/BVHNavigator.hh:322
#3  0x0000000001840fe0 in celeritas::VecgeomTrackView::find_next_step (this=0x7fffdaffef58,
    max_step=8.7608759681590181e+22) at /work1/celeritas/lima/work/cele/src/celeritas/ext/VecgeomTrackView.hh:280
#4  0x000000000199b730 in celeritas::LinearPropagator::operator() (this=0x7fffdafff0e0, dist=8.7608759681590181e+22)
    at /work1/celeritas/lima/work/cele/src/celeritas/field/LinearPropagator.hh:76
#5  0x00000000017fb050 in celeritas::detail::along_step_track (track=...)
    at /work1/celeritas/lima/work/cele/src/celeritas/global/generated/../detail/AlongStepActionImpl.hh:128
#6  0x000000000183d4d0 in celeritas::detail::TrackLauncher<void (&)(celeritas::CoreTrackView const&)>::operator() (
    this=0x7fffdafffbe0, thread=...)
    at /work1/celeritas/lima/work/cele/src/celeritas/track/detail/TrackLauncherImpl.hh:41
#7  0x00000000018595e0 in __nv_static_63__50_tmpxft_00006ba3_00000000_7_AlongStepAction_cpp1_ii_16c32506__ZN9celeritas9generated74_GLOBAL__N__50_tmpxft_00006ba3_00000000_7_AlongStepAction_cpp1_ii_16c3250617along_step_kernelENS_7CoreRefILNS_8MemSpaceE1EEE<<<(25,1,1),(256,1,1)>>> (data=...)
    at /work1/celeritas/lima/work/cele/src/celeritas/global/generated/AlongStepAction.cu:33
(cuda-gdb)

Any suggestions?

pcanal commented 2 years ago

static VecGeom + default celeritas (so shared celeritas libs): has same errors as reported above

Errors are expected. You can not mix statically linked "relocatable cuda code" with shared linked "relocatable cuda code" (but usually it is an early problems)

sethrj commented 2 years ago

@mrguilima Can you add details of the errors you've recently encountered with your spack VecGeom 1.2 build here?

@drbenmorgan Have you had a chance to dig further into the errors?

mrguilima commented 2 years ago

Here is one CMake configuration issue (cmake 3.23.0):

Running ccmake on Celeritas with -DBUILD_SHARED_LIBS=ON configures fine, but with that set to OFF I get this message at generation step:

CMake Error: install(EXPORT "celeritas-targets" ...) includes target "celeritas" more than once in the export set. Generating done

drbenmorgan commented 2 years ago

@sethrj, sorry been a little side tracked, but now on top of my stack.

@mrguilima, I've been able to reproduce this - I think celeritas always assumes shared libs will be built, but in static only mode you end up with duplicated target names from the RDC library creation. I'll try and prep a fix for this.

sethrj commented 2 years ago

@mrguilima That last error you're seeing sounds like an oversight in the install(TARGETS code due to the extra library complexity. I'll see if I can reproduce and fix.

sethrj commented 2 years ago

@drbenmorgan we were writing at literally the same time apparently, and I just now saw your comment! I have a three-line fix. This issue is independent of the vecgeom version or the issues we're seeing here.

drbenmorgan commented 2 years ago

Thanks @sethrj, indeed the fix is independent of the main issue here (but will be good to have #464 merged to help out).

mrguilima commented 2 years ago

More data points (I am using gcc 9.2.0, cuda 11.4.1 and vecgeom 1.2.0). I have built (no build errors) and ran tests for all combinations of VecGeom's (v1.2.0) and Celeritas' shared/static builds (CMake variables DCMAKE_SHARED_LIBS=ON / OFF).

The results are that it doesn't matter how VecGeom is built (shared or static), all tests will pass if Celeritas is built STATIC, and some tests will fail if Celeritas is built in SHARED mode.

Here is a list of the 10 tests which are failing on both modes of Celeritas:

The following tests did not run: 106 - celeritas/random/CurandPerformance (Disabled)

The following tests FAILED: 6 - corecel/Collection (Failed) 11 - corecel/StackAllocator (Failed) 60 - celeritas/ext/Vecgeom (Failed) 66 - celeritas/field/FieldPropagator (Failed) 68 - celeritas/field/UserParamField (Failed) 69 - celeritas/field/UserMapField (Failed) 91 - celeritas/phys/Physics (Failed) 107 - celeritas/track/TrackInit (Failed) 110 - app/demo-interactor (Failed) 112 - app/demo-rasterizer (Failed)

sethrj commented 2 years ago

@mrguilima That's consistent with what I've seen: basically anything fails if it launches a CUDA kernel but is linked against vecgeom (if using shared libs).

sethrj commented 2 years ago

@pcanal has been working on this (he managed to get a successful build with vecgeom 1.2 as long as an older cmake was used!), and it seems to be related to -lcudart_static: see https://github.com/celeritas-project/celeritas/pull/10 .

sethrj commented 2 years ago

From @pcanal and my debugging conversation: P:

and you seem to be right, it seems to be an order of library loading difference. So I will figure exactly which change it is and why it matters. I confirm that it is my old friend -lcudart_static that is back (this library and -lcudart are dead ennemies but the linker does not seems to notice) So, “easy” enough, all I need to do is in celeritas_target_link_libraries whose target is an executable I need to ‘inherit’ the information on whether this is a shared or static cuda run-time. Should be fixed soon. In the case where celeritas’ CMakeList.txt reads:
list(APPEND _demo_loop_libs CUDA::cudart)
was the intent to (a) add the CUDA shared library runtime environment or (b) add one of cudart or cudart_static as ‘appropriate’? (in practice cmake does (a))

S:

Hmm I guess I expected a and b to be the same? Cmake usually hides whether something is static or not

P:

There is both CUDA::cudart and CUDA::cudart_static and explicit add respectively …/libcudart.so and ../libcudart_static.a)

P:

as side note they are necessary with the current trunks to link Celeritas+static with VecGeom+shared but are not needed for Celeritas+static with VecGeom+static nor is it needed for Celeritas+shared with VecGeom+shared) ; [I am not yet sure why it is needed in the mixed case]

S:

I would guess that in shared + shared, the upstream .so is transitively available because of linking against vecgeom; and perhaps in static + static it's added in the final link step by cmake? The mixed case would make sense for static vecgeom + shared celeritas if we have the "no unresolved symbols" check on: in that case, the cuda library symbols won't be available unless we explicitly link it in

So @pcanal is a pretty reliable solution as easy as:

--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -85,6 +85,9 @@ if(BUILD_SHARED_LIBS AND NOT DEFINED CMAKE_INSTALL_RPATH)
   # Inform installed binaries of internal library rpaths
   set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_FULL_LIBDIR}")
 endif()
+if(NOT DEFINED CMAKE_CUDA_RUNTIME_LIBRARY)
+  set(CMAKE_CUDA_RUNTIME_LIBRARY Static)
+endif()

 # Build flags
 option(CELERITAS_DEBUG "Enable runtime assertions" ON)
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index b0bca5eb..149141a0 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -218,10 +218,14 @@ if(CELERITAS_USE_CUDA OR CELERITAS_USE_HIP)
 endif()

 if(CELERITAS_USE_CUDA)
+  set(_cudartsuffix)
+  if(CMAKE_CUDA_RUNTIME_LIBRARY STREQUAL "Static")
+    set(_cudartsuffix _static)
+  endif()
   if(CELERITAS_RNG STREQUAL "CURAND")
-    list(APPEND PUBLIC_DEPS CUDA::cudart)
+    list(APPEND PUBLIC_DEPS CUDA::cudart${_cudartsuffix})
   else()
-    list(APPEND PRIVATE_DEPS CUDA::cudart)
+    list(APPEND PRIVATE_DEPS CUDA::cudart${_cudartsuffix})
   endif()
 elseif(CELERITAS_USE_HIP)
   set_source_files_properties(

pcanal commented 2 years ago

It is unclear. In first approximation, I was thinking:

set(CMAKE_CUDA_RUNTIME_LIBRARY Shared)

To be the safest but I also think that we might be able to calculate it based on the actual libraries used (if any library used has any 'shared' RDC, you would need the Shared runtime environment) but before I code a solution, I want to recheck that we correctly understand why/how the wrong runtime get added [for example, I retested shared+static-explicitly-adding-runtime and this time it linked, so I am missing information].

sethrj commented 2 years ago

Follow-up conversation with @pcanal : since the problems seem to all stem from conflicting cudart and cudart_shared, we can avoid at least one instance of the conflicting libraries by replacing the CUDA::cudart target link with

  target_include_directories(celeritas SYSTEM ${_visibility} ${CUDAToolkit_INCLUDE_DIRS})

Then we can rely on the CUDA linker implicitly adding the correct libraries, or using the CMAKE_CUDA_RUNTIME_LIBRARY property.

sethrj commented 2 years ago

@pcanal I'm comparing (on my branch in #495) the build.ninja files between no-vecgeom, vecgeom 1.1.18, and vecgeom 1.2 for test/corecel_Collection:

for l in src/libceleritas.so CMakeFiles/celeritas_final.dir/cmake_device_link.o src/libceleritas_final.so CMakeFiles/corecel_Collection.dir/cmake_device_link.o test/corecel_Collection; do echo "~~~ $l ~~~"; grep -A4 "build $l" build.ninja | grep LINK_LIBRARIES ; done

These are building with both VecGeom and Celeritas using shared libs. (Vecgeom is built through spack.) From the vecgeom build output, it looks like it build with shared cudart and ldd shows it linking against that as well: I don't see any cudart_static in either vecgeom build output.

In my configuration (CUDA 11.5, CMake 3.23.1) -DCMAKE_CUDA_RUNTIME_LIBRARY=Static and -U CMAKE_CUDA_RUNTIME_LIBRARY produced identical behavior (i.e. the default is static).

In the table below, /libcudart.so stands for the linker using the full path to the cudart.so library, whereas -lcudart is just the linker flag.

VecGeom	CMAKE_CUDA_RUNTIME_LIBRARY	Celeritas	celeritas	celeritas_final dlink	celeritas_final link	test dlink	test link	Works?
—	—	#495	-lcudart_static	—	—	—	-lcudart_static	✅
—	—	master	/libcudart.so -lcudart_static	—	—	—	-lcudart_static	✅
—	Shared	#495	-lcudart	—	—	—	-lcudart	✅
—	Shared	master	/libcudart.so -lcudart	—	—	—	-lcudart	✅
1.1.18	—	#495	/libcudart.so -lcudart	-lcudart	/libcudart.so -lcudart	-lcudart_static	/libcudart.so -lcudart_static	✅
1.1.18	—	#497 ~shared	-lcudart	-lcudart	—	-lcudart	-lcudart	✅
1.1.18	—	master	/libcudart.so /libcudart.so -lcudart	-lcudart	/libcudart.so -lcudart	-lcudart_static	/libcudart.so -lcudart_static	✅
1.1.18	Shared	#495	/libcudart.so -lcudart	-lcudart	/libcudart.so -lcudart	-lcudart	/libcudart.so -lcudart	✅
1.1.18	Shared	master	/libcudart.so /libcudart.so -lcudart	-lcudart	/libcudart.so -lcudart	-lcudart	/libcudart.so -lcudart	✅
1.2.0	—	#495	-lcudart	-lcudart	-lcudart	-lcudart_static	-lcudart_static	❌
1.2.0	—	master	/libcudart.so -lcudart	-lcudart	-lcudart	-lcudart_static	-lcudart_static	❌
1.2.0	—	#495 ~shared	/libcudart.so -lcudart_static	-lcudart_static	—	-lcudart_static	-lcudart_static	✅
1.2.0	—	#497 ~shared	-lcudart	-lcudart	—	-lcudart	-lcudart	❌
1.2.0	Shared	master ~shared	/libcudart.so -lcudart	-lcudart_static	—	-lcudart	/libcudart.so -lcudart	✅
1.2.0	Shared	#495 ~shared	-lcudart	-lcudart_static	—	-lcudart	-lcudart	✅
1.2.0	Shared	#495	-lcudart	-lcudart	-lcudart	-lcudart	-lcudart	✅
1.2.0	Shared	master	/libcudart.so -lcudart	-lcudart	-lcudart	-lcudart	-lcudart	✅

Conclusions:

It's probably the right move to get rid of the CUDA::cudart link, but without vecgeom in the picture it doesn't seem to cause any trouble (even when linking against both static and shared).
Mixed shared/static seems to work for vecgeom 1.1.18 but not 1.2.0.
Shared CUDA libraries and static Celeritas builds always seem to work.

I also tried building VecGeom 1.2.0 with static libraries instead of shared and found no change in behavior. It looks like 1.2.0 always builds with -lcudart and not with -lcudart_static which is a little strange.

pcanal commented 2 years ago

See https://github.com/celerit`as-project/celeritas/pull/497 which fixes the broken build.

On the chart, note that libcudart and libcudart_static have (as far as I can tell), 'duplicate' weak symbols that are incompatible. I did not get narrow down how and when the problem/mix-up become fatals; one of the case where it does happen is when both of them are on the nvlink command line. Most of the issues I have seen where we end with a kernel not being able to start ended up being a case where the 2 where mixed up (but indeed as you see in the chart, sometimes it works likely because the way things falls, the "right" weak symbols is being picked up).

celeritas-project / celeritas

Support VecGeom 1.2 series #442