LLNL / axom

CS infrastructure components for HPC applications
BSD 3-Clause "New" or "Revised" License

ptxas fatal: unresolved external functions (Primal) #795

Open samuelpmishLLNL opened 2 years ago

samuelpmishLLNL commented 2 years ago

When building axom with cuda support for serac (with NVCC 11.4 on Ubuntu 20.04), I'm seeing more unresolved extern function errors coming from primal. It's hard to tell how many there actually are, since the compiler just admits defeat after reporting the first one. I get:

ptxas fatal : Unresolved extern function '_ZN4axom6primal6detail28intersectOnePermutedTriangleERKNS0_5PointIdLi3EEES5_S5_S5_S5_S5_dddRNS0_6VectorIdLi3EEEbd'

which demangles to

axom::primal::detail::intersectOnePermutedTriangle(axom::primal::Point<double, 3> const&, axom::primal::Point<double, 3> const&, axom::primal::Point<double, 3> const&, axom::primal::Point<double, 3> const&, axom::primal::Point<double, 3> const&, axom::primal::Point<double, 3> const&, double, double, double, axom::primal::Vector<double, 3>&, bool, double)

which is declared here and defined here.

Sure enough, this function (and others around it) are declared AXOM_HOST_DEVICE in their .hpp file, and then defined without annotations in the .cpp file. Note: simply adding the AXOM_HOST_DEVICE annotation on these functions does not resolve the issue.
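
For illustration, a minimal sketch of the pattern being described, using hypothetical names rather than the actual primal functions: the header promises a device-callable function, but the out-of-line definition carries no annotation, which is consistent with the unresolved extern that ptxas reports above.

// clamp_example.hpp -- the declaration promises host *and* device versions
AXOM_HOST_DEVICE double clamp01(double x);

// clamp_example.cpp -- the definition has no annotation, so (as observed
// above) no device code ends up being available to kernels that call it
double clamp01(double x) { return x < 0.0 ? 0.0 : (x > 1.0 ? 1.0 : x); }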

It may be possible to move the implementations of these functions back into the headers as an inelegant workaround (as was done in https://github.com/LLNL/axom/pull/788), but the commit history on this file includes comments like

"Primal in [sic] no longer header-only."

which makes me think that it was an intentional design decision to arrange them this way.

As is, it seems impossible to use these functions in any CUDA kernels, so one possible solution might be to just remove the AXOM_HOST_DEVICE annotations from their declarations.

The full command for compiling that translation unit is given below (although the actual error is a linker error that appears later):

  "command": "/usr/local/cuda-11.4/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/usr/bin/mpic++ -DAXOM_DEBUG
-DCAMP_HAVE_CUDA -I/home/sam/code/serac/build/include -I/home/sam/code/serac_libs_cuda/gcc-9.3.0/camp-0.3.0/include
-I/home/sam/code/serac_libs_cuda/gcc-9.3.0/raja-0.14.0/include -I/home/sam/code/serac/axom/src/axom/core/../.. 
-isystem=/home/sam/code/serac/axom/src/thirdparty -isystem=/home/sam/code/serac_libs_cuda/gcc-9.3.0/umpire-6.0.0serac/include
 -isystem=/usr/local/cuda-11.4/include -arch sm_61 -pthread -G --expt-extended-lambda --expt-relaxed-constexpr  
--expt-extended-lambda -O2 -g -DNDEBUG --generate-code=arch=compute_61,code=[compute_61,sm_61] -Xcompiler=-fPIC
-Xcompiler=-fopenmp -std=c++17 -x cu -c /home/sam/code/serac/axom/src/axom/primal/operators/detail/intersect_impl.cpp -o
CMakeFiles/primal.dir/operators/detail/intersect_impl.cpp.o",

It is worth noting that this issue only shows up when adding the -G flag to enable debugging of CUDA kernels. It's also not obvious to me why this .cpp file is being treated as a CUDA source file when it appears to do nothing GPU-related.

publixsubfan commented 2 years ago

Note: simply adding the AXOM_HOST_DEVICE annotation on these functions does not resolve the issue.

Out of curiosity, is this because we aren't building with relocatable device code enabled (as mentioned in #788)?

white238 commented 2 years ago

I tried turning on separable compilation and it made no difference. I hit this Monday in my current PR for reasons I can't explain. I am deep into trying to make this header compile.

samuelpmishLLNL commented 2 years ago

Note: simply adding the AXOM_HOST_DEVICE annotation on these functions does not resolve the issue.

Out of curiosity, is this because we aren't building with relocatable device code enabled (as mentioned in #788)?

I can't really make enough sense of what axom is doing in its build system to answer. However, this pattern (a mixed C++/CUDA project, with separable compilation) is not hard to accomplish in a cmake project, see for example: serac.zip

edit: probably easier to just post a github link, rather than a .zip file: https://github.com/samuelpmishLLNL/serac_build_example

trws commented 2 years ago

Taking a look at this after @white238 brought it up earlier. FWIW, not using RDC is almost certainly part of the problem. Clearly not the whole thing, since @white238 tried it and it didn't take care of it, but it can't link against separately compiled implementations without it. I'm actually surprised it's that symbol, and not one of the ones that are marked inline despite not being defined in the header; those should all be causing breakage at least at link time, if not compile time.

On this point:

It's also not obvious to me why this .cpp file is being treated as a CUDA source file when it appears to do nothing GPU-related.

These functions are marked host/device, so they have to be compiled for the device, and the file has to be treated as a CUDA file, if that's going to work.

We're 90ish% of the way to having a working clang cuda build for the raja suite btw, which might give some different/useful errors for this if you'd like to take a crack at it.

samuelpmishLLNL commented 2 years ago

Taking a look at this after @white238 brought it up earlier. FWIW, not using RDC is almost certainly part of the problem. Clearly not the whole thing, since @white238 tried it and it didn't take care of it, but it can't link against separately compiled implementations without it. I'm actually surprised it's that symbol, and not one of the ones that are marked inline despite not being defined in the header; those should all be causing breakage at least at link time, if not compile time.

The example repo in my earlier comment shows how to address this kind of issue. Just using CMake's native CUDA support:

set_target_properties(my_lib PROPERTIES CUDA_SEPARABLE_COMPILATION ON)

links it properly and emits the appropriate flags (-rdc=true). If I remember correctly, using CMake's "object library" feature led to issues with separable compilation.


On this point:

It's also not obvious to me why this .cpp file is being treated as a CUDA source file when it appears to do nothing GPU-related.

These functions are marked host/device, so they have to be compiled for the device, and the file has to be treated as a CUDA file, if that's going to work.

You're right, there are host/device annotations on some of these functions after all, but they're only on the function declarations, not the definitions. I didn't even know that was valid syntax. This seems like a dangerous convention to me, because those annotations impose constraints on what you can and can't do in the function, and omitting the annotations from the definition obfuscates those requirements.

It's a moot point now, as this file seems to have disappeared since this issue was created.

However, my point about not just compiling everything with nvcc still stands: there are significant drawbacks to using nvcc to compile regular C++ files (limited features, more compiler bugs, considerably longer compilation time). For those reasons, it can be beneficial to separate translation units into two categories:

- regular C++ files, compiled directly by the host C++ compiler
- CUDA files (anything that defines or launches kernels), compiled by nvcc

This lets you still enjoy faster compile times and all the available language features in the .cpp files, and confines the drawbacks of nvcc to a minimal subset of the project.
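
As a concrete sketch of that split (hypothetical saxpy names, not anything from axom or serac): the .cu file owns the kernel, and the plain .cpp translation units only ever see an ordinary C++ declaration.

// saxpy.hpp -- included by regular .cpp files, no CUDA constructs
void saxpy(int n, float a, const float* x, float* y);

// saxpy.cu -- the only translation unit that needs nvcc / CUDA language mode
__global__ void saxpy_kernel(int n, float a, const float* x, float* y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { y[i] = a * x[i] + y[i]; }
}

// x and y are assumed to be device pointers in this sketch
void saxpy(int n, float a, const float* x, float* y)
{
  saxpy_kernel<<<(n + 255) / 256, 256>>>(n, a, x, y);
}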


We're 90ish% of the way to having a working clang cuda build for the raja suite btw, which might give some different/useful errors for this if you'd like to take a crack at it.

👍

trws commented 2 years ago

links it properly and emits the appropriate flags (-rdc=true). If I remember correctly, using CMake's "object library" feature led to issues with separable compilation.

Yes, object libraries are explicitly not supported by CMake's separable-compilation handling. I don't know why, since I would have expected that to be the easiest case to make work, but their docs explicitly call out that the combination can't work.

I didn't even know that was valid syntax. This seems like a dangerous convention to me, because those annotations impose constraints on what you can and can't do in the function, and omitting the annotations from the definition obfuscates those requirements.

FWIW neither did I. It's a really, really dangerous thing to do, especially for the cuda compilers other than NVCC that actually take H/D annotations into account for overload resolution. That would likely be a really good thing to fix, and maybe even for us to find a way to lint on for codes.
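
For anyone unfamiliar with that behavior, a small sketch of what attribute-based overload resolution looks like (hypothetical names; clang's CUDA support accepts this, where the two definitions share a C++ signature and differ only in execution space):

#include <cmath>

// clang selects between these based on whether the caller is host or device
// code; the device version uses the CUDA reciprocal-square-root intrinsic
__host__   float rsqrt_approx(float x) { return 1.0f / std::sqrt(x); }
__device__ float rsqrt_approx(float x) { return __frsqrt_rn(x); }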

However, my point about not just compiling everything with nvcc still stands: there are significant drawbacks to using nvcc to compile regular C++ files (limited features, more compiler bugs, considerably longer compilation time).

There's an additional wrinkle here that may not be obvious. Compiling with nvcc isn't actually a problem, and doesn't cause these issues. If you pass nvcc a regular c++ file with no cuda in it, and without forcing cuda language mode, it passes it through to the native compiler unmodified and with almost no overhead. In other words, calling nvcc on everything is fine as long as we don't force everything to be compiled in cuda language mode. That's something we should work on a way to express so that the BLT and app code can handle the case cleanly, because you're absolutely right that we waste some compilation time and potentially lose features for the lack of it.

In the meantime, the only way I know to deal with this in a project using BLT is to compile some files with a blt_add_library that doesn't use the cuda dependency, probably switching to cuda_runtime, then it should compose ok and compile more efficiently. The equivalent switch works for hip compilation as well, switching the -x hip on or off appropriately.

samuelpmishLLNL commented 2 years ago

There's an additional wrinkle here that may not be obvious. Compiling with nvcc isn't actually a problem, and doesn't cause these issues. If you pass nvcc a regular c++ file with no cuda in it, and without forcing cuda language mode, it passes it through to the native compiler unmodified and with almost no overhead. In other words, calling nvcc on everything is fine as long as we don't force everything to be compiled in cuda language mode.

I agree with this for the most part, but the overhead

sam@provolone:~/code/cuda_sandbox$ cat main.cpp 
#include <iostream>
int main () { std::cout << "hello world" << std::endl; }
sam@provolone:~/code/cuda_sandbox$ time g++ main.cpp
real    0m0.217s
sam@provolone:~/code/cuda_sandbox$ time nvcc main.cpp
real    0m0.277s
sam@provolone:~/code/cuda_sandbox$ time nvcc main.cpp -x cu
real    0m0.948s

and feature limitations

sam@provolone:~/code/cuda_sandbox$ /usr/bin/g++ main.cpp -std=c++2a
sam@provolone:~/code/cuda_sandbox$ nvcc main.cpp -x c++ -ccbin /usr/bin/g++ -std=c++2a
nvcc fatal   : Value 'c++2a' is not defined for option 'std'

still exist when passing C++ files through nvcc rather than going directly to the C++ compiler (although that is still much better than treating C++ files as CUDA files).


In the meantime, the only way I know to deal with this in a project using BLT is to compile some files with a blt_add_library that doesn't use the cuda dependency, probably switching to cuda_runtime, then it should compose ok and compile more efficiently. The equivalent switch works for hip compilation as well, switching the -x hip on or off appropriately.

Why not just do

add_library(my_lib STATIC "foo.cpp" "bar.cu")

CMake already handles all of this automatically if you use file extensions to differentiate between the C++ and CUDA languages (e.g. foo.cpp is processed by the CXX compiler and bar.cu by nvcc). And when the project is built without CUDA support, you can just grab all the .cu files and pretend they're .cpp instead[^1]:

if (ENABLE_CUDA)
  enable_language(CUDA)
else()
  file(GLOB_RECURSE cu_files "${PROJECT_SOURCE_DIR}/*.cu")
  set_source_files_properties(${cu_files} PROPERTIES LANGUAGE CXX)
endif()

Just because a project uses BLT doesn't mean it isn't allowed to use the built-in CMake commands, right?

[^1]: any kernel definitions in those .cu files would need to be hidden behind conditional compilation
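
A rough sketch of what that footnote means in practice (hypothetical names, guarded on __CUDACC__, which nvcc defines when compiling CUDA sources): when the file is instead compiled as plain C++ via the LANGUAGE CXX trick above, the kernel definition and its launch are compiled out, leaving a host fallback.

#if defined(__CUDACC__)
__global__ void fill_kernel(int n, double value, double* data)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { data[i] = value; }
}
#endif

void fill(int n, double value, double* data)
{
#if defined(__CUDACC__)
  // data is assumed to be a device pointer in this branch of the sketch
  fill_kernel<<<(n + 255) / 256, 256>>>(n, value, data);
#else
  for (int i = 0; i < n; ++i) { data[i] = value; }
#endif
}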

trws commented 2 years ago

Yes, there is a small overhead, but the 0.06 seconds is completely lost in a large build, and the feature limitations are purely a matter of flags, which can be dealt with by passing the flag through to the host compiler: -Xcompiler -std=c++2a.

As to the rest, if you're using the raw CMake features with a new enough CMake, feel free to do that; it should work just fine. Avoiding BLT can introduce inconsistencies, though, because not all flags and dependencies are applied through the global configuration variables: CUDA and CXX flags can be specified separately and differently, and some are applied directly by BLT's macros. For CXX it shouldn't cause issues, but for CUDA or other device-dependent targets it might give you inconsistent dependencies, flags, or even compilers, and I'm not sure it's worth dealing with that manually.

cyrush commented 2 years ago

It seems strange to use the .cu extension for device code, because .cu is widely known as an NVIDIA-specific extension, and CUDA isn't the only programming model in use.

Even if you tell CMake it's in fact only CXX for the case where you don't have CUDA enabled, it's confusing when other programming models are involved.