acts-project / vecmem

Vectorised data model base and helper classes.
https://acts-project.github.io/vecmem/
Mozilla Public License 2.0

Atomic Failure on Windows with CUDA, main branch (2024.08.03.) #288

Open krasznaa opened 1 month ago

krasznaa commented 1 month ago

Unfortunately #275 seems to have introduced a failure in a setup that's not too easy to test in the CI. 😦 While building the latest code on Windows, with CUDA available, I bump into:

...
[build]   tmpxft_000062a0_00000000-7_test_cuda_jagged_vector_view_kernels.cudafe1.cpp
[build]   tmpxft_0000606c_00000000-7_test_cuda_edm_kernels.cudafe1.cpp
[build] C:\Users\krasz\ATLAS\vecmem\vecmem\core\include\vecmem/memory/impl/device_atomic_ref.ipp(187): error C3861: '__memorder_vecmem_to_builtin': identifier not found [C:\Users\krasz\ATLAS\vecmem\vecmem\build\tests\cuda\vecmem_test_cuda.vcxproj]
[build]   C:\Users\krasz\ATLAS\vecmem\vecmem\core\include\vecmem/memory/impl/device_atomic_ref.ipp(187): note: the template instantiation context (the oldest one first) is
[build]   C:\Users\krasz\ATLAS\vecmem\vecmem\tests\cuda\../common/jagged_soa_container_helpers.hpp(21): note: see reference to class template instantiation 'vecmem::testing::jagged_soa_interface<vecmem::edm::device<vecmem::edm::schema<vecmem::edm::type::scalar<int>,vecmem::edm::type::vector<float>,vecmem::edm::type::jagged_vector<double>,vecmem::edm::type::scalar<float>,vecmem::edm::type::jagged_vector<int>,vecmem::edm::type::vector<int>>>>' being compiled
[build]   C:\Users\krasz\ATLAS\vecmem\vecmem\tests\common\jagged_soa_container.hpp(18): note: see reference to class template instantiation 'vecmem::edm::device<vecmem::edm::schema<vecmem::edm::type::scalar<int>,vecmem::edm::type::vector<float>,vecmem::edm::type::jagged_vector<double>,vecmem::edm::type::scalar<float>,vecmem::edm::type::jagged_vector<int>,vecmem::edm::type::vector<int>>>' being compiled
[build]   C:\Users\krasz\ATLAS\vecmem\vecmem\core\include\vecmem/edm/device.hpp(121): note: see reference to class template instantiation 'vecmem::tuple<int *,vecmem::device_vector<TYPE>,vecmem::jagged_device_vector<double>,float *,vecmem::jagged_device_vector<int>,vecmem::device_vector<int>>' being compiled
[build]           with
[build]           [
[build]               TYPE=float
[build]           ]
...
[build] C:\Users\krasz\ATLAS\vecmem\vecmem\core\include\vecmem/memory/impl/device_atomic_ref.ipp(294): error C3861: '__atomic_fetch_add': identifier not found [C:\Users\krasz\ATLAS\vecmem\vecmem\build\tests\cuda\vecmem_test_cuda.vcxproj]
[build] C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.6.targets(799,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvcc.exe"  --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.40.33807\bin\HostX64\x64" -x cu   -IC:\Users\krasz\ATLAS\vecmem\vecmem\build\core\CMakeFiles -IC:\Users\krasz\ATLAS\vecmem\vecmem\core\include -IC:\Users\krasz\ATLAS\vecmem\vecmem\build\cuda\CMakeFiles -IC:\Users\krasz\ATLAS\vecmem\vecmem\cuda\include -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" -I"C:\Users\krasz\ATLAS\vecmem\vecmem\build\_deps\googletest-src\googletest\include" -I"C:\Users\krasz\ATLAS\vecmem\vecmem\build\_deps\googletest-src\googletest" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include"     --keep-dir vecmem_test_cuda\x64\RelWithDebInfo  -maxrregcount=0   --machine 64 --compile -cudart static -std=c++14 --generate-code=arch=compute_52,code=[compute_52,sm_52] -Xcompiler="/EHsc -Zi -Ob1 /Zc:__cplusplus"   -D_WINDOWS -DNDEBUG -DVECMEM_DEBUG_MSG_LVL=0 -DVECMEM_SOURCE_DIR_LENGTH=35 -DVECMEM_HAVE_PMR_MEMORY_RESOURCE -D"CMAKE_INTDIR=\"RelWithDebInfo\"" -D_MBCS -DWIN32 -D_WINDOWS -DNDEBUG -DVECMEM_DEBUG_MSG_LVL=0 -DVECMEM_SOURCE_DIR_LENGTH=35 -DVECMEM_HAVE_PMR_MEMORY_RESOURCE -D"CMAKE_INTDIR=\"RelWithDebInfo\"" -Xcompiler "/EHsc /W4 /nologo /O2 /FS /Zi  /MD /GR" -Xcompiler "/Fdvecmem_test_cuda.dir\RelWithDebInfo\vc143.pdb" -o vecmem_test_cuda.dir\RelWithDebInfo\test_cuda_edm_kernels.obj "C:\Users\krasz\ATLAS\vecmem\vecmem\tests\cuda\test_cuda_edm_kernels.cu"" exited with code 2. [C:\Users\krasz\ATLAS\vecmem\vecmem\build\tests\cuda\vecmem_test_cuda.vcxproj]

So while nvcc is compiling test_cuda_edm_kernels.cu, the compiler fails to find the GCC-style `__atomic_*` builtins that vecmem::device_atomic_ref tries to use.

The preprocessor flags will need to be tweaked to fix this. 🤔

stephenswat commented 1 month ago

Hmm, I wonder if the following preprocessor condition fires on MSVC:

#elif defined(__clang__) || defined(__GNUC__) || defined(__GNUG__) || defined(__CUDACC__)

If so, that makes sense. I know that clang sets __GNUC__, so MSVC might do the same.

All other uses of both __memorder_vecmem_to_builtin and __atomic_fetch_add are guarded by the VECMEM_HAVE_MEMORDER_DEFINITIONS and VECMEM_HAVE_BUILTIN_ATOMIC_FETCH_ADD macros, which should be defined correctly... 🤔

stephenswat commented 1 month ago

Hmm, no it doesn't: https://godbolt.org/z/o65TPMGPa

stephenswat commented 1 month ago

Okay, that's super weird. I'm not really familiar with how CUDA compilation works with MSVC, but the line indicated in the error (187) is the following:

#elif defined(VECMEM_HAVE_BUILTIN_ATOMIC_LOAD) && \
    defined(VECMEM_HAVE_MEMORDER_DEFINITIONS)
    return __atomic_load_n(m_ptr, __memorder_vecmem_to_builtin(order));

where VECMEM_HAVE_BUILTIN_ATOMIC_LOAD is defined iff:

#if defined __has_builtin
#if __has_builtin(__atomic_load_n)
#define VECMEM_HAVE_BUILTIN_ATOMIC_LOAD
#endif
#endif

and VECMEM_HAVE_MEMORDER_DEFINITIONS is defined at the same time as the __memorder_vecmem_to_builtin function:

#if defined(__ATOMIC_RELAXED) && defined(__ATOMIC_CONSUME) && \
    defined(__ATOMIC_ACQUIRE) && defined(__ATOMIC_RELEASE) && \
    defined(__ATOMIC_ACQ_REL) && defined(__ATOMIC_SEQ_CST)
constexpr int __memorder_vecmem_to_builtin(memory_order o) {
    switch (o) {
        case memory_order::relaxed:
            return __ATOMIC_RELAXED;
        case memory_order::consume:
            return __ATOMIC_CONSUME;
        case memory_order::acquire:
            return __ATOMIC_ACQUIRE;
        case memory_order::release:
            return __ATOMIC_RELEASE;
        case memory_order::acq_rel:
            return __ATOMIC_ACQ_REL;
        case memory_order::seq_cst:
            return __ATOMIC_SEQ_CST;
        default:
            assert(false);
            return 0;
    }
}
#define VECMEM_HAVE_MEMORDER_DEFINITIONS
#endif

Does MSVC + CUDA work differently than it does on Linux, where the nvcc compiler driver causes the preprocessor conditions to be evaluated twice: once for the host code and once for the device code? If that behaviour differs somehow, it could explain the problem.

stephenswat commented 1 month ago

Nevermind, I think I know what's going on.