Closed makortel closed 1 year ago
assign core, heterogeneous
New categories assigned: heterogeneous,core
@fwyzard,@Dr15Jones,@smuzaffar,@makortel,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new Issue was created by @makortel Matti Kortelainen.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
Found this
- If you use dlopen to explicitly load code from a shared library, you must do several things. First, export global symbols from the executable by linking it with the "-E" flag (you will have to specify this as "-Wl,-E" if you are invoking the linker in the usual manner from the compiler driver, g++). You must also make the external symbols in the loaded library available for subsequent libraries by providing the RTLD_GLOBAL flag to dlopen. The symbol resolution can be immediate or lazy.
Template instantiations are another, user visible, case of objects with vague linkage, which needs similar resolution. If you do not take the above precautions, you may discover that a template instantiation with the same argument list, but instantiated in multiple translation units, has several addresses, depending in which translation unit the address is taken. (This is not an exhaustive list of the kind of objects which have vague linkage and are expected to be resolved during linking & loading.)
from gcc documentation https://gcc.gnu.org/faq.html#dso. Also two open tickets that might be connected
In this case in both code paths leading to the CachingAllocator
the underlying shared objects (pluginHeterogeneousCoreAlpakaServicesPluginsCudaAsync.so
, pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
) are loaded with dlopen()
. The dlopen()
is called in
https://github.com/cms-sw/cmssw/blob/c81258cf188a2ee7e3e86abec514a72e11348f0b/FWCore/PluginManager/src/SharedLibrary.cc#L34-L35
I was surprised to see that without setting SCRAM_ARCH
on lxplus-gpu
the el8_amd64_gcc10
gets picked up, and running it actually works. With el8_amd64_gcc11
I see the same behavior as in the issue description.
So, it seems to be an issue with GCC 11 ?
(not saying that we should not look for a solution, just asking if you think it's something that was working with GCC 10 and broke with GCC 11)
We see a low rate of segfaults that look to be dlopen related, so it seems at least plausible that this has been around for a while.
I wanted to see the behavior with the serial_sync
backend (to see if gcc
or nvcc
in the link would make any difference). I hacked this piece
https://github.com/cms-sw/cmssw/blob/ed920883f8da116c2aec30e08f46e43c00851989/HeterogeneousCore/AlpakaInterface/interface/CachedBufAlloc.h#L26-L35
to actually use the caching allocator, and see similar behavior (i.e. I see two allocator instances created from AlpakaService
, one in getHostCachingAllocator()
and another in getDeviceCachingAllocator()
; and additional one via the PortableHostCollection
constructor in the EDModule). This test case does not segfault though. I suppose the difference wrt. the cuda_async
backend is that in the CUDA case the CUDAService
destructor resets the CUDA runtime, which will cause the cudaEventDestroy()
to crash.
(this test was done with el9_amd64_gcc11
)
So, it seems to be an issue with GCC 11 ?
(not saying that we should not look for a solution, just asking if you think it's something that was working with GCC 10 and broke with GCC 11)
As far as I can tell this behavior seems specific to GCC 11. On the same lxplus-gpu
node, using el8_amd64_gcc10
works, whereas el8_amd64_gcc11
and el9_amd64_gcc11
crash.
- First, export global symbols from the executable by linking it with the "-E" flag (you will have to specify this as "-Wl,-E" if you are invoking the linker in the usual manner from the compiler driver, g++). You must also make the external symbols in the loaded library available for subsequent libraries by providing the RTLD_GLOBAL flag to dlopen.
I think we are doing both:
$ scram tool info gcc-cxxcompiler | grep Wl
CXXSHAREDFLAGS+=-shared -Wl,-E
LDFLAGS+=-Wl,-E -Wl,--hash-style=gnu
One thing I noticed while investigating this issue was that the https://github.com/cms-sw/cmssw/blob/ed920883f8da116c2aec30e08f46e43c00851989/HeterogeneousCore/AlpakaTest/test/BuildFile.xml#L1 test is not run in GPU_X IBs. @smuzaffar what should be done in order to get it run in GPU_X IBs? (although that alone would not have helped to discover this issue because GPU_X IBs are run only for gcc10
)
I also noticed we are not running IBs for *_ppc64le_gcc11
(which might have revealed this issue as well)
are there also two copies of cms::alpakatools::host()::host
?
[innocent@lxplus8s12 CMSSW_12_6_X_2022-10-20-1100]$ nm -C /cvmfs/cms-ib.cern.ch/nweek-02755/el8_amd64_gcc10/cms/cmssw/CMSSW_12_6_X_2022-10-17-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
vs
[innocent@lxplus9s00 ~]$ nm -C /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'allocator$'
0000000000016380 b guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
00000000000163a0 b cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
'u' is good
"u"
The symbol is a unique global symbol. This is a GNU extension to the standard set of ELF symbol bindings. For such a symbol the dynamic linker will make sure that in the entire process there is just one symbol with this name and type in use.
b is bad (for this case)
while
[innocent@lxplus8s12 CMSSW_12_6_X_2022-10-20-1100]$ nm -C /cvmfs/cms-ib.cern.ch/nweek-02755/el8_amd64_gcc10/cms/cmssw/CMSSW_12_6_X_2022-10-17-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'host$'
0000000000016370 b guard variable for cms::alpakatools::host()::host
0000000000016380 b cms::alpakatools::host()::host
vs
[innocent@lxplus9s00 ~]$ nm -C /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'host$'
0000000000016480 b guard variable for cms::alpakatools::host()::host
0000000000016490 b cms::alpakatools::host()::host
being 'b' and lower case I do expect two copies of this one (local instance)
@smuzaffar what should be done in order to get it run in GPU_X IBs? (although that alone would not have helped to discover this issue because GPU_X IBs are run only for gcc10)
Currently, for GPU builds we only run tests for packages which has direct cuda
dependency. We can also add package with direct alpaka
dependency too
I also noticed we are not running IBs for *_ppc64le_gcc11 (which might have revealed this issue as well)
We only have two ppc64le
nodes which are already overloaded due to various IBs/Release and PR tests. So I am afraid we can not add ppc64le_gcc11
@smuzaffar what should be done in order to get it run in GPU_X IBs? (although that alone would not have helped to discover this issue because GPU_X IBs are run only for gcc10)
Currently, for GPU builds we only run tests for packages which has direct
cuda
dependency. We can also add package with directalpaka
dependency too
Thanks. In this case the HeterogeneousCore/AlpakaTest/test/BuildFile.xml
does not declare any dependencies (because the runs a script that runs cmsRun
). Should the <test/>
be made dependent on alpaka
? Or would an overall dependence on alpaka
be sufficient?
There is also
https://github.com/cms-sw/cmssw/blob/42baf3c5016a1f79556bd29b50141a13d7c65b98/HeterogeneousCore/AlpakaInterface/test/BuildFile.xml#L1-L6
where an executable is compiled for each Alpaka backend. For now running the resulting alpakaTestVecCudaAsync
and alpakaTestVecSerialSync
on a non-GPU machine or on a GPU machine works (because no device code gets run). But maybe it would make more sense to run the alpakaTestVecSerialSync
only in non-GPU IB (or maybe in both?) and alpakaTestVecCudaAsync
only in GPU IB?
Or would the whole GPU unit test setup benefit from some generalizations (thinking towards adding support for AMD GPUs etc)?
I also noticed we are not running IBs for *_ppc64le_gcc11 (which might have revealed this issue as well)
We only have two
ppc64le
nodes which are already overloaded due to various IBs/Release and PR tests. So I am afraid we addppc64le_gcc11
yet
Yeah, makes sense.
Thanks @VinInn.
being 'b' and lower case I do expect two copies of this one (local instance)
Indeed, inserting a printout in cms::alpakatools::detail::enumerate_host()
shows multiple printouts (3 in my test, which I presume are from libDataFormatsPortableTestObjectsSerialSync.so
, libHeterogeneousCoreAlpakaServicesSerialSync.so
, pluginHeterogeneousCoreAlpakaTestPluginsPortableSerialSync.so
). And this happens also with gcc 10 (as your nm
outputs pointed out).
@makortel , direct dependency with in the package ( either in Package/BuildFile.xml
or Package/{plugins,test,bin}/BuildFile.xml
) is good enough for bot to check out that package and run all its tests. Note that we run all tests for these package (just in case if there are tests which process the output of some GPU based test).
looking to the code I would have not expected multiple copies of "host" either. somehow those functions get a "hidden" linking. Maybe we should try to write a small demonstrator.
OK for host()
is obvious: It is declared static
that in old good C means local to the compilation unit!
OK for
host()
is obvious: It is declaredstatic
that in old good C means local to the compilation unit!
Good catch, thanks! @fwyzard, do you have any objections for removing the static
from
https://github.com/cms-sw/cmssw/blob/c81258cf188a2ee7e3e86abec514a72e11348f0b/HeterogeneousCore/AlpakaInterface/interface/host.h#L24
(I believe the replication was not intentional)
You're right, I'll make a PR.
there is a gcc 11.3 out. maybe would be worth to test it to see if this issue has been fixed
spent the afternoon in some monkey typing. Eventually
[innocent@lxplus9s00 src]$ git diff
diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c0..8f0a669b7e7 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -83,8 +83,8 @@ namespace cms::alpakatools {
*/
template <typename TDev,
- typename TQueue,
- typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+ typename TQueue> //,
+// typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
class CachingAllocator {
public:
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
[innocent@lxplus9s00 src]$ nm -C ../lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
do not ask. It is fact (and a bug in the compiler, probably)
more
diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c0..af3efa974a5 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -84,7 +84,7 @@ namespace cms::alpakatools {
template <typename TDev,
typename TQueue,
- typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+ typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> >> // and cms::alpakatools::is_queue_v<TQueue>>>
class CachingAllocator {
public:
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
[innocent@lxplus9s00 src]$ nm -C ../lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c0..3faf423b865 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -84,7 +84,9 @@ namespace cms::alpakatools {
template <typename TDev,
typename TQueue,
- typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+ typename = std::enable_if_t<
+ // cms::alpakatools::is_device_v<TDev> >> and
+ cms::alpakatools::is_queue_v<TQueue>>>
class CachingAllocator {
public:
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
[innocent@lxplus9s00 src]$ nm -C ../lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
This is ok as well
diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c0..2404d639246 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -82,9 +82,15 @@ namespace cms::alpakatools {
* - the `Queue` type can be either `Sync` _or_ `Async` on any allocation.
*/
+ template <typename TDev,
+ typename TQueue>
+ constexpr bool isGood = cms::alpakatools::is_device_v<TDev> and
+ cms::alpakatools::is_queue_v<TQueue>;
template <typename TDev,
typename TQueue,
- typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+ typename = std::enable_if_t< isGood<TDev,TQueue>>>
+ // cms::alpakatools::is_device_v<TDev> >> and
+ // cms::alpakatools::is_queue_v<TQueue>>>
class CachingAllocator {
public:
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
[innocent@lxplus9s00 src]$ nm -C ../lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
btw the default compiler is
gcc version 11.3.1 20220421 (Red Hat 11.3.1-2) (GCC)
This also works:
diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c01..a125c98996b8 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -84,7 +84,8 @@ namespace cms::alpakatools {
template <typename TDev,
typename TQueue,
- typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+ typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev>>,
+ typename = std::enable_if_t<cms::alpakatools::is_queue_v<TQueue>>>
class CachingAllocator {
public:
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
on el8_amd64_gcc11
:
$ nm -C ../lib/el8_amd64_gcc11/lib*.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000011380 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>, void>()::allocator
00000000000113a0 u cms::alpakatools::getHostCachingAllocator<alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>, void>()::allocator
and cmsRun HeterogeneousCore/AlpakaTest/test/writer.py
works fine.
This also works:
diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c01..d8a25cce69d0 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -84,7 +84,7 @@ namespace cms::alpakatools {
template <typename TDev,
typename TQueue,
- typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+ typename = std::enable_if_t<std::is_object_v<TDev> and std::is_object_v<TQueue>>>
class CachingAllocator {
public:
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
(it's not a useful implementation, it's just a check)
also this works (and IMHO , as argued in the above issue, I find it also a more adequate syntax)
[innocent@lxplus9s00 src]$ git diff
diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c0..551bc948853 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -83,9 +83,9 @@ namespace cms::alpakatools {
*/
template <typename TDev,
- typename TQueue,
- typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+ typename TQueue>
class CachingAllocator {
+ static_assert(cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>);
public:
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
friend class alpaka_cuda_async::AlpakaService;
[innocent@lxplus9s00 src]$ nm -C ../lib/el9_amd64_gcc11/lib*.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000011380 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>, void>()::allocator
00000000000113a0 u cms::alpakatools::getHostCachingAllocator<alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>, void>()::allocator
0000000000016458 u guard variable for cms::cuda::allocator::getCachingHostAllocator()::allocator
0000000000016340 u guard variable for cms::cuda::allocator::getCachingDeviceAllocator()::allocator
0000000000016460 u cms::cuda::allocator::getCachingHostAllocator()::allocator
0000000000016360 u cms::cuda::allocator::getCachingDeviceAllocator()::allocator
But this would give a compile time error instead of failing to match the template parameters (OK, which would also give a different compile time error in this case).
By the way, a curious observation: in my tests getCachingDeviceAllocator()::allocator
is always compiled correctly (in terms of what we want it to do), only getCachingHostAllocator()::allocator
shows the different behaviour in GCC 11 vs GCC 10.
Ah, and just to exclude a possible culprit, I think the assembler and linker are not at fault here, the difference is there already in the GCC assembly output:
.weak _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator
.section .bss._ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator,"awG",@nobits,_ZZN3cms11a
.align 32
.type _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator, @gnu_unique_object
.size _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator, 224
_ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator:
.zero 224
.local _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator
.comm _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator,224,32
yep. To report a bug (regression) one needs to reproduce it in a simpler example. I've failed until now.
btw: why such a complex machinery to construct and destruct a bunch of allocators? (in HeterogeneousCore/AlpakaInterface/interface/getDeviceCachingAllocator.h) a simple std::vector would not have been enough?
OK, I think I found the source of the problem. In HeterogeneousCore/AlpakaInterface/interface/traits.h
I used
template <typename T>
constexpr bool is_device_v = is_device<T>::value;
instead of
template <typename T>
inline constexpr bool is_device_v = is_device<T>::value;
So, as explained on StackOverflow, is_device_v
is a local variable, and it looks like recent versions of GCC make the template specialisation based on it local as well.
Adding the missing inline
to the declarations fixes the problem.
Fixed by #39826.
btw: why such a complex machinery to construct and destruct a bunch of allocators? (in HeterogeneousCore/AlpakaInterface/interface/getDeviceCachingAllocator.h) a simple std::vector would not have been enough?
Because a CachingAllocator
does not have a default constructor, and is not copyable or movable.
Doing make_unique<Allocator[]>(size)
does not work either, because of the lack of a default constructor.
I think a std::vector<std::unique_ptr<Allocator>>
should work.
Adding the missing
inline
to the declarations fixes the problem.
Ok, it may make sense. Not easy to reproduce with a small example anyhow.
Ok, it may make sense. Not easy to reproduce with a small example anyhow.
If you want to play with it, this is the reproducer I manage to cook:
test.cc
#include <type_traits>
// concepts
namespace concepts {
//! Tag used in class inheritance hierarchies that describes that a specific concept (TConcept)
//! is implemented by the given base class (TBase).
template <typename TConcept, typename TBase>
struct Implements {};
//! Checks whether the concept is implemented by the given class
template <typename TConcept, typename TDerived>
struct ImplementsConcept {
template <typename TBase>
static auto implements(Implements<TConcept, TBase>&) -> std::true_type;
static auto implements(...) -> std::false_type;
static constexpr auto value = decltype(implements(std::declval<TDerived&>()))::value;
};
} // namespace concepts
struct ConceptDev;
struct ConceptQueue;
// traits
template <typename T>
using is_device = concepts::ImplementsConcept<ConceptDev, T>;
template <typename T>
/* inline */ constexpr bool is_device_v = is_device<T>::value;
template <typename T>
using is_queue = concepts::ImplementsConcept<ConceptQueue, T>;
template <typename T>
/* inline */ constexpr bool is_queue_v = is_queue<T>::value;
// host
class DevCpu : public concepts::Implements<ConceptDev, DevCpu> {
private:
void* impl = nullptr;
};
// queue
class QueueCpu : public concepts::Implements<ConceptQueue, QueueCpu> {
private:
void* impl = nullptr;
};
// allocator
template <typename TDev, typename TQueue, typename = std::enable_if_t<is_device_v<TDev> and is_queue_v<TQueue>>>
//template <typename TDev, typename TQueue, typename = std::enable_if_t<concepts::ImplementsConcept<ConceptDev, TDev>::value and concepts::ImplementsConcept<ConceptQueue, TQueue>::value>>
class Allocator {
public:
Allocator(TDev const& dev) : dev{dev} {}
private:
TDev dev;
};
// access the allocator
template <typename TQueue, typename = std::enable_if_t<is_queue_v<TQueue>>>
//template <typename TQueue, typename = std::enable_if_t<concepts::ImplementsConcept<ConceptQueue, TQueue>::value>>
inline Allocator<DevCpu, TQueue>& getHostAllocator() {
static Allocator<DevCpu, TQueue> allocator{DevCpu{}};
return allocator;
}
void test() { getHostAllocator<QueueCpu>(); }
Makefile
.PHONY: all dump clean
all: dump
dump: dump.gcc10 dump.gcc11 dump.gcc12 dump.clang12
clean:
rm -f test.gcc* test.clang*
CXXFLAGS := -std=c++17 -O2 -ftree-vectorize -msse3 -fPIC -pthread
# gcc 10
test.gcc10.ii: test.cc
g++-10 $(CXXFLAGS) $< -E -o $@
test.gcc10.s: test.gcc10.ii
g++-10 $(CXXFLAGS) $< -S -o $@
test.gcc10.o: test.gcc10.s
g++-10 $(CXXFLAGS) $< -c -o $@
dump.gcc10: test.gcc10.o
nm -C $< | grep 'allocator$$' | grep --color -w '[[:alnum:]]'
# gcc 11
test.gcc11.ii: test.cc
g++-11 $(CXXFLAGS) $< -E -o $@
test.gcc11.s: test.gcc11.ii
g++-11 $(CXXFLAGS) $< -S -o $@
test.gcc11.o: test.gcc11.s
g++-11 $(CXXFLAGS) $< -c -o $@
dump.gcc11: test.gcc11.o
nm -C $< | grep 'allocator$$' | grep --color -w '[[:alnum:]]'
# gcc 12
test.gcc12.ii: test.cc
g++-12 $(CXXFLAGS) $< -E -o $@
test.gcc12.s: test.gcc12.ii
g++-12 $(CXXFLAGS) $< -S -o $@
test.gcc12.o: test.gcc12.s
g++-12 $(CXXFLAGS) $< -c -o $@
dump.gcc12: test.gcc12.o
nm -C $< | grep 'allocator$$' | grep --color -w '[[:alnum:]]'
# clang 12
test.clang12.ii: test.cc
clang++-12 -Wno-unused-command-line-argument $(CXXFLAGS) $< -E -o $@
test.clang12.s: test.clang12.ii
clang++-12 -Wno-unused-command-line-argument $(CXXFLAGS) $< -S -o $@
test.clang12.o: test.clang12.s
clang++-12 -Wno-unused-command-line-argument $(CXXFLAGS) $< -c -o $@
dump.clang12: test.clang12.o
nm -C $< | grep 'allocator$$' | grep --color -w '[[:alnum:]]'
By the way, clang generates a V
weak object, possibly with a default value (both with and without the inline
).
Thanks. Useful to have a self contained example.
I noticed something peculiar on
lxplus-gpu
that is nowadayscs9
. It can be reproduced with (e.g.)which results in
The stack trace of the segfault is
i.e. from
cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::QueueCudaRtNonBlocking>, void>::freeAllCached()
, but called from theCachingAllocator
destructor rather than from https://github.com/cms-sw/cmssw/blob/ed4c9be228a8f850f230246336fdc094530ae9d8/HeterogeneousCore/AlpakaServices/src/alpaka/AlpakaService.cc#L80-L82With gdb I saw that the job has actually two
CachingAllocator
objects alive. Setting a breakpoint in theCachingAllocator
constructor showsi.e. it ends up being constructed the first time from https://github.com/cms-sw/cmssw/blob/ed4c9be228a8f850f230246336fdc094530ae9d8/HeterogeneousCore/AlpakaServices/src/alpaka/AlpakaService.cc#L75 and the second time from https://github.com/cms-sw/cmssw/blob/c81258cf188a2ee7e3e86abec514a72e11348f0b/HeterogeneousCore/AlpakaInterface/interface/CachedBufAlloc.h#L48 Both call https://github.com/cms-sw/cmssw/blob/c81258cf188a2ee7e3e86abec514a72e11348f0b/HeterogeneousCore/AlpakaInterface/interface/getHostCachingAllocator.h#L14-L16 which should guarantee only a single object to be constructed