Closed slava77 closed 8 months ago
Note to self: needs to be rebased
Note to self: needs to be rebased
now rebased
If things still do not work, could you remind me what you see at the output, i.e. which part of the buffers doesn't get properly propagated? In this case, debugging will be needed as I see no smoking gun issue.
after rebase I have an error
[10] Calling method for module LSTProducer@alpaka/'lstProducer'
Exception Message:
A std::exception was thrown.
...
returned error : 'cudaErrorInvalidDevice': 'invalid device ordinal'!
which I don't know yet if it's related to my ES work or some mess up I have in the configuration
@makortel
I'm trying to debug a setup with dual external build (lst_gpu
and lst_cuda
). The job crashes in some not very descriptive place in the framework (Tracer says it's right after ++ starting: begin job
and igtrace backtrace is below.
Do you have a suggestion how to debug further? I'm guessing I need some finer tracer details.
This is 13_3_0_pre3 with LST extras
06-Dec-2023 07:37:09.45 PST ++ starting: begin job
A fatal system signal has occurred: segmentation violation
#5 0x00007fa67c475332 in __strcmp_avx2 () from /lib64/libc.so.6
#6 0x00007fa67ef031ed in __gnu_cxx::__normal_iterator<edm::eventsetup::EventSetupRecordKey const*, ...
#7 0x00007fa67ede4f1c in edm::eventsetup::ESRecordsToProductResolverIndices::indexInRecord
#8 0x00007fa67edde357 in edm::EDConsumerBase::updateLookup
#9 0x00007fa67eef2246 in edm::stream::ProducingModuleAdaptorBase<edm::stream::EDProducerBase>::updateLookup
#10 0x00007fa67eece8b6 in edm::WorkerManager::beginJob
the commit range that introduces the dual cpu/cuda build and also the ES details are in https://github.com/SegmentLinking/cmssw/compare/a0caa0f45fec1fbc5cb6889ebe19d46eca379e51...cbfc1aa6351dbecafb830ffd5bddcec0362dad53
@slava77 Does the crash happen when running on CPU, or when running on GPU?
I also assume that lst_cpu
and lst_cuda
contain same symbols, i.e. only one of them can be loaded into a process. Is this correct?
Is lst_headers
just providing the headers (that have the same content in both lst_cpu
and lst_cuda
)?
@slava77 Does the crash happen when running on CPU, or when running on GPU?
it's a default alpaka for LST modules, so I guess it's GPU.
I also assume that
lst_cpu
andlst_cuda
contain same symbols, i.e. only one of them can be loaded into a process. Is this correct?Is
lst_headers
just providing the headers (that have the same content in bothlst_cpu
andlst_cuda
)?
lst_headers
package does not provide a library
<tool name="lst_headers" version="1.0">
<client>
<environment name="LSTBASE" default="$CMSSW_BASE/external/lst/TrackLooper"/>
<environment name="INCLUDE" default="$LSTBASE"/>
</client>
<runtime name="LST_BASE" value="$LSTBASE"/>
</tool>
I also assume that
lst_cpu
andlst_cuda
contain same symbols, i.e. only one of them can be loaded into a process. Is this correct?
I didn't check. The alpaka kernels will be different symbols, but yes, there can be the same symbols for non-dev-templated code
Recall that the code without my ES-related commits (up to 72edb8034a0feb84c43f35eb8af3b2974856210d in the diff runs
with debug symbols added I found that the crash is in SeedCreatorFromRegionHitsEDProducerT<SeedFromConsecutiveHitsCreator>
during calling edm::EDConsumerBase::updateLookup
In it, the crash is m_esTokenInfo
in it's first element
m_record = {type_ = {
<edm::TypeIDBase> = {t_ = 0x7fff1edb2fa0},
m_name = 0x8e44 <error: Cannot access memory at address 0x8e44>}},
m_key = {type_ = {<edm::TypeIDBase> = {
t_ = 0x7fff94c048a8 <typeinfo for TrackerGeometry>},
m_name = 0x7fff94c00066 "TrackerGeometry"}
So, oddly enough the ES record detail gets corrupted and it's in a non-LST module.
I could guess though that it's in the TrackerRecoGeometryRecord
(the new ES consumed in the LSTProducer.cc).
Where is the place upstream to check the origin of corruption?
Thanks @slava77. Going with the hypothesis the cause would be an ODR violation, I'd suggest the following in order to move forward (it's a little bit nasty, bit could be technically ok for now, given that the SDL::modulesBuffer<T>
ES products are an implementation detail of the LST plugins, and as long as no-one tries to run the CPU and GPU versions simultaneously in one job)
Replace RecoTracker/LST/src/ES_ModulesDev.cc
and RecoTracker/LST/src/alpaka/ES_ModulesDev.cc
with RecoTracker/LST/plugins/alpaka/ES_ModulesDev.cc
that contains
#include <SDL/Module.h>
#include "HeterogeneousCore/AlpakaCore/interface/alpaka/typelookup.h"
// Temporary hack: The DevHost instantiation is needed in both CPU and GPU plugins,
// whereas the (non-host-)Device instantiation only in the GPU plugin
TYPELOOKUP_DATA_REG(SDL::modulesBuffer<alpaka_common::DevHost>);
TYPELOOKUP_ALPAKA_TEMPLATED_DATA_REG(SDL::modulesBuffer);
Move the specialization of cms::alpakatools::CopyToDevice<SDL::modulesBuffer<alpaka_common::DevHost>>
to RecoTracker/LST/plugins/alpaka/LSTModulesDevESProducer.cc
and remove RecoTracker/LST/interface/LSTModulesDev.h
(or move the header to ../plugins/alpaka
directory).
Remove <use name="lst_cpu"/>
from RecoTrackerLSTPlugins
in RecoTracker/LST/plugins/BuildFile.xml
.
Then a higher level question: are you using ESProducer only for the purpose to copy the data to all devices, or is it foreseen for the ES product to be used (shared) by many EDModules?
Going with the hypothesis the cause would be an ODR violation, I'd suggest the following ... and as long as no-one tries to run the CPU and GPU versions simultaneously in one job
Don't we have a more substantive problem though:
libsdl_cpu.so and libsdl_cuda.so contain overlapping sets of symbols.
I can confirm by running a version before my changes that gdb info sharedlib
shows both libsdl_cpu.so and libsdl_cuda.so are loaded.
Or is this safe in case library dependencies for plugins are somehow isolated?
Going with the hypothesis the cause would be an ODR violation, I'd suggest the following ... and as long as no-one tries to run the CPU and GPU versions simultaneously in one job
Don't we have a more substantive problem though:
- pluginRecoTrackerLSTPlugins.so includes LSTOutputConverter.cc, LSTPixelSeedInputProducer.cc, LSTPhase2OTHitsInputProducer.cc and depends on libsdl_cpu.so
- pluginRecoTrackerLSTPluginsPortableCudaCudaAsync.so includes LSTModulesDevESProducer.cc, LSTProducer.cc and depends on libsdl_cuda.so
Good point, already this setup should be a problem. For time being (in order to make progress, but to be undoed in the future) it would be ok to move the LSTOutputConverter.cc
, LSTPixelSeedInputProducer.cc
, and LSTPhase2OTHitsInputProducer.cc
from plugins/
to plugins/alpaka/
.
libsdl_cpu.so and libsdl_cuda.so contain overlapping sets of symbols. I can confirm by running a version before my changes that gdb
info sharedlib
shows both libsdl_cpu.so and libsdl_cuda.so are loaded.
I forgot to ask earlier if these "same symbols" are
libsdl_cpu.so
and CUDA-specific code in case of libsdl_cuda.so
)I have been assuming the latter (and now I can't remember if it is just my assumption, or if we discussed about this detail in the hackathon).
Or is this safe in case library dependencies for plugins are somehow isolated?
No, library dependencies via plugins do not get any "extra safety". The library setup is somewhat different in that case though (like the loading order), so if the problem is really in ODR violation (which is undefined behavior), I could believe one setup to look like it works, and the other to fail in strange way.
I forgot to ask earlier if these "same symbols" are
* same symbols (functions) containing the same code, or * same symbols (functions) containing different code (e.g. CPU-specific code in case of `libsdl_cpu.so` and CUDA-specific code in case of `libsdl_cuda.so`)
both kind; there are quite a few helper methods that are independent of the accelerator type
@makortel
I see that ASAN_OPTIONS=detect_odr_violation=2
(or 1) could be used.
Even though now it's obvious that the problem is there, it could be something useful to weed out the issues.
Is there a way to run it with scram in the regular build or do I need to rebuild in the ASAN build?
Replace
RecoTracker/LST/src/ES_ModulesDev.cc
andRecoTracker/LST/src/alpaka/ES_ModulesDev.cc
withRecoTracker/LST/plugins/alpaka/ES_ModulesDev.cc
that contains#include <SDL/Module.h> #include "HeterogeneousCore/AlpakaCore/interface/alpaka/typelookup.h" // Temporary hack: The DevHost instantiation is needed in both CPU and GPU plugins, // whereas the (non-host-)Device instantiation only in the GPU plugin TYPELOOKUP_DATA_REG(SDL::modulesBuffer<alpaka_common::DevHost>); TYPELOOKUP_ALPAKA_TEMPLATED_DATA_REG(SDL::modulesBuffer);
this does not compile:
In file included from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaCore/interface/alpaka/ESDeviceProduct.h:8,
from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaCore/interface/alpaka/typelookup.h:5,
from $CMSSW_BASE/src/RecoTracker/LST/src/alpaka/ES_ModulesDev.cc:2:
/cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaInterface/interface/config.h:103:45: error: template argument 2 is invalid
103 | using Acc = alpaka::AccCpuSerial<TDim, Idx>;
| ^
@makortel I see that
ASAN_OPTIONS=detect_odr_violation=2
(or 1) could be used. Even though now it's obvious that the problem is there, it could be something useful to weed out the issues. Is there a way to run it with scram in the regular build or do I need to rebuild in the ASAN build?
I don't really know, but I'd guess the full ASAN build would be safer. I wonder if we already use detect_odr_violation
in the ASAN IB.
Replace
RecoTracker/LST/src/ES_ModulesDev.cc
andRecoTracker/LST/src/alpaka/ES_ModulesDev.cc
withRecoTracker/LST/plugins/alpaka/ES_ModulesDev.cc
that contains#include <SDL/Module.h> #include "HeterogeneousCore/AlpakaCore/interface/alpaka/typelookup.h" // Temporary hack: The DevHost instantiation is needed in both CPU and GPU plugins, // whereas the (non-host-)Device instantiation only in the GPU plugin TYPELOOKUP_DATA_REG(SDL::modulesBuffer<alpaka_common::DevHost>); TYPELOOKUP_ALPAKA_TEMPLATED_DATA_REG(SDL::modulesBuffer);
this does not compile:
In file included from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaCore/interface/alpaka/ESDeviceProduct.h:8, from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaCore/interface/alpaka/typelookup.h:5, from $CMSSW_BASE/src/RecoTracker/LST/src/alpaka/ES_ModulesDev.cc:2: /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaInterface/interface/config.h:103:45: error: template argument 2 is invalid 103 | using Acc = alpaka::AccCpuSerial<TDim, Idx>; | ^
Mmkay, the error looks strange, as the Idx
should be defined in the config.h
. Does the compilation log have any other failures?
I don't really know, but I'd guess the full ASAN build would be safer. I wonder if we already use
detect_odr_violation
in the ASAN IB.
https://github.com/cms-sw/cmssw-config/blob/7061c60a2606087a9f8ff456d3afc4bbc78a0fca/Projects/CMSSW/Self.xml#L57
detect_odr_violation=0
Mmkay, the error looks strange, as the
Idx
should be defined in theconfig.h
. Does the compilation log have any other failures?
sorry, I stopped on the first part of the error message:
/cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaInterface/interface/config.h:103:45: error: template argument 2 is invalid
103 | using Acc = alpaka::AccCpuSerial<TDim, Idx>;
| ^
/cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaInterface/interface/config.h:104:17: error: 'Acc' {aka 'alpaka::AccCpuSerial<std::integral_constant<long unsigned int, 3>, long unsigned int>'} is not a template
104 | using Acc1D = Acc<Dim1D>;
| ^~~
/cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaInterface/interface/config.h:105:17: error: 'Acc' {aka 'alpaka::AccCpuSerial<std::integral_constant<long unsigned int, 3>, long unsigned int>'} is not a template
105 | using Acc2D = Acc<Dim2D>;
| ^~~
/cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaInterface/interface/config.h:106:17: error: 'Acc' {aka 'alpaka::AccCpuSerial<std::integral_constant<long unsigned int, 3>, long unsigned int>'} is not a template
106 | using Acc3D = Acc<Dim3D>;
| ^~~
there is also a GPU variant of the same set (not included above)
Mmkay, the error looks strange, as the
Idx
should be defined in theconfig.h
. Does the compilation log have any other failures?sorry, I stopped on the first part of the error message:
Thanks, although the rest didn't really help. I tested the following
#include "HeterogeneousCore/AlpakaCore/interface/alpaka/typelookup.h"
template <typename Dev>
struct FooTest {
int foo;
};
TYPELOOKUP_DATA_REG(FooTest<alpaka_common::DevHost>);
TYPELOOKUP_ALPAKA_TEMPLATED_DATA_REG(FooTest);
as a HeterogeneousCore/AlpakaTest/plugins/alpaka/foo.cc
and that compiled fine in 13_3_0_pre3. Does SDL::modulesBuffer<T>
do something special?
I was able to reproduce the compilation error by adding using Idx = std::size_t;
before the #include ".../typelookup.h
(which is what is done in https://github.com/SegmentLinking/TrackLooper/blob/master/SDL/Constants.h#L25). The setup is also easy to reproduce in compiler explorer https://godbolt.org/z/heMbdTaqr.
I'd suggest to enclose those type aliases in SDL into a namespace. Or open an issue in CMSSW GitHub that the setup in config.h
is somewhat brittle. Or do both. Or reorganize the code in SDL such that only the modulesBuffer
gets exposed to CMSSW. Or reorder the includes so that SDL/Module.h
comes last.
I'd suggest to enclose those type aliases in SDL into a namespace.
I moved to namespace SDL
in 10fe919
I updated the PR description now that this seems to run
This is coupled with SegmentLinking/TrackLooper#355 where
SDL::modulesBuffer
for its const ES-like part was supplanted bySDL::modulesBufferES
, which is not owned by SDL Event. This allows to create an instance of amodulesBuffer
in CMSSW side as an ES product which can be consumed directly in the SDL code.