SegmentLinking / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1 stars 2 forks source link

SDL::modulesBuffer as an ES product #16

Closed slava77 closed 8 months ago

slava77 commented 11 months ago

This is coupled with SegmentLinking/TrackLooper#355 where SDL::modulesBuffer for its const ES-like part was supplanted by SDL::modulesBufferES, which is not owned by SDL Event. This allows to create an instance of a modulesBuffer in CMSSW side as an ES product which can be consumed directly in the SDL code.

slava77 commented 11 months ago

Note to self: needs to be rebased

slava77 commented 10 months ago

Note to self: needs to be rebased

now rebased

slava77 commented 10 months ago

If things still do not work, could you remind me what you see at the output, i.e. which part of the buffers doesn't get properly propagated? In this case, debugging will be needed as I see no smoking gun issue.

after rebase I have an error

[10] Calling method for module LSTProducer@alpaka/'lstProducer'
Exception Message:
A std::exception was thrown.
...
returned error  : 'cudaErrorInvalidDevice': 'invalid device ordinal'!

which I don't know yet if it's related to my ES work or some mess up I have in the configuration

slava77 commented 9 months ago

@makortel

I'm trying to debug a setup with dual external build (lst_gpu and lst_cuda). The job crashes in some not very descriptive place in the framework (Tracer says it's right after ++ starting: begin job and igtrace backtrace is below. Do you have a suggestion how to debug further? I'm guessing I need some finer tracer details. This is 13_3_0_pre3 with LST extras

06-Dec-2023 07:37:09.45 PST  ++ starting: begin job
A fatal system signal has occurred: segmentation violation
#5  0x00007fa67c475332 in __strcmp_avx2 () from /lib64/libc.so.6
#6  0x00007fa67ef031ed in __gnu_cxx::__normal_iterator<edm::eventsetup::EventSetupRecordKey const*, ...
#7  0x00007fa67ede4f1c in edm::eventsetup::ESRecordsToProductResolverIndices::indexInRecord
#8  0x00007fa67edde357 in edm::EDConsumerBase::updateLookup
#9  0x00007fa67eef2246 in edm::stream::ProducingModuleAdaptorBase<edm::stream::EDProducerBase>::updateLookup
#10 0x00007fa67eece8b6 in edm::WorkerManager::beginJob

the commit range that introduces the dual cpu/cuda build and also the ES details are in https://github.com/SegmentLinking/cmssw/compare/a0caa0f45fec1fbc5cb6889ebe19d46eca379e51...cbfc1aa6351dbecafb830ffd5bddcec0362dad53

makortel commented 9 months ago

@slava77 Does the crash happen when running on CPU, or when running on GPU?

makortel commented 9 months ago

I also assume that lst_cpu and lst_cuda contain same symbols, i.e. only one of them can be loaded into a process. Is this correct?

Is lst_headers just providing the headers (that have the same content in both lst_cpu and lst_cuda)?

slava77 commented 9 months ago

@slava77 Does the crash happen when running on CPU, or when running on GPU?

it's a default alpaka for LST modules, so I guess it's GPU.

slava77 commented 9 months ago

I also assume that lst_cpu and lst_cuda contain same symbols, i.e. only one of them can be loaded into a process. Is this correct?

Is lst_headers just providing the headers (that have the same content in both lst_cpu and lst_cuda)?

lst_headers package does not provide a library

<tool name="lst_headers" version="1.0">
  <client>
    <environment name="LSTBASE" default="$CMSSW_BASE/external/lst/TrackLooper"/>
    <environment name="INCLUDE" default="$LSTBASE"/>
  </client>
  <runtime name="LST_BASE" value="$LSTBASE"/>
</tool>
slava77 commented 9 months ago

I also assume that lst_cpu and lst_cuda contain same symbols, i.e. only one of them can be loaded into a process. Is this correct?

I didn't check. The alpaka kernels will be different symbols, but yes, there can be the same symbols for non-dev-templated code

Recall that the code without my ES-related commits (up to 72edb8034a0feb84c43f35eb8af3b2974856210d in the diff runs

slava77 commented 9 months ago

with debug symbols added I found that the crash is in SeedCreatorFromRegionHitsEDProducerT<SeedFromConsecutiveHitsCreator> during calling edm::EDConsumerBase::updateLookup In it, the crash is m_esTokenInfo in it's first element

m_record = {type_ = {
<edm::TypeIDBase> = {t_ = 0x7fff1edb2fa0}, 
m_name = 0x8e44 <error: Cannot access memory at address 0x8e44>}}, 
m_key = {type_ = {<edm::TypeIDBase> = {
        t_ = 0x7fff94c048a8 <typeinfo for TrackerGeometry>}, 
m_name = 0x7fff94c00066 "TrackerGeometry"}

So, oddly enough the ES record detail gets corrupted and it's in a non-LST module. I could guess though that it's in the TrackerRecoGeometryRecord (the new ES consumed in the LSTProducer.cc). Where is the place upstream to check the origin of corruption?

makortel commented 9 months ago

Thanks @slava77. Going with the hypothesis the cause would be an ODR violation, I'd suggest the following in order to move forward (it's a little bit nasty, bit could be technically ok for now, given that the SDL::modulesBuffer<T> ES products are an implementation detail of the LST plugins, and as long as no-one tries to run the CPU and GPU versions simultaneously in one job)

Replace RecoTracker/LST/src/ES_ModulesDev.cc and RecoTracker/LST/src/alpaka/ES_ModulesDev.cc with RecoTracker/LST/plugins/alpaka/ES_ModulesDev.cc that contains

#include <SDL/Module.h>
#include "HeterogeneousCore/AlpakaCore/interface/alpaka/typelookup.h"

// Temporary hack: The DevHost instantiation is needed in both CPU and GPU plugins,
// whereas the (non-host-)Device instantiation only in the GPU plugin
TYPELOOKUP_DATA_REG(SDL::modulesBuffer<alpaka_common::DevHost>);
TYPELOOKUP_ALPAKA_TEMPLATED_DATA_REG(SDL::modulesBuffer);

Move the specialization of cms::alpakatools::CopyToDevice<SDL::modulesBuffer<alpaka_common::DevHost>> to RecoTracker/LST/plugins/alpaka/LSTModulesDevESProducer.cc and remove RecoTracker/LST/interface/LSTModulesDev.h (or move the header to ../plugins/alpaka directory).

Remove <use name="lst_cpu"/> from RecoTrackerLSTPlugins in RecoTracker/LST/plugins/BuildFile.xml.


Then a higher level question: are you using ESProducer only for the purpose to copy the data to all devices, or is it foreseen for the ES product to be used (shared) by many EDModules?

slava77 commented 9 months ago

Going with the hypothesis the cause would be an ODR violation, I'd suggest the following ... and as long as no-one tries to run the CPU and GPU versions simultaneously in one job

Don't we have a more substantive problem though:

libsdl_cpu.so and libsdl_cuda.so contain overlapping sets of symbols. I can confirm by running a version before my changes that gdb info sharedlib shows both libsdl_cpu.so and libsdl_cuda.so are loaded. Or is this safe in case library dependencies for plugins are somehow isolated?

makortel commented 9 months ago

Going with the hypothesis the cause would be an ODR violation, I'd suggest the following ... and as long as no-one tries to run the CPU and GPU versions simultaneously in one job

Don't we have a more substantive problem though:

  • pluginRecoTrackerLSTPlugins.so includes LSTOutputConverter.cc, LSTPixelSeedInputProducer.cc, LSTPhase2OTHitsInputProducer.cc and depends on libsdl_cpu.so
  • pluginRecoTrackerLSTPluginsPortableCudaCudaAsync.so includes LSTModulesDevESProducer.cc, LSTProducer.cc and depends on libsdl_cuda.so

Good point, already this setup should be a problem. For time being (in order to make progress, but to be undoed in the future) it would be ok to move the LSTOutputConverter.cc, LSTPixelSeedInputProducer.cc, and LSTPhase2OTHitsInputProducer.cc from plugins/ to plugins/alpaka/.

libsdl_cpu.so and libsdl_cuda.so contain overlapping sets of symbols. I can confirm by running a version before my changes that gdb info sharedlib shows both libsdl_cpu.so and libsdl_cuda.so are loaded.

I forgot to ask earlier if these "same symbols" are

I have been assuming the latter (and now I can't remember if it is just my assumption, or if we discussed about this detail in the hackathon).

Or is this safe in case library dependencies for plugins are somehow isolated?

No, library dependencies via plugins do not get any "extra safety". The library setup is somewhat different in that case though (like the loading order), so if the problem is really in ODR violation (which is undefined behavior), I could believe one setup to look like it works, and the other to fail in strange way.

slava77 commented 9 months ago

I forgot to ask earlier if these "same symbols" are

* same symbols (functions) containing the same code, or

* same symbols (functions) containing different code (e.g. CPU-specific code in case of `libsdl_cpu.so` and CUDA-specific code in case of `libsdl_cuda.so`)

both kind; there are quite a few helper methods that are independent of the accelerator type

slava77 commented 9 months ago

@makortel I see that ASAN_OPTIONS=detect_odr_violation=2 (or 1) could be used. Even though now it's obvious that the problem is there, it could be something useful to weed out the issues. Is there a way to run it with scram in the regular build or do I need to rebuild in the ASAN build?

slava77 commented 9 months ago

Replace RecoTracker/LST/src/ES_ModulesDev.cc and RecoTracker/LST/src/alpaka/ES_ModulesDev.cc with RecoTracker/LST/plugins/alpaka/ES_ModulesDev.cc that contains

#include <SDL/Module.h>
#include "HeterogeneousCore/AlpakaCore/interface/alpaka/typelookup.h"

// Temporary hack: The DevHost instantiation is needed in both CPU and GPU plugins,
// whereas the (non-host-)Device instantiation only in the GPU plugin
TYPELOOKUP_DATA_REG(SDL::modulesBuffer<alpaka_common::DevHost>);
TYPELOOKUP_ALPAKA_TEMPLATED_DATA_REG(SDL::modulesBuffer);

this does not compile:

In file included from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaCore/interface/alpaka/ESDeviceProduct.h:8,
                 from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaCore/interface/alpaka/typelookup.h:5,
                 from $CMSSW_BASE/src/RecoTracker/LST/src/alpaka/ES_ModulesDev.cc:2:
/cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaInterface/interface/config.h:103:45: error: template argument 2 is invalid
  103 |   using Acc = alpaka::AccCpuSerial<TDim, Idx>;
      |                                             ^
makortel commented 9 months ago

@makortel I see that ASAN_OPTIONS=detect_odr_violation=2 (or 1) could be used. Even though now it's obvious that the problem is there, it could be something useful to weed out the issues. Is there a way to run it with scram in the regular build or do I need to rebuild in the ASAN build?

I don't really know, but I'd guess the full ASAN build would be safer. I wonder if we already use detect_odr_violation in the ASAN IB.

makortel commented 9 months ago

Replace RecoTracker/LST/src/ES_ModulesDev.cc and RecoTracker/LST/src/alpaka/ES_ModulesDev.cc with RecoTracker/LST/plugins/alpaka/ES_ModulesDev.cc that contains

#include <SDL/Module.h>
#include "HeterogeneousCore/AlpakaCore/interface/alpaka/typelookup.h"

// Temporary hack: The DevHost instantiation is needed in both CPU and GPU plugins,
// whereas the (non-host-)Device instantiation only in the GPU plugin
TYPELOOKUP_DATA_REG(SDL::modulesBuffer<alpaka_common::DevHost>);
TYPELOOKUP_ALPAKA_TEMPLATED_DATA_REG(SDL::modulesBuffer);

this does not compile:

In file included from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaCore/interface/alpaka/ESDeviceProduct.h:8,
                 from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaCore/interface/alpaka/typelookup.h:5,
                 from $CMSSW_BASE/src/RecoTracker/LST/src/alpaka/ES_ModulesDev.cc:2:
/cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaInterface/interface/config.h:103:45: error: template argument 2 is invalid
  103 |   using Acc = alpaka::AccCpuSerial<TDim, Idx>;
      |                                             ^

Mmkay, the error looks strange, as the Idx should be defined in the config.h. Does the compilation log have any other failures?

slava77 commented 9 months ago

I don't really know, but I'd guess the full ASAN build would be safer. I wonder if we already use detect_odr_violation in the ASAN IB.

https://github.com/cms-sw/cmssw-config/blob/7061c60a2606087a9f8ff456d3afc4bbc78a0fca/Projects/CMSSW/Self.xml#L57 detect_odr_violation=0

slava77 commented 9 months ago

Mmkay, the error looks strange, as the Idx should be defined in the config.h. Does the compilation log have any other failures?

sorry, I stopped on the first part of the error message:

/cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaInterface/interface/config.h:103:45: error: template argument 2 is invalid
  103 |   using Acc = alpaka::AccCpuSerial<TDim, Idx>;
      |                                             ^
/cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaInterface/interface/config.h:104:17: error: 'Acc' {aka 'alpaka::AccCpuSerial<std::integral_constant<long unsigned int, 3>, long unsigned int>'} is not a template
  104 |   using Acc1D = Acc<Dim1D>;
      |                 ^~~
/cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaInterface/interface/config.h:105:17: error: 'Acc' {aka 'alpaka::AccCpuSerial<std::integral_constant<long unsigned int, 3>, long unsigned int>'} is not a template
  105 |   using Acc2D = Acc<Dim2D>;
      |                 ^~~
/cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_0_pre3/src/HeterogeneousCore/AlpakaInterface/interface/config.h:106:17: error: 'Acc' {aka 'alpaka::AccCpuSerial<std::integral_constant<long unsigned int, 3>, long unsigned int>'} is not a template
  106 |   using Acc3D = Acc<Dim3D>;
      |                 ^~~

there is also a GPU variant of the same set (not included above)

makortel commented 9 months ago

Mmkay, the error looks strange, as the Idx should be defined in the config.h. Does the compilation log have any other failures?

sorry, I stopped on the first part of the error message:

Thanks, although the rest didn't really help. I tested the following

#include "HeterogeneousCore/AlpakaCore/interface/alpaka/typelookup.h"

template <typename Dev>
struct FooTest {
  int foo;
};

TYPELOOKUP_DATA_REG(FooTest<alpaka_common::DevHost>);
TYPELOOKUP_ALPAKA_TEMPLATED_DATA_REG(FooTest);

as a HeterogeneousCore/AlpakaTest/plugins/alpaka/foo.cc and that compiled fine in 13_3_0_pre3. Does SDL::modulesBuffer<T> do something special?

makortel commented 9 months ago

I was able to reproduce the compilation error by adding using Idx = std::size_t; before the #include ".../typelookup.h (which is what is done in https://github.com/SegmentLinking/TrackLooper/blob/master/SDL/Constants.h#L25). The setup is also easy to reproduce in compiler explorer https://godbolt.org/z/heMbdTaqr.

I'd suggest to enclose those type aliases in SDL into a namespace. Or open an issue in CMSSW GitHub that the setup in config.h is somewhat brittle. Or do both. Or reorganize the code in SDL such that only the modulesBuffer gets exposed to CMSSW. Or reorder the includes so that SDL/Module.h comes last.

slava77 commented 9 months ago

I'd suggest to enclose those type aliases in SDL into a namespace.

I moved to namespace SDL in 10fe919

slava77 commented 9 months ago

I updated the PR description now that this seems to run