cms-bot internal usage
A new Issue was created by @missirol.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign heterogeneous, reconstruction, hlt
New categories assigned: heterogeneous,reconstruction,hlt
@Martin-Grunewald,@mmusich,@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks
Let's tag @cms-sw/tracking-pog-l2
> The test modifies a recent HLT pp menu by setting the backend of the Alpaka pixel-tracks and pixel-vertices SoA producers to `"serial_sync"` (in other words, offloading the pixel local reconstruction to GPUs, then forcing track and vertex reconstruction to run on CPU). Is [1] supposed to work?
Theoretically I'd expect it to work, at least from the framework point of view.
type tracking
@AdrianoDee FYI
Compiling with debug symbols points to the crash occurring in https://github.com/cms-sw/cmssw/blob/96d37fb42f09d54ffbd623e1754dd5301deda924/RecoTracker/PixelSeeding/plugins/alpaka/CAPixelDoubletsAlgos.h#L270
Some additional information from a debugger session:

```
pIndex = 0
kl = 31
kk = 31
khh = 17
hoff = 256
phiBinner.off.m_v[hoff+kk] = 6504
phiBinner.content.m_capacity = 29601
```

so theoretically `p[0]` should be valid (`p = &(phiBinner.content.m_v[phiBinner.off.m_v[hoff+kk]])`), assuming `phiBinner.content.m_v` gets set properly.
Looking then at the `HitsConstView<TrackerTraits> hh` from which the `phiBinner` is obtained:

```
hh.elements_ = 29601
# consistent with phiBinner.content.m_capacity
hh.phiBinnerStorageParameters_.addr_ = 0x7fff5393e580
phiBinner.content.m_v = 0x7fff5373e580
# phiBinner.content.m_v is exactly 2 MiB smaller than phiBinnerStorageParameters_.addr_ !
# ok, the "exactly 2 MiB" could be a coincidence
```
`phiBinner.content.m_v` is set here
https://github.com/cms-sw/cmssw/blob/96d37fb42f09d54ffbd623e1754dd5301deda924/HeterogeneousCore/AlpakaInterface/interface/FlexiStorage.h#L27-L30
called from `OneToManyAssocBase<...>::initStorage()`
https://github.com/cms-sw/cmssw/blob/96d37fb42f09d54ffbd623e1754dd5301deda924/HeterogeneousCore/AlpakaInterface/interface/OneToManyAssoc.h#L49

AFAICT `initStorage()` is called only in the `zeroAndInit` kernel
https://github.com/cms-sw/cmssw/blob/96d37fb42f09d54ffbd623e1754dd5301deda924/HeterogeneousCore/AlpakaInterface/interface/OneToManyAssoc.h#L104
and the `launchZero` kernel
https://github.com/cms-sw/cmssw/blob/96d37fb42f09d54ffbd623e1754dd5301deda924/HeterogeneousCore/AlpakaInterface/interface/OneToManyAssoc.h#L137
In particular, I see that the device-to-host copy of `TrackingRecHitsSoACollection<TrackerTraits>`
https://github.com/cms-sw/cmssw/blob/96d37fb42f09d54ffbd623e1754dd5301deda924/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h#L35-L45
does not call `initStorage()`, nor set `phiBinner.content.m_v` in any other way.
I see the `HistoContainer` unit test does call `initStorage()` after the device-to-host copy
https://github.com/cms-sw/cmssw/blob/96d37fb42f09d54ffbd623e1754dd5301deda924/HeterogeneousCore/AlpakaInterface/test/alpaka/testHistoContainer.dev.cc#L201-L208
before inspecting the host-side data.
I think the device-to-host copy of `TrackingRecHitsSoACollection<TrackerTraits>` is missing the call to `initStorage()`; as a result `phiBinner.begin()` returns a pointer to device memory, and `p[pIndex]` then segfaults.
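As a self-contained illustration of this failure mode (plain C++, not the actual CMSSW classes; `FlexiStorageLike` and `PhiBinnerLike` below are made-up stand-ins), the embedded raw pointer keeps referring to the old buffer after the storage itself has been copied, unless it is explicitly re-initialized:

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-ins for the FlexiStorage / OneToManyAssoc pattern
struct FlexiStorageLike {
  void init(uint32_t* v, int capacity) {
    m_v = v;
    m_capacity = capacity;
  }
  uint32_t* m_v = nullptr;  // raw pointer into externally owned storage
  int m_capacity = 0;
};

struct PhiBinnerLike {
  FlexiStorageLike content;
  // returns whatever content.m_v currently points to; stale if init() was not re-run
  uint32_t const* begin() const { return content.m_v; }
};

int main() {
  std::vector<uint32_t> deviceLike(29601, 0);  // stands in for the device-side buffer
  PhiBinnerLike binner;
  binner.content.init(deviceLike.data(), static_cast<int>(deviceLike.size()));

  // "device-to-host copy": the storage is copied, but the embedded pointer is not updated
  std::vector<uint32_t> hostLike = deviceLike;
  // binner.begin() still points into deviceLike; with a real GPU backend that is device
  // memory, and dereferencing it on the host (p[pIndex]) segfaults

  // the missing step, analogous to calling initStorage() after the copy:
  binner.content.init(hostLike.data(), static_cast<int>(hostLike.size()));
  return 0;
}
```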
Given the comment
https://github.com/cms-sw/cmssw/blob/96d37fb42f09d54ffbd623e1754dd5301deda924/HeterogeneousCore/AlpakaInterface/test/alpaka/testHistoContainer.dev.cc#L201
it seems the `copyAsync()` function must synchronize with `alpaka::wait()` before calling `initStorage()`. This might be sufficient, at least for subsequent testing.
For the longer term, assuming we'd want to remove this `alpaka::wait()` call, it could be fairly straightforward to extend the `CopyToHost` and `CopyToDevice` class templates to allow a post-copy modification operation (in a way the present `CopyToHost::copyAsync()` resembles the `acquire()` method in `ExternalWork`/`SynchronizingEDProducer`, and the new function would correspond to the `produce()` method).
> ... it seems the `copyAsync()` function must synchronize with `alpaka::wait()` before calling `initStorage()`. This might be sufficient, at least for subsequent testing.
Shall we have a PR for this, while a more thorough fix is developed concerning https://github.com/cms-sw/framework-team/issues/989? I see that Marino has a commit https://github.com/missirol/cmssw/commit/62620da84d02d0749868ae6281490b461905c8d2 about it (I didn't test). Let me remind that this is in the critical path for the building of the 2024 HIon menu. @cms-sw/core-l2
https://github.com/missirol/cmssw/commit/62620da84d02d0749868ae6281490b461905c8d2 is my best-guess of a patch based on the explanations in https://github.com/cms-sw/cmssw/issues/45708#issuecomment-2294166204 (thanks @makortel for debugging the problem), but I don't know if it's correct.
I checked that it avoids the crash, and the trigger results are the same (modulo what I think are the usual small GPU-vs-CPU discrepancies) when running pixel tracking+vertexing on CPU (as in the reproducer in the description) vs running all Alpaka modules on GPU, but so far I only tested on O(10) events.
> Shall we have a PR for this, while a more thorough fix is developed concerning cms-sw/framework-team#989? I see that Marino has a commit missirol@62620da about it (I didn't test).
A fix along the lines of https://github.com/missirol/cmssw/commit/62620da84d02d0749868ae6281490b461905c8d2 is needed in any case. https://github.com/cms-sw/framework-team/issues/989 will only help to remove the `alpaka::wait()` call in https://github.com/missirol/cmssw/commit/62620da84d02d0749868ae6281490b461905c8d2.
> Let me remind that this is in the critical path for the building of the 2024 HIon menu.
Could you point me to a timeline?
Also, will the HLT use 14_0_X or 14_1_X for the HI data taking? (@missirol's test used 14_0_14, but my understanding is that 14_1_X would be the HI data-taking release cycle.) I'm asking early, because whether or not the outcome of https://github.com/cms-sw/framework-team/issues/989 needs to be backported impacts how it will be done (because a 14_1_X-only solution could use C++20 features).
> missirol@62620da is my best-guess of a patch based on the explanations in #45708 (comment)
I'd believe the lines related to `pbv` (i.e. 46-51) would not be needed, but it would be good if e.g. @AdrianoDee could confirm.
> I checked that it avoids the crash, and the trigger results are the same (modulo what I think are the usual small GPU-vs-CPU discrepancies) when running pixel tracking+vertexing on CPU (as in the reproducer in the description) vs running all Alpaka modules on GPU, but so far I only tested on O(10) events.
:+1: A performance test (to see the cost of the `alpaka::wait()`) would also be interesting.
@makortel
> Could you point me to a timeline?
Please refer to this; notice that any further tracking update hinges on this ticket entering first.
> Also, will the HLT use 14_0_X or 14_1_X for the HI data taking?
HLT will use 14_1_X for actual data-taking, but we're still integrating updates in 14_0_X (and will continue doing so until we have `CMSSW_14_1_0` out, when we'll move the confDB template for HLT menu development). Thus we'll need both a master PR and a backport of at least something along the lines of https://github.com/missirol/cmssw/commit/62620da84d02d0749868ae6281490b461905c8d2 in order to keep moving.
> I'd believe the lines related to `pbv` (i.e. 46-51) would not be needed, but it would be good if e.g. @AdrianoDee could confirm.
I can confirm it (see https://github.com/cms-sw/cmssw/pull/45743#discussion_r1722143309).
Sorry in advance for my ignorance...
> I'd believe the lines related to `pbv` (i.e. 46-51) would not be needed, but it would be good if e.g. @AdrianoDee could confirm.
I don't know how to remove L46; I thought the `initStorage` method required a `PhiBinnerView` as function argument.
If I remove L47-51, the reproducer crashes as follows.

```
cmsRun: /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/src/HeterogeneousCore/AlpakaInterface/interface/OneToManyAssoc.h:45: void cms::alpakatools::OneToManyAssocBase<I, ONES, SIZE>::initStorage(View) [with I = unsigned int; int ONES = 2561; int SIZE = -1]: Assertion `view.assoc == this' failed.
```
If I remove L48-51, the reproducer crashes as follows.

```
cmsRun: /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/src/HeterogeneousCore/AlpakaInterface/interface/OneToManyAssoc.h:47: void cms::alpakatools::OneToManyAssocBase<I, ONES, SIZE>::initStorage(View) [with I = unsigned int; int ONES = 2561; int SIZE = -1]: Assertion `view.contentStorage' failed.
```
Seems like they are necessary after all, thanks for trying it out.
For the longer term one could ask whether `TrackingRecHitsSoA` is really the right place for the `PhiBinner` etc. https://github.com/cms-sw/cmssw/issues/44700 has some related discussion, although this question might really belong to https://github.com/cms-sw/cmssw/issues/43796.
It's not your ignorance, it's my hastiness. I had completely missed the context here of this mixed GPU+CPU reconstruction, sorry. They are 100% needed.
@AdrianoDee , no worries, thanks for having a look. :)
> A performance test (to see the cost of the `alpaka::wait()`) would also be interesting.
The check is running..
@makortel @missirol I'm wondering if, instead of blocking with an `alpaka::wait(queue)` between the copy and the "fix-after-copy" part, it might work better to enqueue the "fix-after-copy" part in a host task on the same device queue:
```cpp
// Update the contents address of the phiBinner histo container after the copy from device happened
alpaka::enqueue(queue, [hits = hostData.nHits(), cview = hostData.view()]() {
  typename TrackingRecHitSoA<TrackerTraits>::PhiBinnerView pbv;
  auto view = cview;
  pbv.assoc = &(view.phiBinner());
  pbv.offSize = -1;
  pbv.offStorage = nullptr;
  pbv.contentSize = hits;
  pbv.contentStorage = view.phiBinnerStorage();
  view.phiBinner().initStorage(pbv);
});
```
(the extra copy of the `view` is needed because the variables captured by the lambda cannot be modified, unless one uses a `const_cast` or declares the lambda as `mutable`)
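For completeness, the `mutable` variant mentioned above would look roughly like this (same names and assumptions as in the snippet above, just without the extra local copy of the view):

```cpp
// Same host task as above, but with a mutable lambda so the captured view can be used directly
alpaka::enqueue(queue, [hits = hostData.nHits(), view = hostData.view()]() mutable {
  typename TrackingRecHitSoA<TrackerTraits>::PhiBinnerView pbv;
  pbv.assoc = &(view.phiBinner());
  pbv.offSize = -1;
  pbv.offStorage = nullptr;
  pbv.contentSize = hits;
  pbv.contentStorage = view.phiBinnerStorage();
  view.phiBinner().initStorage(pbv);
});
```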
For the record, the check below considers a recent pp menu, and shows a very small impact on HLT throughput (within uncertainties).
Reference ([json]()): `CMSSW_14_0_14_MULTIARCHS`.

```
Running 4 times over 40100 events with 8 jobs, each with 32 threads, 24 streams and 1 GPUs
 628.0 ± 0.0 ev/s (39800 events, 99.1% overlap)
 622.8 ± 0.1 ev/s (39800 events, 99.1% overlap)
 623.0 ± 0.1 ev/s (39800 events, 97.7% overlap)
 624.8 ± 0.0 ev/s (39800 events, 98.3% overlap)
 --------------------
 624.7 ± 2.4 ev/s
```
Target ([json]()): `CMSSW_14_0_14_MULTIARCHS` + https://github.com/cms-sw/cmssw/pull/45744.

```
Running 4 times over 40100 events with 8 jobs, each with 32 threads, 24 streams and 1 GPUs
 625.2 ± 0.0 ev/s (39800 events, 99.0% overlap)
 623.3 ± 0.0 ev/s (39800 events, 98.7% overlap)
 622.7 ± 0.0 ev/s (39800 events, 98.9% overlap)
 622.6 ± 0.0 ev/s (39800 events, 99.3% overlap)
 --------------------
 623.4 ± 1.2 ev/s
```
[*] `/cdaq/physics/Run2024/2e34/v1.4.3/HLT/V2` (recent pp menu), on `hilton-c2b02-44-01` (same hardware as a 2022/23 HLT node) (CPUs: 2 AMD EPYC 7763 64-Core; GPUs: 2 NVIDIA Tesla T4).

@fwyzard Could you remind me who controls the CPU thread(s) the Alpaka host task gets run in (in the case of the CUDA backend), and how those threads are managed?
The implementation has changed a few times since we started using alpaka, so I'm not 100% sure.
From a quick look at the code, it seems that each CUDA queue has a lazy-initialised host thread associated to it; this thread is initialised the first time a host task is submitted.
The thread has a queue of tasks to execute. Each task is enqueued in a blocked state, waiting on a condition variable. A CUDA callback is used to notify the condition variable, unblock the thread, and execute the task.
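For illustration, here is a very rough, self-contained sketch of that mechanism in plain C++ (no CUDA or alpaka; `HostTaskQueue`, `enqueue()` and `release()` are made-up names standing in for the real implementation, with `release()` playing the role of the CUDA callback):

```cpp
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// One worker thread per "queue"; tasks are enqueued in a blocked state and are only
// executed after release() is called (standing in for the CUDA callback that fires
// once the work preceding the host task on the stream has completed).
class HostTaskQueue {
public:
  HostTaskQueue() : worker_([this] { run(); }) {}
  ~HostTaskQueue() {
    {
      std::lock_guard<std::mutex> lk(mtx_);
      done_ = true;
    }
    cv_.notify_all();
    worker_.join();
  }

  void enqueue(std::function<void()> task) {
    std::lock_guard<std::mutex> lk(mtx_);
    tasks_.push(std::move(task));
  }

  void release() {  // the "CUDA callback": unblock the next pending task
    {
      std::lock_guard<std::mutex> lk(mtx_);
      ++released_;
    }
    cv_.notify_all();
  }

private:
  void run() {
    std::unique_lock<std::mutex> lk(mtx_);
    while (true) {
      cv_.wait(lk, [this] { return done_ || (released_ > 0 && !tasks_.empty()); });
      if (released_ > 0 && !tasks_.empty()) {
        auto task = std::move(tasks_.front());
        tasks_.pop();
        --released_;
        lk.unlock();
        task();  // run the host task outside the lock
        lk.lock();
      } else if (done_) {
        break;
      }
    }
  }

  std::mutex mtx_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  int released_ = 0;
  bool done_ = false;
  std::thread worker_;
};

int main() {
  HostTaskQueue queue;
  queue.enqueue([] { std::cout << "fix-after-copy task running on the host thread\n"; });
  queue.release();  // pretend the device-to-host copy finished and the callback fired
}
```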
Thanks @fwyzard. My feeling is that the Alpaka host task might help if this `alpaka::wait()` call had a visible impact (which doesn't seem to be the case in @missirol's test). On the other hand, its implementation sounds complex enough that for the longer term I'd expect the extension of `CopyTo{Host,Device}` to be worth it (e.g. it won't add any new synchronization calls to the system).
Solutions proposed in the short term:
- a master PR and its backport (included in `CMSSW_14_0_15`);
- an update of the HIon menu (see this comment in CMSHLT-3824).

EDIT: it looks like these PRs generated the issue https://github.com/cms-sw/cmssw/issues/45834, thus removing the `hlt` signature.
> For the longer term, assuming we'd want to remove this `alpaka::wait()` call, it could be fairly straightforward to extend the `CopyToHost` and `CopyToDevice` class templates to allow a post-copy modification operation (in a way the present `CopyToHost::copyAsync()` resembles the `acquire()` method in `ExternalWork`/`SynchronizingEDProducer`, and the new function would correspond to the `produce()` method).
A possibility for `CopyToHost<T>::postCopy()` is added in https://github.com/cms-sw/cmssw/pull/45801. (Such a facility is not needed for `CopyToDevice`, which can do a similar operation by enqueuing a kernel call to the queue.)
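As a self-contained sketch of the idea in plain C++ (not the actual interface added in #45801; `CopyToHostLike` and `HostHitsLike` are made-up stand-ins): `copyAsync()` only performs the copy, while `postCopy()`, invoked once the copy is known to be complete, does the `initStorage()`-like pointer fix-up, so no blocking wait is needed inside `copyAsync()`:

```cpp
#include <cstdint>
#include <vector>

// Host-side product: owns the copied storage plus a pointer that must be fixed up
// before the product is usable on the host.
struct HostHitsLike {
  std::vector<uint32_t> storage;
  uint32_t const* binnerBegin = nullptr;  // must point into 'storage' once the copy is done
};

// Simplified CopyToHost-like trait: copyAsync() only performs/enqueues the copy,
// postCopy() is called by the framework once the copy is known to be complete.
template <typename TProduct>
struct CopyToHostLike;

template <>
struct CopyToHostLike<std::vector<uint32_t>> {
  static HostHitsLike copyAsync(std::vector<uint32_t> const& deviceLike) {
    HostHitsLike host;
    host.storage = deviceLike;  // on a real backend this would be an asynchronous memcpy
    return host;
  }
  static void postCopy(HostHitsLike& host) {
    host.binnerBegin = host.storage.data();  // the initStorage()-like fix-up, no wait needed here
  }
};

int main() {
  std::vector<uint32_t> deviceLike(29601, 0);
  auto host = CopyToHostLike<std::vector<uint32_t>>::copyAsync(deviceLike);
  // ... the framework synchronizes the queue before the consumer sees the product ...
  CopyToHostLike<std::vector<uint32_t>>::postCopy(host);
  return 0;
}
```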
> A possibility for `CopyToHost<T>::postCopy()` is added in #45801. (Such a facility is not needed for `CopyToDevice`, which can do a similar operation by enqueuing a kernel call to the queue.)
Thanks @makortel. I would suggest to try and adopt it for CMSSW 14.2.x, and stick to the simpler bugfix for 14.0.x/14.1.x.
> > A possibility for `CopyToHost<T>::postCopy()` is added in #45801. (Such a facility is not needed for `CopyToDevice`, which can do a similar operation by enqueuing a kernel call to the queue.)
>
> Thanks @makortel. I would suggest to try and adopt it for CMSSW 14.2.x, and stick to the simpler bugfix for 14.0.x/14.1.x.
Ok.
+1
+heterogeneous
@cmsbuild, please close
This issue is fully signed and ready to be closed.
The test in [1] crashes at runtime in `CMSSW_14_0_14` when running on a machine with a GPU (I did not try on a machine without one). The test modifies a recent HLT pp menu by setting the backend of the Alpaka pixel-tracks and pixel-vertices SoA producers to `"serial_sync"` (in other words, offloading the pixel local reconstruction to GPUs, then forcing track and vertex reconstruction to run on CPU). This mimics the setup that the HIon group plans to implement in the lead-lead trigger menu of 2024 (see CMSHLT-3284) [*].

The stack trace is in [2]. The crash does not happen if one uses `options.accelerators = ['cpu']`.

Is [1] supposed to work? If so, what's going wrong?
[*] This 'mixed' approach (pixel local reconstruction on GPU, tracking and vertexing on CPU) has already been used in the 2023 HIon run, back then using the CUDA-based implementation of the pixel reconstruction. Pixel tracking is currently not offloaded to GPUs in the HIon menu because this leads to excessive GPU memory consumption (and, in turn, runtime crashes) in lead-lead events (at least with current data-taking conditions and current HLT hardware).
[1]
[2]