Is it possible that
alpaka::memset(event.queue(), hostProduct.buffer(), 0xFF);
is NOT synchronous? (which does not make sense for the host buffer)
Very good point.
Yes, if the buffer is in pinned host memory in preparation for a GPU copy, the queue will be the GPU one, and the operation is potentially asynchronous.
But this should not change the behaviour of the ESProducer, apart from the DEBUG statements.
I mean - maybe it does change the behaviour, but then it's a bug and we should fix it.
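For illustration, a minimal sketch of the distinction being discussed, reusing the names from the snippet above; the explicit wait is just one possible way to force completion, not what the producer currently does:

// Enqueued on the queue associated with the buffer: for a pinned host buffer
// paired with a GPU queue this is potentially asynchronous, i.e. the fill
// happens only when the queue reaches this operation.
alpaka::memset(event.queue(), hostProduct.buffer(), 0xFF);

// One way to force completion before the host code touches the buffer:
// block until everything enqueued so far on that queue has executed.
alpaka::wait(event.queue());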
I have been investigating with the new Alpaka HCAL local reco.
Start with CMSSW_14_0_9_patch2_MULTIARCHS + #45277 + #45278 + #45324 + #45342 + #45210.
Use the same script as in https://github.com/cms-sw/cmssw/issues/44923#issuecomment-2199709930, but append the following to the configs as well:
from HLTrigger.Configuration.customizeHLTforAlpaka import customizeHLTforAlpakaPFSoA
from HLTrigger.Configuration.customizeHLTforAlpaka import customizeHLTforAlpakaHcalLocalReco
process = customizeHLTforAlpakaHcalLocalReco(process)
process = customizeHLTforAlpakaPFSoA(process)
Now crashing with:
----- Begin Fatal Exception 03-Jul-2024 10:29:09 CEST-----------------------
An exception of category 'StdException' occurred while
[0] Processing Event run: 380399 lumi: 123 event: 121628668 stream: 0
[1] Running path 'HLT_Mu12_DoublePFJets54MaxDeta1p6_PNet2BTag_0p11_v2'
[2] Calling method for module HBHERecHitProducerPortable@alpaka/'hltHbheRecoSoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_9_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_9_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(160) 'TApi::eventRecord(event.getNativeHandle(), queue.getNativeHandle())' returned error : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
----- End Fatal Exception -------------------------------------------------
Here is the log for one of the runs: hlt_run380399.log
And the corresponding compute-sanitizer --tool memcheck log: memcheck.log
Start with CMSSW_14_0_9_patch2_MULTIARCHS + https://github.com/cms-sw/cmssw/pull/45277 + https://github.com/cms-sw/cmssw/pull/45278 + https://github.com/cms-sw/cmssw/pull/45324 + https://github.com/cms-sw/cmssw/pull/45342 + https://github.com/cms-sw/cmssw/pull/45210.
Just for the record, a good bunch of this (except the alpaka HCAL local reco PR) will be included in CMSSW_14_0_10_MULTIARCHS, which is built but not yet uploaded.
I do not understand why memset of the host buffer should be asynchronous, given that filling it (as in the loop in the producers) is by definition synchronous... (I mean, one can make things complicated and schedule it as a CPU function on the GPU queue, but 1) that is not what we do, and 2) what's the gain?)
Anyhow, at the moment the memset in the caching allocator is buggy, as it may set the memory after it has been correctly filled by the producer.
I do not understand why memset of the host buffer should be asynchronous, given that filling it (as in the loop in the producers) is by definition synchronous...
Technically: because, for pinned host memory, the allocation itself is potentially asynchronous.
In practice: asynchronous allocations are actually disabled for the host-side allocator that uses a device queue, so there is indeed a bug in how the memset is implemented for this specific case.
(I mean, one can make things complicated and schedule it as a CPU function on the GPU queue, but 1) that is not what we do, and 2) what's the gain?)
Correct.
Anyhow, at the moment the memset in the caching allocator is buggy, as it may set the memory after it has been correctly filled by the producer.
I agree, and will prepare a fix.
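Schematically, the ordering problem being described is something like the following (a hypothetical sketch, not the actual allocator or producer code; fillOnHost is a made-up placeholder, and the buffer helper is assumed to be cms::alpakatools::make_host_buffer):

// pinned host buffer whose allocation is associated with a GPU queue
auto buf = cms::alpakatools::make_host_buffer<float[]>(queue, size);

// the memset is enqueued on the GPU queue and may execute only later
alpaka::memset(queue, buf, 0xFF);

// the producer fills the buffer synchronously on the CPU, without any
// synchronisation with the queue, so nothing orders this write with
// respect to the enqueued memset
fillOnHost(buf);  // hypothetical placeholder for the producer's fill loop

// if the enqueued memset executes after this point, it overwrites the
// values that the producer just wrote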
@VinInn do you think https://github.com/cms-sw/cmssw/pull/45368 fixes this problem?
A more efficient solution would be to make the memset itself non-asynchronous; I'm not sure if that can be done easily when the queue associated with the allocation is asynchronous, though.
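For example, one possible shape of a non-asynchronous host-buffer memset could be the following (a sketch only, under the assumption that the alpaka calls behave as named; memsetHostSynchronously is a hypothetical helper, not existing code):

#include <cstring>

#include <alpaka/alpaka.hpp>

template <typename TQueue, typename TBuf>
void memsetHostSynchronously(TQueue& queue, TBuf& buf, unsigned char value) {
  // make sure that anything already enqueued on the associated queue
  // (e.g. a potentially asynchronous allocation) has completed
  alpaka::wait(queue);
  // then fill the pinned host memory directly on the CPU, synchronously
  std::memset(alpaka::getPtrNative(buf), value, alpaka::getExtentProduct(buf) * sizeof(alpaka::Elem<TBuf>));
}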
With the fix in #45368 no more crashes are observed; I ran the full script. This also implies that the crashes at the HLT farm are not easy to reproduce or emulate.
Dear all, I think I am a bit lost about what the current (as of CMSSW_14_0_12) expectation is in terms of crashes when running with the options to fill the host and device allocators with junk memory (cf. https://github.com/cms-sw/cmssw/issues/44923#issuecomment-2199627075). Is the error at https://github.com/cms-sw/cmssw/issues/44923#issuecomment-2199586926 expected to be cured in CMSSW_14_0_12? @jsamudio
@mmusich I am also not sure where things stand currently; I was under the impression that #45368 was going to be the answer to the crashes. I guess this is not the case?
The various PRs implement and fix the possibility of filling memory with zero or junk values, which may be helpful for debugging.
Regular workflows are not affected either way.
The various PRs implement and fix the possibility of filling memory with zero or junk values, which may be helpful for debugging.
Right. But do you expect that, when explicitly filling memory with zero or junk values in CMSSW_14_0_12 and running the current HLT menu (V1.3) over recent data, there will still be crashes or not?
I don't know.
According to the test that Vincenzo did, I don't expect it to fix the crashes.
I don't know. According to the test that Vincenzo did, I don't expect it to fix the crashes.
I see. The reason why I ask is basically https://github.com/cms-sw/cmssw/issues/45555#issuecomment-2250084953.
Maybe we should change the assert into some form of LogWarning with more details (a full dump of the vectors? just a printf, but be careful with the size of the output).
Coming back to this suggestion, since this issue is still unsolved and causing crashes (approx. a few per week), would it make sense to integrate the following change (to have a bit more info in the log files when there is a crash) ?
(if we just demote the assert to a warning, I fear the warning might be missed, since I don't think we systematically check the logs of HLT jobs that don't crash)
diff --git a/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h b/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h
index a8c428e2f5a..2f78723e61d 100644
--- a/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h
+++ b/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h
@@ -74,7 +74,19 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE::vertexFinder {
     alpaka::syncBlockThreads(acc);
     // reuse nn
     for (auto i : cms::alpakatools::uniform_elements(acc, foundClusters)) {
-      ALPAKA_ASSERT_ACC(wv[i] > 0.f);
+      bool const wv_cond = (wv[i] > 0.f);
+      if (not wv_cond) {
+        printf("ERROR: wv[%d] (%f) > 0.f failed\n", i, wv[i]);
+        // printing info on tracks associated to this vertex
+        for (auto trk_i = 0u; trk_i < nt; ++trk_i) {
+          if (iv[trk_i] != int(i)) {
+            continue;
+          }
+          printf(" iv[%d]=%d zt[%d]=%f ezt2[%d]=%f\n", trk_i, iv[trk_i], trk_i, zt[trk_i], trk_i, ezt2[trk_i]);
+        }
+        ALPAKA_ASSERT_ACC(false);
+      }
+
       zv[i] /= wv[i];
       nn[i] = -1;  // ndof
     }
Coming back to this suggestion, since this issue is still unsolved and causing crashes (approx. a few per week), would it make sense to integrate the following change (to have a bit more info in the log files when there is a crash) ?
as discussed at the last TSG meeting, I think that's a possible way forward. Opened:
Thanks @mmusich !
compute-sanitizer --tool=racecheck --racecheck-report=all does report potential "RAW" (read-after-write) and "WAR" (write-after-read) hazards in RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h:
========= Error: Potential WAR hazard detected at __shared__ 0xafac in block (0,0,0) :
========= Read Thread (479,0,0) at 0x6d30 in /data/user/fwyzard/issue44923/CMSSW_14_0_13_patch1_MULTIARCHS/src/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h:148:void alpaka_cuda_async::vertexFinder::splitVertices<alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, (unsigned long)1>, unsigned int>>(const T1 &, reco::ZVertexLayout<(unsigned long)128, (bool)0>::ViewTemplateFreeParams<(unsigned long)128, (bool)0, (bool)1, (bool)1> &, vertexFinder::PixelVertexWSSoALayout<(unsigned long)128, (bool)0>::ViewTemplateFreeParams<(unsigned long)128, (bool)0, (bool)1, (bool)1> &, float)
========= Write Thread (352,0,0) at 0x5db0 in /data/user/fwyzard/issue44923/CMSSW_14_0_13_patch1_MULTIARCHS/src/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h:69:void alpaka_cuda_async::vertexFinder::splitVertices<alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, (unsigned long)1>, unsigned int>>(const T1 &, reco::ZVertexLayout<(unsigned long)128, (bool)0>::ViewTemplateFreeParams<(unsigned long)128, (bool)0, (bool)1, (bool)1> &, vertexFinder::PixelVertexWSSoALayout<(unsigned long)128, (bool)0>::ViewTemplateFreeParams<(unsigned long)128, (bool)0, (bool)1, (bool)1> &, float)
========= Current Value : 0, Incoming Value : 0
========= ...
=========
Looking at the two lines of RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h (line 69 and line 148), I'm thinking that the problem may happen when this loop rolls over without any synchronisation, and one thread sets nq = 0 on line 69 while another is still looping over it on line 148.
Adding
diff --git a/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h b/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h
index e2ba0b46b8be..be3b20563663 100644
--- a/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h
+++ b/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h
@@ -150,6 +150,8 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE::vertexFinder {
         iv[it[k]] = igv;
       }
+      // synchronise the threads before starting the next iteration of the loop of the groups
+      alpaka::syncBlockThreads(acc);
     }  // loop on vertices
   }
before the end of the loop seems to make racecheck happy (well, it makes it complain about a different piece of code, in alpaka_cuda_async::pixelClustering::FindClus).
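For reference, the hazard pattern can be illustrated with a standalone CUDA snippet along these lines (a hypothetical illustration, not the actual splitVertices code; the shared counter nq mimics the one in the workspace, and the work itself is elided):

#include <cuda_runtime.h>

// Illustration of the write-after-read hazard: a shared counter is reset at
// the start of each iteration of an outer loop over groups, so a block-wide
// barrier is needed before the next iteration may reset it again.
__global__ void groupLoop(int nGroups) {
  __shared__ int nq;  // shared work-list size, reused across iterations

  for (int group = 0; group < nGroups; ++group) {
    if (threadIdx.x == 0)
      nq = 0;         // reset, analogous to splitVertices.h line 69
    __syncthreads();  // all threads see nq == 0 before it is filled

    // cooperative filling of the work list would update nq here
    if (threadIdx.x == 0)
      nq = 42;
    __syncthreads();

    for (int k = threadIdx.x; k < nq; k += blockDim.x) {
      // reads of nq and of the work list, analogous to splitVertices.h line 148
      (void)k;
    }

    // without this barrier a fast thread can loop back and reset nq to 0
    // while a slow thread is still reading it in the loop above (WAR hazard)
    __syncthreads();
  }
}

int main() {
  groupLoop<<<1, 64>>>(3);
  return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}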
assign hlt
New categories assigned: hlt
@Martin-Grunewald, @mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks
Proposed fixes:
(CMSSW_14_0_X)
CMSSW_14_0_14_MULTIARCHS was deployed on Aug 12, 2024 (see e-log: http://cmsonline.cern.ch/cms-elog/1230042) during run 384365. No crashes of this type have been observed (so far) in the following physics fill 9996.
+hlt
+heterogeneous
@cms-sw/reconstruction-l2 please consider signing this if there is no other follow up from your area, such that we could close this issue.
+1
This issue is fully signed and ready to be closed.
@cmsbuild, please close
Crashes observed in collisions Run 380399. Stack traces:
We tried to reproduce it with the following recipe, but it didn't reproduce.
Message #8 in the first stack trace seems to point to the alpaka_cuda_async::EcalRawToDigiPortable::produce() method.
@cms-sw/hlt-l2 FYI, @cms-sw/heterogeneous-l2 FYI
Best regards, Thiago (for FOG)