cms-patatrack / cmssw

CMSSW fork of the Patatrack project
https://patatrack.web.cern.ch/patatrack/index.html
Apache License 2.0

Assertion failed in CPU version of kernel_countMultiplicity #392

Open makortel opened 4 years ago

makortel commented 4 years ago

While running the CPU profiling workflow (customizePixelTracksSoAonCPUForProfiling()) on 11_0_0_pre7_Patatrack at NERSC, I got an assertion failure

Begin processing the 3901st record. Run 321177, Event 188714878, LumiSection 142 on stream 13 at 20-Sep-2019 20:12:57.849 PDT
RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernelsImpl.h:320: void kernel_countMultiplicity(const HitContainer*, const Quality*, CAConstants::TupleMultiplicity*): Assertion `nhits < 8' failed.
wrong mult 347 -1412
...
Thread 76 (Thread 0x2aaebc280700 (LWP 4401)):
...
#4  <signal handler called>
#5  0x00002aaaad63f207 in raise () from /lib64/libc.so.6
#6  0x00002aaaad6408f8 in abort () from /lib64/libc.so.6
#7  0x00002aaaad638026 in __assert_fail_base () from /lib64/libc.so.6
#8  0x00002aaaad6380d2 in __assert_fail () from /lib64/libc.so.6
#9  0x00002aab9a99280d in CAHitNtupletGeneratorKernels<cudaCompat::CPUTraits>::launchKernels(TrackingRecHit2DHeterogeneous<cudaCompat::CPUTraits> const&, TrackSoAT<32768>*, CUstream_st*) () from .../CMSSW_11_0_0_pre7_Patatrack/lib/slc7_amd64_gcc820/pluginRecoPixelVertexingPixelTripletsPlugins.so
#10 0x00002aab9a947413 in CAHitNtupletGeneratorOnGPU::makeTuples(TrackingRecHit2DHeterogeneous<cudaCompat::CPUTraits> const&, float) const () from .../cmssw/CMSSW_11_0_0_pre7_Patatrack/lib/slc7_amd64_gcc820/pluginRecoPixelVertexingPixelTripletsPlugins.so
#11 0x00002aab9a993b99 in CAHitNtupletCUDA::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from .../CMSSW_11_0_0_pre7_Patatrack/lib/slc7_amd64_gcc820/pluginRecoPixelVertexingPixelTripletsPlugins.so
...

when running on 64 streams/threads. The failure occurred only once in my tests of 4x{1, 16, 32}, 10x64, 4x{1, 20, 40}, and 10x80 streams/threads ("NxM" meaning "N runs of M streams/threads"), but I thought I should report it anyway.

makortel commented 4 years ago

FYI @VinInn

VinInn commented 4 years ago

would be interesting to understand if it is reproducible (at event level). It could be due to "memory" corruption. What events were those? mc/real? 2018/2021?

makortel commented 4 years ago

> would be interesting to understand if it is reproducible (at event level).

It's not very reproducible. As I wrote in the description, it occurred once in 44 executions (with varying numbers of streams/threads). I could of course try to repeat it (with a high thread count).

> It could be due to "memory" corruption. What events were those? mc/real? 2018/2021?

Real data, from LS 142 of run 321177 in 2018D JetHT ("the usual").

makortel commented 4 years ago

On closer inspection I found another assertion failure in the logs of the 44 jobs. It was definitely a different event:

Begin processing the 801st record. Run 321177, Event 188206932, LumiSection 142 on stream 0 at 18-Sep-2019 10:21:15.290 PDT
cmsRun: .../CMSSW_11_0_0_pre7_Patatrack/src/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernelsImpl.h:320: void kernel_countMultiplicity(const HitContainer*, const Quality*, CAConstants::TupleMultiplicity*): Assertion `nhits < 8' failed.
wrong mult 439 -1787
...
Thread 91 (Thread 0x2aaec5800700 (LWP 70795)):
...
#5  0x00002aaaad63f207 in raise () from /lib64/libc.so.6
#6  0x00002aaaad6408f8 in abort () from /lib64/libc.so.6
#7  0x00002aaaad638026 in __assert_fail_base () from /lib64/libc.so.6
#8  0x00002aaaad6380d2 in __assert_fail () from /lib64/libc.so.6
#9  0x00002aab9a8c980d in CAHitNtupletGeneratorKernels<cudaCompat::CPUTraits>::launchKernels(TrackingRecHit2DHeterogeneous<cudaCompat::CPUTraits> const&, TrackSoAT<32768>*, CUstream_st*) () from .../CMSSW_11_0_0_pre7_Patatrack/lib/slc7_amd64_gcc820/pluginRecoPixelVertexingPixelTripletsPlugins.so
#10 0x00002aab9a87e413 in CAHitNtupletGeneratorOnGPU::makeTuples(TrackingRecHit2DHeterogeneous<cudaCompat::CPUTraits> const&, float) const () from .../CMSSW_11_0_0_pre7_Patatrack/lib/slc7_amd64_gcc820/pluginRecoPixelVertexingPixelTripletsPlugins.so
#11 0x00002aab9a8cab99 in CAHitNtupletCUDA::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from .../CMSSW_11_0_0_pre7_Patatrack/lib/slc7_amd64_gcc820/pluginRecoPixelVertexingPixelTripletsPlugins.so

(This was on an 80-stream/thread job.)
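A note on the "wrong mult" lines: both logs print a negative second number (-1412 and -1787) while the assertion `nhits < 8' fails. A signed negative value would pass that check, so one possible reading is that the value is unsigned and only the print interprets it as signed. The sketch below assumes the multiplicity is the difference of two consecutive unsigned 32-bit offsets; that is an assumption, not something confirmed in this thread. The array `off` and the helpers `binSize` and `asSigned` are hypothetical names used for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not the CMSSW code): size of bin b, computed as the
// difference of two consecutive unsigned offsets. This is an ASSUMPTION about
// how the multiplicity could be obtained.
inline std::uint32_t binSize(const std::uint32_t* off, std::uint32_t b) {
  return off[b + 1] - off[b];  // wraps modulo 2^32 if off[b + 1] < off[b]
}

// Reinterpret the same 32 bits as a signed value, which is what a "%d"-style
// print would typically display for such an unsigned quantity.
inline std::int32_t asSigned(std::uint32_t v) {
  std::int32_t s;
  std::memcpy(&s, &v, sizeof(s));
  return s;
}
```

With made-up offsets {2000, 588}, `binSize` returns 4294965884, which fails a `< 8` check and displays as -1412 when read as signed. Under this assumption, a negative value in the log would point to non-monotonic offsets, that is, stale or corrupted content in the offset array.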

makortel commented 4 years ago

Hmm, I just repeated the 80-stream/thread job 150 times, with no failures.
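For scale, here is a rough estimate of how surprising 150 clean runs would be. It is a minimal sketch that assumes independent runs with a constant per-run failure probability, which may well not hold here. The function name `probAllClean` is made up for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Probability that n independent runs all succeed when each run fails with
// probability p. ASSUMPTION: runs are independent and p is constant.
inline double probAllClean(double p, unsigned n) {
  return std::pow(1.0 - p, static_cast<double>(n));
}
```

Taking the earlier 80-stream batch at face value (1 failure in 10 runs, p = 0.1), `probAllClean(0.1, 150)` is on the order of 1e-7. So either the rate seen in the first batch was a fluctuation, or something in the conditions changed between the two sets of runs.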

VinInn commented 4 years ago

The CPU workflow is supposed to be thread safe, except for the stats (not used in perfWf), which I still have to fix (they require proper handling of AtomicAdd). I can only think of uninitialized memory that is zeroed by "chance". That was the case at some point. One may have to try running it under valgrind... I will not blame cosmic rays or bad memory at NERSC.
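To illustrate the "zeroed by chance" hypothesis: a buffer allocated without an initializer has indeterminate content, and code that relies on it being zero works only as long as the allocator happens to return zeroed pages. The sketch below is generic C++, not the CMSSW allocation code; `makeZeroedCounters` and `allZero` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>

// Hypothetical helper: allocate n counters and zero them explicitly.
inline std::unique_ptr<std::uint32_t[]> makeZeroedCounters(std::size_t n) {
  // new T[n] without "()" default-initializes: the content is indeterminate
  // and may or may not be zero, depending on what the allocator hands back.
  std::unique_ptr<std::uint32_t[]> buf(new std::uint32_t[n]);
  // Explicit zeroing removes the dependence on allocator behaviour.
  std::memset(buf.get(), 0, n * sizeof(std::uint32_t));
  return buf;
}

// Check that every element of the buffer is zero.
inline bool allZero(const std::uint32_t* p, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    if (p[i] != 0) return false;
  }
  return true;
}
```

A tool such as valgrind's memcheck reports uses of uninitialized values, so running the job under it is one way to test this hypothesis without first guessing which buffer is affected.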