cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.08k stars 4.32k forks source link

Pixel Local GPU reco crashes on missing detector input (no Pixel FED) #34496

Closed czangela closed 3 years ago

czangela commented 3 years ago

1. Description

Similar to #34197.

The idea here was to remove all FED channels above 1199*, and run the reconstruction on this skimmed raw data.

This was run on the release CMSSW_12_0_0_pre3, and machine cmg-gpu1080.cern.ch.

[*] where 1200 is the minimum FED number for the silicon pixel detector.

2. Crash

The reconstruction crashes with a segmentation fault:

Current Modules:

Module: SiPixelRecHitFromCUDA:siPixelRecHitsPreSplitting@cuda (crashed)
Module: none

A fatal system signal has occurred: segmentation violation
Segmentation fault (core dumped)

Full log: crash.log

3. Reproduce - Short version

From https://aczirkos.web.cern.ch/aczirkos/pixel_crash_test/ run on the provided dataset:

foo@bar$ cmsRun step3.py

4. Reproduce - Long version

0. SSH to GPU equipped machine

Don't forget to be nice and

export CUDA_VISIBLE_DEVICES=P,Q

Where P, Q, etc. are the numbers of the visible GPUs, which you can view with nvidia-smi.

Init release area CMSSW_12_0_0_pre3.

1. generate configs and run

Use pixelTrackingOnly workflow:

136.885502_RunHLTPhy2018D+RunHLTPhy2018D+HLTDR2_2018+RECODR2_2018reHLT_Patatrack_PixelOnlyGPU+HARVEST2018_pixelTrackingOnly

runTheMatrix.py --what gpu -e -l 136.885502 --ibeos --maxSteps=3

2. modifiy step3_RAW2DIGI_RECO_DQM.py

-> add a new module named rawDataCollector to the beginning of the Schedule -> modules after this will see and use this collection

nonpixelfeds = list(range(0,1200))

process.rawDataCollector = cms.EDProducer( "EvFFEDSelector",
    inputTag = cms.InputTag( "tmpRawDataCollectorTag" ),
    fedList = cms.vuint32(nonpixelfeds)
)

Path, sequence, task definition:

process.rawDataCollector_task = cms.Task(process.rawDataCollector)
process.rawDataCollector_sequence = cms.Sequence(process.rawDataCollector_task)
# Path and EndPath definitions
process.rawDataCollector_step=cms.Path(process.rawDataCollector)

Add to schedule

# Schedule definition
process.schedule = cms.Schedule(process.rawDataCollector_step,process.raw2digi_step,process.reconstruction_step,process.dqmoffline_step,process.dqmofflineOnPAT_step,process.RECOoutput_step,process.MINIAODoutput_step,process.DQMoutput_step)
#process.schedule = cms.Schedule(process.raw2digi_step,process.reconstruction_step,process.dqmoffline_step,process.dqmofflineOnPAT_step,process.RECOoutput_step,process.MINIAODoutput_step,process.DQMoutput_step)

3. sed and replace rawDataCollector InputTags

foo@bar$ edmConfigDump step3_RAW2DIGI_RECO_DQM.py | sed 's/"rawDataCollector"/"fedSkimmedData"/g' | sed 's/"tmpRawDataCollectorTag"/"rawDataCollector"/g' step3_sed.py | sed 's/process.rawDataCollector/process.fedSkimmedData/g' > step3.py

run again

foo@bar$ cmsRun step3.py 
cmsbuild commented 3 years ago

A new Issue was created by @czangela .

@Dr15Jones, @perrotta, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 3 years ago

assign reconstruction, heterogeneous

makortel commented 3 years ago

FYI @cms-sw/trk-dpg-l2 @VinInn

cmsbuild commented 3 years ago

New categories assigned: heterogeneous,reconstruction

@slava77,@fwyzard,@perrotta,@makortel,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks

VinInn commented 3 years ago

Interesting as it WAS protected for 0 == nHits_ but just by eye now is buggy (check acquire and then produce) so it is obvious that produce will crash

mmusich commented 3 years ago

with this

diff --git a/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromCUDA.cc b/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromCUDA.cc
index f2f1497b4ba..5861a0be734 100644
--- a/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromCUDA.cc
+++ b/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromCUDA.cc
@@ -74,7 +74,7 @@ void SiPixelRecHitFromCUDA::acquire(edm::Event const& iEvent,

   nHits_ = inputData.nHits();

-  LogDebug("SiPixelRecHitFromCUDA") << "converting " << nHits_ << " Hits";
+  LogDebug("SiPixelRecHitFromCUDA") << "converting " << nHits_ << " Hits" << std::endl;

   if (0 == nHits_)
     return;
@@ -83,18 +83,21 @@ void SiPixelRecHitFromCUDA::acquire(edm::Event const& iEvent,
 }

 void SiPixelRecHitFromCUDA::produce(edm::Event& iEvent, edm::EventSetup const& es) {
+
   // allocate a buffer for the indices of the clusters
   auto hmsp = std::make_unique<uint32_t[]>(gpuClustering::maxNumModules + 1);
-  std::copy(hitsModuleStart_.get(), hitsModuleStart_.get() + gpuClustering::maxNumModules + 1, hmsp.get());
-  // wrap the buffer in a HostProduct, and move it to the Event, without reallocating the buffer or affecting hitsModuleStart
-  iEvent.emplace(hostPutToken_, std::move(hmsp));

   SiPixelRecHitCollection output;
+  output.reserve(gpuClustering::maxNumModules, nHits_);
   if (0 == nHits_) {
     iEvent.emplace(rechitsPutToken_, std::move(output));
+    iEvent.emplace(hostPutToken_, std::move(hmsp));
     return;
   }
-  output.reserve(gpuClustering::maxNumModules, nHits_);
+
+  std::copy(hitsModuleStart_.get(), hitsModuleStart_.get() + gpuClustering::maxNumModules + 1, hmsp.get());
+  // wrap the buffer in a HostProduct, and move it to the Event, without reallocating the buffer or affecting hitsModuleStart
+  iEvent.emplace(hostPutToken_, std::move(hmsp));

   auto xl = store32_.get();
   auto yl = xl + nHits_;

it runs for me. Acceptable?

VinInn commented 3 years ago

one can leave the reserve after the return; (very very minor). I think should be ok.

still: was tested (long long ago) with nHits_==0; and the history has been erased by the file renaming. So no way to understand when and why was changed.

VinInn commented 3 years ago

ok found a version in CMSSW_11_1_0_pre8_Patatrack when was named SiPixelRecHitFromSOA.cc and the code is the same. No clue "how" was tested then... (maybe before introducing "hmsp")

VinInn commented 3 years ago

maybe easier to just protect the copy

if(hitsModuleStart_) std::copy(hitsModuleStart_.get(), hitsModuleStart_.get() + gpuClustering::maxNumModules + 1, hmsp.get());
makortel commented 3 years ago

+heterogeneous

PRs to master and 11_3_X have been merged

slava77 commented 3 years ago

+reconstruction

34501 and #34503 were merged

cmsbuild commented 3 years ago

This issue is fully signed and ready to be closed.