Open silviodonato opened 1 year ago
A new Issue was created by @silviodonato Silvio Donato.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
Note: the release is 13_0_5_patch1
assign hlt, reconstruction, heterogeneous
New categories assigned: heterogeneous,hlt,reconstruction
@mandrenguyen,@missirol,@fwyzard,@clacaputo,@makortel,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks
if one removes process.hltSiPixelDigis.cuda
how the rest of the pixel workflow is supposed to run on GPU?
Yes, indeed I wanted to say that you get different results even if you run everything on CPU except process.hltSiPixelDigis
on GPU vs everything on CPU.
Yes, indeed I wanted to say that you get different results even if you run everything on CPU except process.hltSiPixelDigis on GPU vs everything on CPU.
That's very very stange. unpaker is supposed to be identical. Nothing can be different (unless the cabiing map is wrong on GPU)
Hello Silvio, Thank you for reporting this. While electron HLT paths are affected, it looks like the real issue is somewhere upstream. Let me tag EGM HLT convenors @RSalvatico and @ravindkv and EGM Reco convenors @sameasy and @Prasant1993, who can follow it up.
Hello,
running on CPU and on GPU the pixel local sequence, I noticed that DetIdedmEDCollection_hltSiPixelDigis__HLTX.
(edm::EDCollection<DetId> "hltSiPixelDigis"
) is different.
(CPU)
root [1] Events->Scan("DetIdedmEDCollection_hltSiPixelDigis__HLTX.obj.size()")
************************
* Row * DetIdedmE *
************************
* 0 * 0 *
************************
vs (GPU)
root [1] Events->Scan("DetIdedmEDCollection_hltSiPixelDigis__HLTX.obj.size()")
************************
* Row * DetIdedmE *
************************
* 0 * 2 *
************************
not sure if this can be the cause of the large difference that appears in hltEgammaElectronPixelSeeds
:
recoElectronSeeds_hltEgammaElectronPixelSeeds__HLTX.@obj.size()
is 4 on CPU and 36 on GPU.
ping @cms-sw/trk-dpg-l2
UserErrorList = cms.vint32(40)
vs
UserErrorList = cms.vint32()
?
But those are "user" errors. Should not affect reco (still better to make the two config consistent....)
I've taken an event where we have differences (290th in the file provided). There is no difference in the local reco objects (clusters,digis and hits). The problem seems that on GPU we get extra SiPixelDigiErrors
. These are stored in the hltPixelDigis
and then are parsed to process.hltSiStripClusters = cms.EDProducer("MeasurementTrackerEventProducer" ...
and are used to flag the "bad detectors". This then messes with the hltEgammaElectronPixelSeeds
producer because trajectories are different (the trajectory builder checks if a module is active or not and a bad one is set to be inactive, if I understood well).
On GPU adding in RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
something like:
diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
index 293d4422e84..ab06086d150 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -168,11 +169,11 @@ namespace pixelgpudetails {
template <bool debug = false>
__device__ uint8_t
checkROC(uint32_t errorWord, uint8_t fedId, uint32_t link, const SiPixelROCsStatusAndMapping *cablingMap) {
uint8_t errorType = (errorWord >> sipixelconstants::ROC_shift) & sipixelconstants::ERROR_mask;
if (errorType < 25)
return 0;
+ printf("errorType;%d;%d;%d;%d;%d;%d\n",errorType,errorWord,fedId+1200,link,sipixelconstants::ROC_shift,sipixelconstants::ERROR_mask);
bool errorFound = false;
and for CPU side adding inRecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
something like:
diff --git a/EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc b/EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc
index 9bde98bef92..10cd0154d6d 100644
--- a/EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc
+++ b/EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc
@@ -27,8 +28,11 @@ bool ErrorChecker::checkROC(bool& errorsInEvent,
Word32& errorWord,
SiPixelFormatterErrors& errors) const {
int errorType = (errorWord >> ROC_shift) & ERROR_mask;
+
if LIKELY (errorType < 25)
return true;
+
+ printf("errorType;%d;%d;%d;%d;%d;%d\n",errorType,errorWord,fedId,(errorWord >> LINK_shift) & LINK_mask,ROC_shift,ERROR_mask);
the GPU output has a couple of extra lines:
3358a3359
> errorType;29;1201668113;1208;17
3359a3361
> errorType;29;329252868;1227;4
Now why this is happening still I don't know. But the difference is there and it's causing, in this event, modules 303083540
to be not active when running on GPU while it is when running on CPU.
here there is an early return https://cmssdt.cern.ch/dxr/CMSSW/source/EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc#55 here not https://cmssdt.cern.ch/dxr/CMSSW/source/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu#207
maybe also for other cases
this change alone (which if I am not mistaken would put the FED error zoology treatment on par between GPU and CPU)
diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
index 293d4422e84..2f728fe024e 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -189,13 +189,13 @@ namespace pixelgpudetails {
case (26): {
if constexpr (debug)
printf("Gap word found (errorType = 26)\n");
- errorFound = true;
+ errorFound = false;
break;
}
case (27): {
if constexpr (debug)
printf("Dummy word found (errorType = 27)\n");
- errorFound = true;
+ errorFound = false;
break;
}
case (28): {
@@ -208,8 +208,10 @@ namespace pixelgpudetails {
if constexpr (debug)
printf("Timeout on a channel (errorType = 29)\n");
if ((errorWord >> sipixelconstants::OMIT_ERR_shift) & sipixelconstants::OMIT_ERR_mask) {
+ errorFound = false;
if constexpr (debug)
printf("...first errorType=29 error, this gets masked out\n");
+ break;
}
errorFound = true;
break;
only gets the GPU count down of 1 (from 82 to 81 passing events).
Incidentally I realized we're not monitoring in the DQM the CPU vs GPU differences for the SiPixelDigiErrors
(which instead we should, as demonstrated here).
Hi everyone, following up on the comments and using this commit from Marco - I reran the to check the differences between CPU and GPU for Double_Ele*
paths. They are back to the 2-3% levels as before.
For example: Before implementing the commit (GPU was accepting 20% more events )
1135985 | 298 | 360 | 64 | -2 | 66 | 22.15 | HLT_DoubleEle6p5_eta1p22_mMax6_v3 |
---|
Now with the change:
1224276 | 708 | 702 | 5 | -11 | 16 | 2.26 | HLT_DoubleEle6p5_eta1p22_mMax6_v4 |
---|
Similarly for the other Double_Ele
paths.
The details are in this spreadsheet
@gparida can you please check the proposed fix at https://github.com/cms-sw/cmssw/pull/42010 as well?
Hi everyone, I checked with the PR https://github.com/cms-sw/cmssw/pull/42010 on same Data and the release, With the PR, the differences are reduced by 8-9% level
Before : | 1135985 | 154 | 190 | 36 | 0 | 36 | 23.38 | HLT_DoubleEle9p5_eta1p22_mMax6_v3 |
---|
Now with the PR:
915722 | 221 | 246 | 28 | -3 | 31 | 14.0271 | HLT_DoubleEle9p5_eta1p22_mMax6_v4 |
---|
Similarly for the other DoubleEle*
paths. Details are in this spreadsheet.
For the record there were two additional fixes:
PR https://github.com/cms-sw/cmssw/pull/42977 was included in CMSSW_13_2_6_patch1, that went online on October 13th 2023 for HI collision runs starting from run 375083. In the subsequent fill n. 9254 no further CPU vs GPU Pixel FED error mismatches were observed in online DQM, see e.g. for run 375110 .
For more information about the effect on the offline pp menu, see https://github.com/cms-sw/cmssw/pull/42978#issuecomment-1762781959.
Dear all,
@gparida recently made a new CPU vs CPU+GPU comparison of the trigger result of the 2023 HLT menu. The results showed a very large difference in the soft di-electron parking. Basically
HLT_DoubleEleXX_eta1p22_mMax6_v3
have a +25% of rate when running on GPU. The good news is that almost all events triggered by CPU are triggered also by GPU. This means that we are not loosing signal events at P5.The di-electron paths were recently updated in CMSHLT-2635 with the usage of triplets, instead of doublets, in the electron reconstruction.
Minor differences were already visible in the old version of the path (based on doublets) here , but in that case the average rate of CPU and GPU was compatible.
I investigated a bit the problem, and I see that the differences are already visible in the pixel matching module (
hltDoubleEle4eta1p22PixelMatchFilter
) before the GSF tracking.How to reproduce the problem
I copied in
/afs/cern.ch/work/s/sdonato/public/GPU_May23/hlt_onlypixelmatching_dump.py
a python config containing a fake HLT path running the pixel matching filter, and in/afs/cern.ch/work/s/sdonato/public/GPU_May23/skim.root
740 events passing all the filters before it.Running
you can see that:
The differences are still visible if you remove the following cuda branches
but they disappear if you remove
This means that the origin of the difference is somehow in the pixel unpacker:
@cms-sw/hlt-l2 @cms-sw/egamma-pog-l2 @cms-sw/trk-dpg-l2 @cms-sw/heterogeneous-l2