cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.09k stars 4.32k forks source link

Large GPU/CPU difference in soft electron reconstruction related to pixel unpacker #41715

Open silviodonato opened 1 year ago

silviodonato commented 1 year ago

Dear all,

@gparida recently made a new CPU vs CPU+GPU comparison of the trigger result of the 2023 HLT menu. The results showed a very large difference in the soft di-electron parking. Basically HLT_DoubleEleXX_eta1p22_mMax6_v3 have a +25% of rate when running on GPU. The good news is that almost all events triggered by CPU are triggered also by GPU. This means that we are not loosing signal events at P5.

The di-electron paths were recently updated in CMSHLT-2635 with the usage of triplets, instead of doublets, in the electron reconstruction.

Minor differences were already visible in the old version of the path (based on doublets) here , but in that case the average rate of CPU and GPU was compatible.

I investigated a bit the problem, and I see that the differences are already visible in the pixel matching module (hltDoubleEle4eta1p22PixelMatchFilter) before the GSF tracking.

How to reproduce the problem

I copied in /afs/cern.ch/work/s/sdonato/public/GPU_May23/hlt_onlypixelmatching_dump.py a python config containing a fake HLT path running the pixel matching filter, and in /afs/cern.ch/work/s/sdonato/public/GPU_May23/skim.root 740 events passing all the filters before it.

Running

CUDA_VISIBLE_DEVICES=0 cmsRun hlt_onlypixelmatching_dump.py >& logGPU &
CUDA_VISIBLE_DEVICES= cmsRun hlt_onlypixelmatching_dump.py >& logCPU &

you can see that:

The differences are still visible if you remove the following cuda branches

del process.hltEcalDigis.cuda
del process.hltEcalUncalibRecHit.cuda
del process.hltHbhereco.cuda
del process.hltPixelTracksSoA.cuda
del process.hltPixelVerticesSoA.cuda
del process.hltSiPixelRecHits.cuda
del process.hltSiPixelRecHitsSoA.cuda
del process.hltSiPixelClusters.cuda

but they disappear if you remove

del process.hltSiPixelDigis.cuda

This means that the origin of the difference is somehow in the pixel unpacker:

process.hltSiPixelClustersGPU = cms.EDProducer("SiPixelRawToClusterCUDA",
    CablingMapLabel = cms.string(''),
    IncludeErrors = cms.bool(True),
    InputLabel = cms.InputTag("rawDataCollector"),
    Regions = cms.PSet(

    ),
    UseQualityInfo = cms.bool(False),
    clusterThreshold_layer1 = cms.int32(4000),
    clusterThreshold_otherLayers = cms.int32(4000),
    isRun2 = cms.bool(False)
)

process.hltSiPixelDigiErrorsSoA = cms.EDProducer("SiPixelDigiErrorsSoAFromCUDA",
    src = cms.InputTag("hltSiPixelClustersGPU")
)

process.hltSiPixelDigisFromSoA = cms.EDProducer("SiPixelDigiErrorsFromSoA",
    CablingMapLabel = cms.string(''),
    ErrorList = cms.vint32(29),
    UsePhase1 = cms.bool(True),
    UserErrorList = cms.vint32(40),
    digiErrorSoASrc = cms.InputTag("hltSiPixelDigiErrorsSoA")
)

process.hltSiPixelDigisLegacy = cms.EDProducer("SiPixelRawToDigi",
    CablingMapLabel = cms.string(''),
    ErrorList = cms.vint32(29),
    IncludeErrors = cms.bool(True),
    InputLabel = cms.InputTag("rawDataCollector"),
    Regions = cms.PSet(

    ),
    SiPixelQualityLabel = cms.string(''),
    UsePhase1 = cms.bool(True),
    UsePilotBlade = cms.bool(False),
    UseQualityInfo = cms.bool(False),
    UserErrorList = cms.vint32()
)

process.hltSiPixelDigis = SwitchProducerCUDA(
    cpu = cms.EDAlias(
        hltSiPixelDigisLegacy = cms.VPSet(
            cms.PSet(
                type = cms.string('DetIdedmEDCollection')
            ),
            cms.PSet(
                type = cms.string('SiPixelRawDataErroredmDetSetVector')
            ),
            cms.PSet(
                type = cms.string('PixelFEDChanneledmNewDetSetVector')
            )
        )
    ),
    cuda = cms.EDAlias(
        hltSiPixelDigisFromSoA = cms.VPSet(cms.PSet(
            type = cms.string('*')
        ))
    )
)

@cms-sw/hlt-l2 @cms-sw/egamma-pog-l2 @cms-sw/trk-dpg-l2 @cms-sw/heterogeneous-l2

cmsbuild commented 1 year ago

A new Issue was created by @silviodonato Silvio Donato.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

AdrianoDee commented 1 year ago

Note: the release is 13_0_5_patch1

makortel commented 1 year ago

assign hlt, reconstruction, heterogeneous

cmsbuild commented 1 year ago

New categories assigned: heterogeneous,hlt,reconstruction

@mandrenguyen,@missirol,@fwyzard,@clacaputo,@makortel,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks

VinInn commented 1 year ago

if one removes process.hltSiPixelDigis.cuda how the rest of the pixel workflow is supposed to run on GPU?

silviodonato commented 1 year ago

Yes, indeed I wanted to say that you get different results even if you run everything on CPU except process.hltSiPixelDigis on GPU vs everything on CPU.

VinInn commented 1 year ago

Yes, indeed I wanted to say that you get different results even if you run everything on CPU except process.hltSiPixelDigis on GPU vs everything on CPU.

That's very very stange. unpaker is supposed to be identical. Nothing can be different (unless the cabiing map is wrong on GPU)

swagata87 commented 1 year ago

Hello Silvio, Thank you for reporting this. While electron HLT paths are affected, it looks like the real issue is somewhere upstream. Let me tag EGM HLT convenors @RSalvatico and @ravindkv and EGM Reco convenors @sameasy and @Prasant1993, who can follow it up.

silviodonato commented 1 year ago

Hello, running on CPU and on GPU the pixel local sequence, I noticed that DetIdedmEDCollection_hltSiPixelDigis__HLTX. (edm::EDCollection<DetId> "hltSiPixelDigis") is different. (CPU)

root [1] Events->Scan("DetIdedmEDCollection_hltSiPixelDigis__HLTX.obj.size()")
************************
*    Row   * DetIdedmE *
************************
*        0 *         0 *
************************

vs (GPU)

root [1] Events->Scan("DetIdedmEDCollection_hltSiPixelDigis__HLTX.obj.size()")
************************
*    Row   * DetIdedmE *
************************
*        0 *         2 *
************************

not sure if this can be the cause of the large difference that appears in hltEgammaElectronPixelSeeds:

recoElectronSeeds_hltEgammaElectronPixelSeeds__HLTX.@obj.size() is 4 on CPU and 36 on GPU.

silviodonato commented 1 year ago

ping @cms-sw/trk-dpg-l2

VinInn commented 1 year ago

UserErrorList = cms.vint32(40) vs UserErrorList = cms.vint32() ?

But those are "user" errors. Should not affect reco (still better to make the two config consistent....)

AdrianoDee commented 1 year ago

I've taken an event where we have differences (290th in the file provided). There is no difference in the local reco objects (clusters,digis and hits). The problem seems that on GPU we get extra SiPixelDigiErrors. These are stored in the hltPixelDigis and then are parsed to process.hltSiStripClusters = cms.EDProducer("MeasurementTrackerEventProducer" ... and are used to flag the "bad detectors". This then messes with the hltEgammaElectronPixelSeeds producer because trajectories are different (the trajectory builder checks if a module is active or not and a bad one is set to be inactive, if I understood well).

On GPU adding in RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu something like:

diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
index 293d4422e84..ab06086d150 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -168,11 +169,11 @@ namespace pixelgpudetails {
   template <bool debug = false>
   __device__ uint8_t
   checkROC(uint32_t errorWord, uint8_t fedId, uint32_t link, const SiPixelROCsStatusAndMapping *cablingMap) {
     uint8_t errorType = (errorWord >> sipixelconstants::ROC_shift) & sipixelconstants::ERROR_mask;
     if (errorType < 25)
       return 0;
+    printf("errorType;%d;%d;%d;%d;%d;%d\n",errorType,errorWord,fedId+1200,link,sipixelconstants::ROC_shift,sipixelconstants::ERROR_mask);
     bool errorFound = false;

and for CPU side adding inRecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu something like:

diff --git a/EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc b/EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc
index 9bde98bef92..10cd0154d6d 100644
--- a/EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc
+++ b/EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc
@@ -27,8 +28,11 @@ bool ErrorChecker::checkROC(bool& errorsInEvent,
                             Word32& errorWord,
                             SiPixelFormatterErrors& errors) const {
   int errorType = (errorWord >> ROC_shift) & ERROR_mask;
+
   if LIKELY (errorType < 25)
     return true;
+  
+  printf("errorType;%d;%d;%d;%d;%d;%d\n",errorType,errorWord,fedId,(errorWord >> LINK_shift) & LINK_mask,ROC_shift,ERROR_mask);

the GPU output has a couple of extra lines:

3358a3359
> errorType;29;1201668113;1208;17
3359a3361
> errorType;29;329252868;1227;4

Now why this is happening still I don't know. But the difference is there and it's causing, in this event, modules 303083540 to be not active when running on GPU while it is when running on CPU.

VinInn commented 1 year ago

here there is an early return https://cmssdt.cern.ch/dxr/CMSSW/source/EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc#55 here not https://cmssdt.cern.ch/dxr/CMSSW/source/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu#207

maybe also for other cases

mmusich commented 1 year ago

this change alone (which if I am not mistaken would put the FED error zoology treatment on par between GPU and CPU)

diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
index 293d4422e84..2f728fe024e 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -189,13 +189,13 @@ namespace pixelgpudetails {
       case (26): {
         if constexpr (debug)
           printf("Gap word found (errorType = 26)\n");
-        errorFound = true;
+        errorFound = false;
         break;
       }
       case (27): {
         if constexpr (debug)
           printf("Dummy word found (errorType = 27)\n");
-        errorFound = true;
+        errorFound = false;
         break;
       }
       case (28): {
@@ -208,8 +208,10 @@ namespace pixelgpudetails {
         if constexpr (debug)
           printf("Timeout on a channel (errorType = 29)\n");
         if ((errorWord >> sipixelconstants::OMIT_ERR_shift) & sipixelconstants::OMIT_ERR_mask) {
+          errorFound = false;
           if constexpr (debug)
             printf("...first errorType=29 error, this gets masked out\n");
+          break;
         }
         errorFound = true;
         break;

only gets the GPU count down of 1 (from 82 to 81 passing events). Incidentally I realized we're not monitoring in the DQM the CPU vs GPU differences for the SiPixelDigiErrors (which instead we should, as demonstrated here).

gparida commented 1 year ago

Hi everyone, following up on the comments and using this commit from Marco - I reran the to check the differences between CPU and GPU for Double_Ele* paths. They are back to the 2-3% levels as before.

For example: Before implementing the commit (GPU was accepting 20% more events )

1135985 298 360 64 -2 66 22.15 HLT_DoubleEle6p5_eta1p22_mMax6_v3

Now with the change:

1224276 708 702 5 -11 16 2.26 HLT_DoubleEle6p5_eta1p22_mMax6_v4

Similarly for the other Double_Ele paths.

The details are in this spreadsheet

mmusich commented 1 year ago

@gparida can you please check the proposed fix at https://github.com/cms-sw/cmssw/pull/42010 as well?

gparida commented 1 year ago

Hi everyone, I checked with the PR https://github.com/cms-sw/cmssw/pull/42010 on same Data and the release, With the PR, the differences are reduced by 8-9% level

Before : 1135985 154 190 36 0 36 23.38 HLT_DoubleEle9p5_eta1p22_mMax6_v3

Now with the PR:

915722 221 246 28 -3 31 14.0271 HLT_DoubleEle9p5_eta1p22_mMax6_v4

Similarly for the other DoubleEle* paths. Details are in this spreadsheet.

mmusich commented 1 year ago

For the record there were two additional fixes:

mmusich commented 1 year ago

PR https://github.com/cms-sw/cmssw/pull/42977 was included in CMSSW_13_2_6_patch1, that went online on October 13th 2023 for HI collision runs starting from run 375083. In the subsequent fill n. 9254 no further CPU vs GPU Pixel FED error mismatches were observed in online DQM, see e.g. for run 375110 .

For more information about the effect on the offline pp menu, see https://github.com/cms-sw/cmssw/pull/42978#issuecomment-1762781959.