cms-patatrack / cmssw

CMSSW fork of the Patatrack project
https://patatrack.web.cern.ch/patatrack/index.html
Apache License 2.0

Do not use the ECAL calibrated rechits from the GPU workflow #592

Closed fwyzard closed 3 years ago

fwyzard commented 3 years ago

The ECAL calibrated rechits produced on the GPU are not yet correct. Stop using them in the GPU workflows until they work and have been validated.

fwyzard commented 3 years ago

Validation summary

Reference release CMSSW_11_2_0_pre10 at 6c149b2963ee
Development branch cms-patatrack/CMSSW_11_2_X_Patatrack at 6a192beda960
Testing branch cms-patatrack/CMSSW_11_2_X_Patatrack at 6a192beda960 with PRs:

Validation plots

/RelValTTbar_14TeV/CMSSW_11_2_0_pre7-PU_112X_mcRun3_2021_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_14/CMSSW_11_2_0_pre7-112X_mcRun3_2021_realistic_v8-v2/GEN-SIM-DIGI-RAW

/RelValZEE_14/CMSSW_11_2_0_pre7-112X_mcRun3_2021_realistic_v8-v1/GEN-SIM-DIGI-RAW

Validation plots (CPU vs GPU)

/RelValTTbar_14TeV/CMSSW_11_2_0_pre7-PU_112X_mcRun3_2021_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_14/CMSSW_11_2_0_pre7-112X_mcRun3_2021_realistic_v8-v2/GEN-SIM-DIGI-RAW

/RelValZEE_14/CMSSW_11_2_0_pre7-112X_mcRun3_2021_realistic_v8-v1/GEN-SIM-DIGI-RAW

Throughput plots

/EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53

scan-136.885502.png zoom-136.885502.png scan-136.885512.png zoom-136.885512.png scan-136.885522.png zoom-136.885522.png

logs and nvprof/nvvp profiles

/RelValTTbar_14TeV/CMSSW_11_2_0_pre7-PU_112X_mcRun3_2021_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_14/CMSSW_11_2_0_pre7-112X_mcRun3_2021_realistic_v8-v2/GEN-SIM-DIGI-RAW

/RelValZEE_14/CMSSW_11_2_0_pre7-112X_mcRun3_2021_realistic_v8-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://patatrack.web.cern.ch/patatrack/validation/pulls/4a188869c781252b40b258ed9e5e9128eddef122/log .

thomreis commented 3 years ago

Hi @fwyzard, are there any special permissions needed to see the validation plots? I get 404 or "not found" errors.

fwyzard commented 3 years ago

No - but I need to trigger publishing them by hand...

thomreis commented 3 years ago

In my test, a comparison of uncalibrated RecHits shows agreement between CPU and GPU:
CPU: EcalUncalibratedRecHitsSorted_ecalMultiFitUncalibRecHit_EcalUncalibRecHitsEB_amplitude_cpu
GPU: EcalUncalibratedRecHitsSorted_ecalMultiFitUncalibRecHit_EcalUncalibRecHitsEB_amplitude_gpu

Comparing the RecHits shows differences. More RecHits are found for the CPU version (this includes PR #592, so the same RecHit producer should run for the CPU and GPU WFs):
CPU: EcalRecHitsSorted_ecalRecHit_EcalRecHitsEB_energy_cpu
GPU: EcalRecHitsSorted_ecalRecHit_EcalRecHitsEB_energy_gpu

thomreis commented 3 years ago

The trigger report for the GPU configuration is not what I was expecting though. It seems as if the CPU module also runs for the uncalibrated RecHits:

TrigReport        200        100        100          0          0 ecalMultiFitUncalibRecHit
TrigReport        200        100        100          0          0 ecalMultiFitUncalibRecHitGPU
TrigReport        200        100        100          0          0 ecalMultiFitUncalibRecHitSoA

So perhaps the agreement in the post above actually comes from comparing CPU outputs with CPU outputs.

For the RecHits, on the other hand, the GPU modules do not process any events, as expected.

TrigReport        200        100        100          0          0 ecalRecHit
TrigReport          0          0          0          0          0 ecalRecHitGPU
TrigReport          0          0          0          0          0 ecalRecHitSoA
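A check like the one above can be automated. The following is a minimal sketch in Python, assuming the module lines have the shape shown in this thread: the word TrigReport, five integer counters, then the module label. It does not assume what the individual counters mean, and the helper names `parse_trigreport` and `idle_modules` are made up for illustration. A module is reported as idle only if all of its counters are zero.

```python
def parse_trigreport(text):
    """Map each module label to its tuple of five integer counters.

    Assumes lines of the form: TrigReport <int> <int> <int> <int> <int> <label>.
    Lines that do not match this shape are ignored.
    """
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7 or fields[0] != "TrigReport":
            continue
        try:
            numbers = tuple(int(f) for f in fields[1:6])
        except ValueError:
            continue  # e.g. a header line with column names
        counts[fields[6]] = numbers
    return counts


def idle_modules(counts):
    """Return the sorted labels of modules whose counters are all zero."""
    return sorted(name for name, c in counts.items() if not any(c))


# The six lines quoted in this comment.
report = """\
TrigReport        200        100        100          0          0 ecalMultiFitUncalibRecHit
TrigReport        200        100        100          0          0 ecalMultiFitUncalibRecHitGPU
TrigReport        200        100        100          0          0 ecalMultiFitUncalibRecHitSoA
TrigReport        200        100        100          0          0 ecalRecHit
TrigReport          0          0          0          0          0 ecalRecHitGPU
TrigReport          0          0          0          0          0 ecalRecHitSoA
"""

counts = parse_trigreport(report)
print(idle_modules(counts))
```

On the lines quoted here, only ecalRecHitGPU and ecalRecHitSoA are listed as idle.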

thomreis commented 3 years ago

Looking closer at the configuration, it seems that ecalMultiFitUncalibRecHit is a conversion module from GPU to CPU. So this seems to be OK.

thomreis commented 3 years ago

Since the RecHitProducer is the same for CPU and GPU, the differences in the RecHit energy plot probably come from the inputs to the module. Looking a bit closer at the UncalibRecHits, there are some variables that do show differences between the CPU and the GPU version. Agreement is seen for amplitude and pedestal, while differences are seen for amplitudeError (0 for GPU), jitter (0 for GPU), chi2 (very small), OOTamplitudes, OOTchi2, flags, and aux (0 for GPU). Which of these variables are used by the RecHitProducer?
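A per-variable comparison of this kind could be sketched as below. This is only an illustration in plain Python, not the code used to make the plots: the function name `compare_fields`, the tolerances, and all numerical values are invented, and it assumes that the CPU and GPU hits have already been extracted into per-variable lists in the same order (for example, sorted by detector id).

```python
import math


def compare_fields(cpu, gpu, rel_tol=1e-6, abs_tol=1e-9):
    """Count, per variable, how many hits differ between CPU and GPU.

    cpu and gpu map a variable name to a list of values, one per hit.
    Hits are matched by position, so both lists must be in the same order.
    """
    mismatches = {}
    for field in sorted(cpu.keys() & gpu.keys()):
        a, b = cpu[field], gpu[field]
        if len(a) != len(b):
            raise ValueError(f"{field}: {len(a)} CPU hits vs {len(b)} GPU hits")
        mismatches[field] = sum(
            not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
            for x, y in zip(a, b)
        )
    return mismatches


# Made-up values for three hits, chosen to mimic the pattern described above:
# amplitude and pedestal agree, jitter and amplitudeError are 0 on the GPU.
cpu = {
    "amplitude": [10.5, 3.2, 7.9],
    "pedestal": [200.0, 201.5, 199.8],
    "jitter": [0.1, -0.3, 0.2],
    "amplitudeError": [1.1, 0.9, 1.0],
}
gpu = {
    "amplitude": [10.5, 3.2, 7.9],
    "pedestal": [200.0, 201.5, 199.8],
    "jitter": [0.0, 0.0, 0.0],
    "amplitudeError": [0.0, 0.0, 0.0],
}

result = compare_fields(cpu, gpu)
differing = [field for field, n in result.items() if n > 0]
print(differing)
```

Matching hits by position keeps the sketch short. A real comparison would more likely match hits by detector id, since the CPU and GPU collections can have different sizes, as seen for the calibrated RecHits.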

thomreis commented 3 years ago

Hi @fwyzard what does the error in cuda-memcheck --tool synccheck for the .512 WFs mean? Some issue with the synchronisation?

fwyzard commented 3 years ago

Hi @thomreis, sorry about that. You can disregard the synccheck errors; I believe that they are false positives.

fwyzard commented 3 years ago

"Agreement is seen for amplitude and pedestal, while differences are seen for amplitudeError (0 for GPU), jitter (0 for GPU), chi2 (very small), OOTamplitudes, OOTchi2, flags, and aux (0 for GPU). Which of these variables are used by the RecHitProducer?"

No idea ...