Open makortel opened 1 year ago
A new Issue was created by @makortel Matti Kortelainen.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign reconstruction, hlt, heterogeneous
FYI @cms-sw/ecal-dpg-l2
New categories assigned: heterogeneous,hlt,reconstruction
@missirol,@fwyzard,@clacaputo,@makortel,@mandrenguyen,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks
Upon first look, I'm pretty confused. I ran wf 12434.593 with CMSSW_13_3_GPU_X_2023-08-08-2300 on gpu-c2a02-39-04.cms [0]: no crash, and no log-errors in step2.
This (no crash, no log-errors) seems to match the results of a previous GPU IB [1]. On the other hand, the crash in [2] comes after a long list of log-error messages such as [3].
[0]
CUDA runtime version 11.8, driver version 12.2, NVIDIA driver version 535.86.10
CUDA device 0: Tesla T4 (sm_75)
[1] https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc11/CMSSW_13_3_GPU_X_2023-08-07-2300/pyRelValMatrixLogs/run/12434.593_TTbar_14TeV+2023_Patatrack_FullRecoGPU_Validation/step2_TTbar_14TeV+2023_Patatrack_FullRecoGPU_Validation.log
[2] https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc11/CMSSW_13_3_GPU_X_2023-08-08-2300/pyRelValMatrixLogs/run/12434.593_TTbar_14TeV+2023_Patatrack_FullRecoGPU_Validation/step2_TTbar_14TeV+2023_Patatrack_FullRecoGPU_Validation.log
[3]
%MSG-e EcalRecHitError: EcalRecHitProducer:hltEcalRecHit 09-Aug-2023 09:49:08 CEST Run: 1 Event: 105
No intercalib const found for xtal 0! something wrong with EcalIntercalibConstants in your DB?
%MSG
%MSG-e EcalLaserDbService: EcalRecHitProducer:hltEcalRecHit 09-Aug-2023 09:49:08 CEST Run: 1 Event: 105
DetId is NOT in ECAL
%MSG
I'd guess this is one of those random crashes. On a quick look I didn't see any relevant changes between CMSSW_13_3_GPU_X_2023-08-08-2300 and CMSSW_13_3_X_2023-08-07-2300 (in the latter all workflows succeeded). In CMSSW_13_3_X_2023-08-07-2300 the 12434.593 step 2 log did not contain the "No intercalib const found" messages.
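To make that kind of log comparison less manual, the error blocks in a step log can be tallied per category and module. The following is a minimal sketch, assuming the log uses the `%MSG-e <category>: <module> ...` header format shown in excerpt [3]; the function name `count_error_categories` and the `sample` text are illustrative and not part of CMSSW or cms-bot.

```python
import re
from collections import Counter

# Header line of an error block, as it appears in excerpt [3]:
#   %MSG-e <category>: <module type>:<module label> <timestamp> ...
# The closing "%MSG" line does not match because it lacks the "-e" suffix.
HEADER = re.compile(r"^%MSG-e\s+(?P<category>\w+):\s+(?P<module>\S+)")


def count_error_categories(log_text):
    """Count error blocks per (category, module) pair (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = HEADER.match(line)
        if match:
            counts[(match.group("category"), match.group("module"))] += 1
    return counts


# Sample text copied from excerpt [3] of this thread.
sample = """\
%MSG-e EcalRecHitError: EcalRecHitProducer:hltEcalRecHit 09-Aug-2023 09:49:08 CEST Run: 1 Event: 105
No intercalib const found for xtal 0! something wrong with EcalIntercalibConstants in your DB?
%MSG
%MSG-e EcalLaserDbService: EcalRecHitProducer:hltEcalRecHit 09-Aug-2023 09:49:08 CEST Run: 1 Event: 105
DetId is NOT in ECAL
%MSG
"""

for (category, module), n in sorted(count_error_categories(sample).items()):
    print(f"{n:6d}  {category}  {module}")
```

Running the same function over the step2 logs of two IBs and comparing the counters would show at a glance whether a category such as `EcalRecHitError` appears in only one of them.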
A detId 0 should not exist. But where it came from is hard to say if the crash is not reproducible.
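For context on why a raw id of 0 produces both messages in [3], here is a small sketch of how a 32-bit DetId is unpacked. It assumes the usual CMSSW `DetId` layout (detector in bits 28-31, subdetector in bits 25-27, with `Ecal` as detector 3); these constants are quoted from memory and should be checked against `DataFormats/DetId/interface/DetId.h`. The helper names are illustrative only.

```python
# Assumed DetId bit layout (verify against DataFormats/DetId/interface/DetId.h).
DET_OFFSET = 28      # detector field starts at bit 28
DET_MASK = 0xF       # 4 bits wide
SUBDET_OFFSET = 25   # subdetector field starts at bit 25
SUBDET_MASK = 0x7    # 3 bits wide

ECAL = 3             # assumed value of DetId::Ecal
ECAL_BARREL = 1      # assumed value of the EcalBarrel subdetector


def detector(raw_id):
    """Extract the detector field from a raw 32-bit id."""
    return (raw_id >> DET_OFFSET) & DET_MASK


def subdetector(raw_id):
    """Extract the subdetector field from a raw 32-bit id."""
    return (raw_id >> SUBDET_OFFSET) & SUBDET_MASK


def is_ecal(raw_id):
    """True if the detector field identifies ECAL."""
    return detector(raw_id) == ECAL


# A raw id of 0 has detector field 0, so it cannot belong to ECAL.
print(detector(0), subdetector(0), is_ecal(0))

# Any barrel crystal id carries the ECAL and barrel fields in its top bits.
barrel_base = (ECAL << DET_OFFSET) | (ECAL_BARREL << SUBDET_OFFSET)
print(hex(barrel_base), is_ecal(barrel_base))
```

Under these assumptions, an all-zero id points to an uninitialized or corrupted rechit rather than to a bad entry in the conditions database.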
Workflow 12434.593 step 2 segfaulted in CMSSW_13_3_GPU_X_2023-08-08-2300 on el8_amd64_gcc11 + NVIDIA A100-PCIE-40GB
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc11/CMSSW_13_3_GPU_X_2023-08-08-2300/pyRelValMatrixLogs/run/12434.593_TTbar_14TeV+2023_Patatrack_FullRecoGPU_Validation/step2_TTbar_14TeV+2023_Patatrack_FullRecoGPU_Validation.log#/