cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

[GPU] Workflow failures when running the alpaka customization in presence of a `Fake` menu #44119

Closed mmusich closed 7 months ago

mmusich commented 7 months ago

Several workflows, {12434,12450}.{402,403,404,412}, fail in the GPU IB tests in CMSSW_14_1_GPU_X_2024-02-26-2300 with the following error:

DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2023,ENDJOB
We have determined that this is simulation (if not, rerun cmsDriver.py with --data)
with DB:
entry filelist:step1_dasquery.log
found files:  ['/store/relval/CMSSW_13_0_10/RelValTTbar_14TeV/GEN-SIM/130X_mcRun3_2023_realistic_withEarly2023BS_v1_2023-v1/2590000/4a9c4099-1812-4afd-9c94-6f9409595929.root', '/store/relval/CMSSW_13_0_10/RelValTTbar_14TeV/GEN-SIM/130X_mcRun3_2023_realistic_withEarly2023BS_v1_2023-v1/2590000/99db1b20-ec34-4bff-84df-dfffcbdfb184.root', '/store/relval/CMSSW_13_0_10/RelValTTbar_14TeV/GEN-SIM/130X_mcRun3_2023_realistic_withEarly2023BS_v1_2023-v1/2590000/c388e800-ddaa-408d-a2ec-b40a9b8c7a08.root']
Step: DIGI Spec: ['pdigi_valid']
Step: L1 Spec: 
Step: DIGI2RAW Spec: 
Step: HLT Spec: ['@relval2023']
Traceback (most recent call last):
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/bin/el8_amd64_gcc12/cmsDriver.py", line 40, in <module>
    run()
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/bin/el8_amd64_gcc12/cmsDriver.py", line 16, in run
    configBuilder.prepare()
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/Configuration/Applications/python/ConfigBuilder.py", line 2310, in prepare
    self.addStandardSequences()
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/Configuration/Applications/python/ConfigBuilder.py", line 850, in addStandardSequences
    getattr(self,"prepare_"+stepName)(stepSpec = '+'.join(stepSpec))
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/Configuration/Applications/python/ConfigBuilder.py", line 1670, in prepare_HLT
    self.loadAndRemember('HLTrigger/Configuration/HLT_%s_cff' % stepSpec)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/Configuration/Applications/python/ConfigBuilder.py", line 376, in loadAndRemember
    self.process.load(includeFile)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/FWCore/ParameterSet/python/Config.py", line 761, in load
    module = __import__(moduleName)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/HLTrigger/Configuration/python/HLT_Fake2_cff.py", line 237, in <module>
    fragment = customizeHLTforCMSSW(fragment,"Fake2")
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/HLTrigger/Configuration/python/customizeHLTforCMSSW.py", line 262, in customizeHLTforCMSSW
    (alpaka & run3_common).makeProcessModifier(customizeHLTforAlpaka).apply(process)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/FWCore/ParameterSet/python/Config.py", line 1980, in apply
    self.__func(process)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/HLTrigger/Configuration/python/customizeHLTforAlpaka.py", line 917, in customizeHLTforAlpaka
    process = customizeHLTforAlpakaEcalLocalReco(process)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/HLTrigger/Configuration/python/customizeHLTforAlpaka.py", line 908, in customizeHLTforAlpakaEcalLocalReco
    process.HLTDoFullUnpackingEgammaEcalTask = cms.ConditionalTask(process.HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask, process.HLTPreshowerTask)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/FWCore/ParameterSet/python/Config.py", line 1656, in __getattribute__
    return getattr(self.__process, name)
AttributeError: 'Process' object has no attribute 'HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask'

This likely comes from the integration of https://github.com/cms-sw/cmssw/pull/44026, which moved @relval2023 to @Fake2.

mmusich commented 7 months ago

assign hlt, heterogeneous

cmsbuild commented 7 months ago

New categories assigned: hlt,heterogeneous

@Martin-Grunewald,@mmusich,@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 7 months ago

A new Issue was created by @mmusich.

@smuzaffar, @antoniovilela, @Dr15Jones, @makortel, @rappoccio, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mmusich commented 7 months ago

@thomreis FYI

Martin-Grunewald commented 7 months ago

The customisation should check whether HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask actually exists, before messing with it.
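
A minimal sketch of such an existence check, in plain Python. `SimpleNamespace` stands in for the real `cms.Process`, and `missing_attributes` is a hypothetical helper used only for illustration, not something that exists in CMSSW:

```python
from types import SimpleNamespace


def missing_attributes(process, names):
    """Return the entries of `names` that `process` does not define."""
    return [name for name in names if not hasattr(process, name)]


# Stand-in for a Fake menu (assumption): only one of the two tasks is defined.
fake_process = SimpleNamespace(HLTPreshowerTask="preshower")

required = [
    "HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask",
    "HLTPreshowerTask",
]
missing = missing_attributes(fake_process, required)
if missing:
    # The customisation would leave the menu untouched instead of raising.
    print("skipping ECAL task rebuild, missing:", ", ".join(missing))
```

Reporting the missing names, rather than silently skipping, would make it easier to see which parts of the customisation were not applied to a given menu.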

thomreis commented 7 months ago

What menu was used for this?

mmusich commented 7 months ago

> What menu was used for this?

The Fake one; see above. In any case it does not matter. Please provide a fix, since the customization needs to run irrespective of the menu.

Martin-Grunewald commented 7 months ago

Hmm, alternatively, it may be best to remove alpaka from these (failing) 2023 (HLT) workflows (as those are now using the Fake menus). Testing alpaka on Fake HLT menus does not make much sense!

mmusich commented 7 months ago

> Hmm, alternatively, it may be best to remove alpaka from these (failing) 2023 (HLT) workflows (as those are now using the Fake menus). Testing alpaka on Fake HLT menus does not make much sense!

This is what PR https://github.com/cms-sw/cmssw/pull/44075 is going to do. On the other hand the customization should not break in any circumstance IMHO.

mmusich commented 7 months ago

> On the other hand the customization should not break in any circumstance IMHO.

To achieve that, though, all the other customization pieces need to comply as well. Perhaps it is better to remove all years with the Fake menu from the alpaka customization.
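
One way to make every piece comply with the same rule could be a small decorator that skips a customisation step when the menu lacks the modules it needs. This is only a sketch: `requires` and `customize_pixel_vertexing` are hypothetical names, and `SimpleNamespace` stands in for `cms.Process`:

```python
import functools
from types import SimpleNamespace


def requires(*names):
    """Skip the wrapped customisation (return the process untouched)
    if any of `names` is missing from the process."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(process):
            if not all(hasattr(process, name) for name in names):
                return process
            return func(process)
        return wrapper
    return decorator


@requires("hltTrimmedPixelVertices")
def customize_pixel_vertexing(process):
    # Placeholder body (assumption); the real function rebuilds the vertexing tasks.
    process.vertexing_customised = True
    return process


# A Fake-like menu without the module, and a fuller menu that has it.
fake_menu = customize_pixel_vertexing(SimpleNamespace())
full_menu = customize_pixel_vertexing(SimpleNamespace(hltTrimmedPixelVertices=object()))
print(hasattr(fake_menu, "vertexing_customised"), full_menu.vertexing_customised)
```

Note that a decorator like this skips the whole function, so it only fits steps where nothing should be applied when the required modules are absent.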

mmusich commented 7 months ago

assign pdmv

cmsbuild commented 7 months ago

New categories assigned: pdmv

@AdrianoDee,@sunilUIET,@miquork you have been requested to review this Pull request/Issue and eventually sign? Thanks

thomreis commented 7 months ago

Would adding a condition to this line fix this?

if hasattr(process, 'HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask') and hasattr(process, 'HLTPreshowerTask'):
    process.HLTDoFullUnpackingEgammaEcalTask = cms.ConditionalTask(process.HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask, process.HLTPreshowerTask)

Martin-Grunewald commented 7 months ago

This error, yes, I think so.

mmusich commented 7 months ago

> Would adding a condition to this line fix this?

It does, but then it fails with:

DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2023,ENDJOB
We have determined that this is simulation (if not, rerun cmsDriver.py with --data)
with DB:
entry file:step1.root
Step: DIGI Spec: ['pdigi_valid']
Step: L1 Spec: 
Step: DIGI2RAW Spec: 
Step: HLT Spec: ['@relval2023']
Traceback (most recent call last):
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/bin/el8_amd64_gcc12/cmsDriver.py", line 40, in <module>
    run()
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/bin/el8_amd64_gcc12/cmsDriver.py", line 16, in run
    configBuilder.prepare()
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/Configuration/Applications/python/ConfigBuilder.py", line 2310, in prepare
    self.addStandardSequences()
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/Configuration/Applications/python/ConfigBuilder.py", line 850, in addStandardSequences
    getattr(self,"prepare_"+stepName)(stepSpec = '+'.join(stepSpec))
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/Configuration/Applications/python/ConfigBuilder.py", line 1670, in prepare_HLT
    self.loadAndRemember('HLTrigger/Configuration/HLT_%s_cff' % stepSpec)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/Configuration/Applications/python/ConfigBuilder.py", line 376, in loadAndRemember
    self.process.load(includeFile)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/FWCore/ParameterSet/python/Config.py", line 761, in load
    module = __import__(moduleName)
  File "/tmp/musich/CMSSW_14_1_GPU_X_2024-02-26-2300/src/HLTrigger/Configuration/python/HLT_Fake2_cff.py", line 237, in <module>
    fragment = customizeHLTforCMSSW(fragment,"Fake2")
  File "/tmp/musich/CMSSW_14_1_GPU_X_2024-02-26-2300/src/HLTrigger/Configuration/python/customizeHLTforCMSSW.py", line 262, in customizeHLTforCMSSW
    (alpaka & run3_common).makeProcessModifier(customizeHLTforAlpaka).apply(process)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/FWCore/ParameterSet/python/Config.py", line 1980, in apply
    self.__func(process)
  File "/tmp/musich/CMSSW_14_1_GPU_X_2024-02-26-2300/src/HLTrigger/Configuration/python/customizeHLTforAlpaka.py", line 919, in customizeHLTforAlpaka
    process = customizeHLTforAlpakaPixelReco(process)
  File "/tmp/musich/CMSSW_14_1_GPU_X_2024-02-26-2300/src/HLTrigger/Configuration/python/customizeHLTforAlpaka.py", line 809, in customizeHLTforAlpakaPixelReco
    process = customizeHLTforAlpakaPixelRecoVertexing(process)
  File "/tmp/musich/CMSSW_14_1_GPU_X_2024-02-26-2300/src/HLTrigger/Configuration/python/customizeHLTforAlpaka.py", line 732, in customizeHLTforAlpakaPixelRecoVertexing
    process.hltTrimmedPixelVertices 
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/FWCore/ParameterSet/python/Config.py", line 1656, in __getattribute__
    return getattr(self.__process, name)
AttributeError: 'Process' object has no attribute 'hltTrimmedPixelVertices'

thomreis commented 7 months ago

But that is not an issue of the ECAL customisation anymore. It looks like Pixel in this case.

Martin-Grunewald commented 7 months ago

It looks like there are more instances where parts of the alpaka customisation fail on Fake* menus.

#44075 (backport: #44076) would fix it from the workflow use-case side?

mmusich commented 7 months ago

> But that is not an issue of the ECAL customisation anymore. It looks like Pixel in this case.

Right, but it does not solve the issue.

thomreis commented 7 months ago

> Right, but it does not solve the issue.

Well, it would solve this issue. But there seem to be others.

Martin-Grunewald commented 7 months ago

I guess it is faster to get the PRs in, rather than making alpaka customisations failsafe - given that the alpaka customisation will be folded into the ConfDb menus within a couple of weeks? Or are there issues not fixed by the two PRs?

mmusich commented 7 months ago

> Well, it would solve this issue. But there seem to be others.

I edited the issue title to be more inclusive, so no, unfortunately it's not an adequate fix.

mmusich commented 7 months ago

> Or are there issues not fixed by the two PRs?

Getting the PR in will probably remove the failures from the IB tests, but the workflows will remain broken, IIUC.

mmusich commented 7 months ago

diff --git a/HLTrigger/Configuration/python/customizeHLTforAlpaka.py b/HLTrigger/Configuration/python/customizeHLTforAlpaka.py
index d1ca276fb3e..a9bdb2feae0 100644
--- a/HLTrigger/Configuration/python/customizeHLTforAlpaka.py
+++ b/HLTrigger/Configuration/python/customizeHLTforAlpaka.py
@@ -190,6 +190,10 @@ def customizeHLTforAlpakaParticleFlowClustering(process):
             pfRecHits = cms.InputTag("hltPFRecHitSoAProducerHCALCPUSerial"),
             )

+    ## failsafe for fake menus
+    if(not hasattr(process,'hltParticleFlowClusterHBHE')):
+        return process
+
     process.hltLegacyPFClusterProducer = cms.EDProducer("LegacyPFClusterProducer",
             src = cms.InputTag("hltPFClusterSoAProducer"),
             pfClusterParams = cms.ESInputTag("pfClusterParamsESProducer:"),
@@ -725,6 +729,10 @@ def customizeHLTforAlpakaPixelRecoVertexing(process):
         src = cms.InputTag("hltPixelVerticesCPUSerial")
     )

+    ## failsafe for fake menus
+    if(not hasattr(process,'hltTrimmedPixelVertices')):
+        return process
+
     process.HLTRecopixelvertexingTask = cms.ConditionalTask(
         process.HLTRecoPixelTracksTask,
         process.hltPixelVerticesSoA,
@@ -905,7 +913,9 @@ def customizeHLTforAlpakaEcalLocalReco(process):
         if hasattr(process, 'hltEcalUncalibRecHitSoA'):
             delattr(process, 'hltEcalUncalibRecHitSoA')

-    process.HLTDoFullUnpackingEgammaEcalTask = cms.ConditionalTask(process.HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask, process.HLTPreshowerTask)
+        ## failsafe for fake menus
+        if hasattr(process, 'HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask') and hasattr(process, 'HLTPreshowerTask'):
+            process.HLTDoFullUnpackingEgammaEcalTask = cms.ConditionalTask(process.HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask, process.HLTPreshowerTask)

     return process

this seems to be enough to avoid runtime failures.
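
A quick way to check that claim for guards of this kind could be a smoke test that applies each customisation to a bare process and records which ones raise. The sketch below assumes plain stand-ins: `SimpleNamespace` replaces `cms.Process`, and `guarded`, `unguarded` and `run_on_bare_process` are hypothetical names, not CMSSW functions:

```python
from types import SimpleNamespace


def run_on_bare_process(customisations):
    """Apply each customisation to an empty stand-in process and
    return {function name: error message} for those that fail."""
    failures = {}
    for func in customisations:
        try:
            func(SimpleNamespace())
        except AttributeError as err:
            failures[func.__name__] = str(err)
    return failures


def guarded(process):
    # Failsafe for fake menus: return early if the module is absent.
    if not hasattr(process, "hltTrimmedPixelVertices"):
        return process
    process.hltTrimmedPixelVertices.touched = True
    return process


def unguarded(process):
    # No failsafe: raises AttributeError on a menu without the module.
    process.hltTrimmedPixelVertices.touched = True
    return process


failures = run_on_bare_process([guarded, unguarded])
print(sorted(failures))
```

Only the unguarded function ends up in `failures`, which is the behaviour the early-return guards in the diff are meant to provide.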

AdrianoDee commented 7 months ago

I don't think #44075 will fix this in the IBs, since I didn't remove the 2023 wfs but added 2024 ones (if I understood the issue here correctly). As an alternative to the solution here by @mmusich, one could inhibit the *FakeHLT steps for the Alpaka wfs.

mmusich commented 7 months ago

> one could inhibit the *FakeHLT steps for the Alpaka wfs.

This assumes that we are (correctly) running the FakeHLT RECO+DQM sequence in the workflows that run a Fake HLT menu, but this is not in general guaranteed nor enforced (even though we've been trying to be diligent with it). On the other hand, since the whole customization will get reabsorbed soon, I guess it's an academic discussion. I would open a PR now with https://github.com/cms-sw/cmssw/issues/44119#issuecomment-1966489369 to get rid of the failures for the next few weeks and be done with it.

AdrianoDee commented 7 months ago

> Alternatively to the solution here by @mmusich one could inhibit the *FakeHLT steps for the Alpaka wfs.

OK, on second thought this could overcomplicate things. I would protect the customizer with the failsafes.

AdrianoDee commented 7 months ago

> this assumes that we are (correctly) running the FakeHLT RECO+DQM sequence in the workflows that run a Fake HLT menu, but this is not in general guaranteed nor enforced (even though we've been trying to be diligent with it). On the other hand since all the customization thing will get reabsorbed soon, I guess it's an academic discussion.

Agreed, you just beat me to it.

makortel commented 7 months ago

+heterogeneous

mmusich commented 7 months ago

+hlt