Open amaltaro opened 9 months ago
As I haven't made any progress on this for the last month, I am setting it back to the ToDo queue.
Our templates have degraded even further and perhaps half of them are now broken. Most common issues are:
Here is a short summary of workflows (templates) and the problems found during Agent 2.3.7 validation:
amaltaro_SC_6Steps_PU_Agent237_Val_241017_144446_265
RootEmbeddedFileSequence no input files specified for secondary input source.
amaltaro_TC_6Tasks_PU_Agent237_Val_241017_144428_6763
RootEmbeddedFileSequence no input files specified for secondary input source.
amaltaro_SC_LHE_Ext_Agent237_Val_241017_144920_9028
amaltaro_TaskChain_LumiMask_multiRun_Agent237_Val_241017_144914_4026
{'arguments': ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'slc6_amd64_gcc481', 'scramv1', 'CMSSW', 'CMSSW_7_2_0', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']}
CMSSW Return code: 7002
locale::facet::_S_create_c_locale name not valid
[Errno 2] No such file or directory: '/srv/.gwms.d/bin/condor_chirp': '/srv/.gwms.d/bin/condor_chirp'
WARNING: There already exists /srv/job/WMTaskSpace/cmsRun1/CMSSW_9_3_7 area for SCRAM_ARCH slc6_amd64_gcc630.
amaltaro_TaskChain_MC_Agent237_Val_241017_144942_8735
SL6 broken workflow
amaltaro_TaskChain_Prod_Agent237_Val_241017_144944_6207
An exception of category 'NoSecondaryFiles' occurred while
RootEmbeddedFileSequence no input files specified for secondary input source.
amaltaro_TC_Drop_Rules_Ext_Agent237_Val_241017_144918_1727
SL6 broken workflow
amaltaro_TC_PY3_Data_LumiList_Agent237_Val_241017_144926_546
An exception of category 'PluginLibraryLoadError' occurred while
amaltaro_TC_PY3_TTbarPU_Agent237_Val_241017_144950_8787
An exception of category 'NoSecondaryFiles' occurred while
RootEmbeddedFileSequence no input files specified for secondary input source.
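To see which failures dominate, the summary above can be grouped by error line. The following is only a minimal sketch: it assumes the summary is kept as plain text in which workflow names start with `amaltaro_` and every other non-empty line is an error reported for the preceding workflow. The function name `group_by_error` is hypothetical, and the sample reuses three entries from the summary.

```python
from collections import defaultdict


def group_by_error(summary_lines, prefix="amaltaro_"):
    """Map each error line to the list of workflows it was reported under.

    Assumption: a line starting with `prefix` is a workflow name, and any
    other non-empty line is an error belonging to the last workflow seen.
    """
    by_error = defaultdict(list)
    current = None
    for raw in summary_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(prefix):
            current = line
        elif current is not None:
            by_error[line].append(current)
    return dict(by_error)


sample = [
    "amaltaro_SC_6Steps_PU_Agent237_Val_241017_144446_265",
    "RootEmbeddedFileSequence no input files specified for secondary input source.",
    "amaltaro_TaskChain_MC_Agent237_Val_241017_144942_8735",
    "SL6 broken workflow",
    "amaltaro_TC_Drop_Rules_Ext_Agent237_Val_241017_144918_1727",
    "SL6 broken workflow",
]

grouped = group_by_error(sample)
# Print the errors, most frequent first.
for error, workflows in sorted(grouped.items(), key=lambda kv: -len(kv[1])):
    print(len(workflows), error)
```

Grouping on the raw error text keeps the sketch simple; truncated messages would need normalizing before they can be compared reliably.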
We will have to work on it ASAP.
Impact of the bug
WMCore validation in general
Describe the bug
As we get started with the HG2401 / WMAgent 2.2.6 validation, there are many workflows getting stuck in assigned status. Checking the MSTransferor logs, one can see that many calls to Rucio are not yielding any results, meaning that the data has been completely removed from the grid [1].
How to reproduce it
Inject the relevant test json templates
Expected behavior
Matching those datasets against our test json templates suggests that the following templates need to be remade/refactored because the RelVal data is no longer available:
test/data/ReqMgr/requests/Integration/SC_ReDigi_Harvest_Prod.json
test/data/ReqMgr/requests/Integration/SC_PY3_PURecyc.json
test/data/ReqMgr/requests/Integration/TaskChain_PUMCRecyc.json
and for the non-RelVal data that has been removed (e.g. DQMIO), the following templates need to be remade:
test/data/ReqMgr/requests/DMWM/DQMHarvest_RunWhitelist.json
test/data/ReqMgr/requests/Integration/DQMHarvesting_MultiRun.json
test/data/ReqMgr/requests/Integration/DQMHarvesting.json
test/data/ReqMgr/requests/Integration/DQMHarvesting_LumiMask.json
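The matching of removed datasets against the templates could be sketched roughly as below. This is an illustration and not the procedure actually used here: it assumes that input and pileup datasets appear under the keys `InputDataset`, `MCPileup` and `DataPileup` at any nesting level of a template, and that the set of removed datasets has already been extracted from the MSTransferor logs by other means. The helper names, template names and dataset names in the example are all hypothetical.

```python
# Assumed key names under which a template references its input data.
DATASET_KEYS = {"InputDataset", "MCPileup", "DataPileup"}


def collect_datasets(node, found=None):
    """Recursively collect dataset names referenced anywhere in a template."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key in DATASET_KEYS and isinstance(value, str) and value:
                found.add(value)
            else:
                collect_datasets(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_datasets(item, found)
    return found


def templates_to_remake(templates, missing):
    """Return the template names that reference at least one missing dataset.

    templates: dict mapping a template name to its parsed JSON content.
    missing:   set of dataset names found to be gone from the grid.
    """
    return sorted(
        name for name, body in templates.items()
        if collect_datasets(body) & missing
    )


# Hypothetical data, only to show the call pattern.
templates = {
    "A.json": {"createRequest": {"Task1": {"InputDataset": "/Made/Up-v1/RAW"}}},
    "B.json": {"createRequest": {"Step1": {"MCPileup": "/Made/UpPU-v1/PREMIX"}}},
    "C.json": {"createRequest": {"RequestType": "TaskChain"}},
}
missing = {"/Made/UpPU-v1/PREMIX"}
print(templates_to_remake(templates, missing))
```

In practice the templates would be loaded with `json.load` from the files under test/data/ReqMgr/requests/. Walking the whole structure avoids depending on a particular template layout.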
Additional context and error message
[1] Relevant log from MSTransferor