dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Refresh test json templates due to removal of input data #11860

Open amaltaro opened 9 months ago

amaltaro commented 9 months ago

Impact of the bug: WMCore validation in general

Describe the bug: As we get started with the HG2401 / WMAgent 2.2.6 validation, many workflows are getting stuck in assigned status. Checking the MSTransferor logs, one can see that many calls to Rucio are not yielding any results, meaning that the data has been completely removed from the grid [1].

How to reproduce it: Inject the relevant test json templates.

Expected behavior: Matching those datasets against our test json templates suggests that the following templates need to be remade/refactored because the RelVal data is no longer available:
test/data/ReqMgr/requests/Integration/SC_ReDigi_Harvest_Prod.json
test/data/ReqMgr/requests/Integration/SC_PY3_PURecyc.json
test/data/ReqMgr/requests/Integration/TaskChain_PUMCRecyc.json

For the non-RelVal data that has been removed (e.g. DQMIO), the following templates need to be remade:
test/data/ReqMgr/requests/DMWM/DQMHarvest_RunWhitelist.json
test/data/ReqMgr/requests/Integration/DQMHarvesting_MultiRun.json
test/data/ReqMgr/requests/Integration/DQMHarvesting.json
test/data/ReqMgr/requests/Integration/DQMHarvesting_LumiMask.json
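The matching of removed datasets against the test templates could be scripted roughly as follows. This is only a sketch: the helper names `find_dataset_refs` and `templates_using` are hypothetical and not part of WMCore, and it assumes that a template references a dataset as a plain string value somewhere in its JSON, without relying on any particular key name.

```python
import json
from pathlib import Path


def find_dataset_refs(node, missing):
    """Recursively collect string values in parsed JSON that name a missing dataset."""
    hits = set()
    if isinstance(node, dict):
        for value in node.values():
            hits |= find_dataset_refs(value, missing)
    elif isinstance(node, list):
        for item in node:
            hits |= find_dataset_refs(item, missing)
    elif isinstance(node, str) and node in missing:
        hits.add(node)
    return hits


def templates_using(template_dir, missing):
    """Map each template path (hypothetical helper) to the missing datasets it references."""
    affected = {}
    for path in sorted(Path(template_dir).rglob("*.json")):
        try:
            content = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            # Skip unreadable or malformed templates instead of aborting the scan
            continue
        hits = find_dataset_refs(content, missing)
        if hits:
            affected[str(path)] = sorted(hits)
    return affected
```

Calling `templates_using("test/data/ReqMgr/requests", missing)` with `missing` set to the container names reported in the log should then list the templates that need new input data.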

Additional context and error message:
[1] Relevant log from MSTransferor

2024-01-14 00:01:55,786:ERROR:PycurlRucio: Failure in getBlocksAndSizeRucio function for container /RelValTTbar_14TeV/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM. Response: {'url': 'http://cms-rucio.cern.ch/dids/cms/dids/search?type=dataset&long=True&name=/RelValTTbar_14TeV/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM%23%2A', 'data': '', 'headers': 'HTTP/1.1 200 OK\r\nDate: Sun, 14 Jan 2024 00:01:55 GMT\r\nContent-Type: application/x-json-stream\r\nContent-Length: 0\r\nConnection: keep-alive\r\nAccess-Control-Allow-Origin: None\r\nAccess-Control-Allow-Headers: None\r\nAccess-Control-Allow-Methods: *\r\nAccess-Control-Allow-Credentials: true\r\nCache-Control: post-check=0, pre-check=0\r\nPragma: no-cache\r\nX-Rucio-Host: cms-rucio.cern.ch\r\n\r\n'}
2024-01-14 00:01:55,786:ERROR:PycurlRucio: Failure in getBlocksAndSizeRucio function for container /RelValZMM_14/CMSSW_12_0_0_pre6-120X_mcRun3_2021_realistic_v4-v1/GEN-SIM. Response: {'url': 'http://cms-rucio.cern.ch/dids/cms/dids/search?type=dataset&long=True&name=/RelValZMM_14/CMSSW_12_0_0_pre6-120X_mcRun3_2021_realistic_v4-v1/GEN-SIM%23%2A', 'data': '', 'headers': 'HTTP/1.1 200 OK\r\nDate: Sun, 14 Jan 2024 00:01:55 GMT\r\nContent-Type: application/x-json-stream\r\nContent-Length: 0\r\nConnection: keep-alive\r\nAccess-Control-Allow-Origin: None\r\nAccess-Control-Allow-Headers: None\r\nAccess-Control-Allow-Methods: *\r\nAccess-Control-Allow-Credentials: true\r\nCache-Control: post-check=0, pre-check=0\r\nPragma: no-cache\r\nX-Rucio-Host: cms-rucio.cern.ch\r\n\r\n'}
2024-01-14 00:01:55,789:ERROR:PycurlRucio: Failure in getBlocksAndSizeRucio function for container /RelValQCD_Pt_600_800_14/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM. Response: {'url': 'http://cms-rucio.cern.ch/dids/cms/dids/search?type=dataset&long=True&name=/RelValQCD_Pt_600_800_14/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM%23%2A', 'data': '', 'headers': 'HTTP/1.1 200 OK\r\nDate: Sun, 14 Jan 2024 00:01:55 GMT\r\nContent-Type: application/x-json-stream\r\nContent-Length: 0\r\nConnection: keep-alive\r\nAccess-Control-Allow-Origin: None\r\nAccess-Control-Allow-Headers: None\r\nAccess-Control-Allow-Methods: *\r\nAccess-Control-Allow-Credentials: true\r\nCache-Control: post-check=0, pre-check=0\r\nPragma: no-cache\r\nX-Rucio-Host: cms-rucio.cern.ch\r\n\r\n'}
2024-01-14 00:01:55,798:ERROR:PycurlRucio: Failure in getBlocksAndSizeRucio function for container /NoBPTX/Run2016F-23Sep2016-v1/DQMIO. Response: {'url': 'http://cms-rucio.cern.ch/dids/cms/dids/search?type=dataset&long=True&name=/NoBPTX/Run2016F-23Sep2016-v1/DQMIO%23%2A', 'data': '', 'headers': 'HTTP/1.1 200 OK\r\nDate: Sun, 14 Jan 2024 00:01:55 GMT\r\nContent-Type: application/x-json-stream\r\nContent-Length: 0\r\nConnection: keep-alive\r\nAccess-Control-Allow-Origin: None\r\nAccess-Control-Allow-Headers: None\r\nAccess-Control-Allow-Methods: *\r\nAccess-Control-Allow-Credentials: true\r\nCache-Control: post-check=0, pre-check=0\r\nPragma: no-cache\r\nX-Rucio-Host: cms-rucio.cern.ch\r\n\r\n'}
2024-01-14 00:01:55,799:ERROR:PycurlRucio: Failure in getBlocksAndSizeRucio function for container /BTagMu/Run2022D-10Dec2022-v1/DQMIO. Response: {'url': 'http://cms-rucio.cern.ch/dids/cms/dids/search?type=dataset&long=True&name=/BTagMu/Run2022D-10Dec2022-v1/DQMIO%23%2A', 'data': '', 'headers': 'HTTP/1.1 200 OK\r\nDate: Sun, 14 Jan 2024 00:01:55 GMT\r\nContent-Type: application/x-json-stream\r\nContent-Length: 0\r\nConnection: keep-alive\r\nAccess-Control-Allow-Origin: None\r\nAccess-Control-Allow-Headers: None\r\nAccess-Control-Allow-Methods: *\r\nAccess-Control-Allow-Credentials: true\r\nCache-Control: post-check=0, pre-check=0\r\nPragma: no-cache\r\nX-Rucio-Host: cms-rucio.cern.ch\r\n\r\n'}
2024-01-14 00:01:55,801:ERROR:PycurlRucio: Failure in getBlocksAndSizeRucio function for container /ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO. Response: {'url': 'http://cms-rucio.cern.ch/dids/cms/dids/search?type=dataset&long=True&name=/ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO%23%2A', 'data': '', 'headers': 'HTTP/1.1 200 OK\r\nDate: Sun, 14 Jan 2024 00:01:55 GMT\r\nContent-Type: application/x-json-stream\r\nContent-Length: 0\r\nConnection: keep-alive\r\nAccess-Control-Allow-Origin: None\r\nAccess-Control-Allow-Headers: None\r\nAccess-Control-Allow-Methods: *\r\nAccess-Control-Allow-Credentials: true\r\nCache-Control: post-check=0, pre-check=0\r\nPragma: no-cache\r\nX-Rucio-Host: cms-rucio.cern.ch\r\n\r\n'}
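
To turn log lines like the ones above into a list of missing containers, something along these lines might help. It is a minimal sketch that assumes the message format stays exactly as shown in this excerpt; the function name `missing_containers` is hypothetical and not part of WMCore.

```python
import re

# Matches the container name in the MSTransferor error lines quoted above
CONTAINER_RE = re.compile(
    r"getBlocksAndSizeRucio function for container (\S+)\. Response:"
)


def missing_containers(log_lines):
    """Return the container names found in the error lines, in order, without duplicates."""
    found = []
    for line in log_lines:
        match = CONTAINER_RE.search(line)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found
```

The resulting list can be compared against the datasets referenced in the test json templates.
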
amaltaro commented 2 months ago

As I haven't made any progress on this for the last month, I am setting it back to the ToDo queue.

amaltaro commented 1 day ago

Our templates have degraded even further and perhaps half of them are now broken. Most common issues are:

Here is a short summary of workflows (templates) and the problems found during Agent 2.3.7 validation:

amaltaro_SC_6Steps_PU_Agent237_Val_241017_144446_265
RootEmbeddedFileSequence no input files specified for secondary input source.

amaltaro_TC_6Tasks_PU_Agent237_Val_241017_144428_6763
RootEmbeddedFileSequence no input files specified for secondary input source.

amaltaro_SC_LHE_Ext_Agent237_Val_241017_144920_9028
amaltaro_TaskChain_LumiMask_multiRun_Agent237_Val_241017_144914_4026
{'arguments': ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'slc6_amd64_gcc481', 'scramv1', 'CMSSW', 'CMSSW_7_2_0', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']}
CMSSW Return code: 7002
locale::facet::_S_create_c_locale name not valid
[Errno 2] No such file or directory: '/srv/.gwms.d/bin/condor_chirp': '/srv/.gwms.d/bin/condor_chirp'
WARNING: There already exists /srv/job/WMTaskSpace/cmsRun1/CMSSW_9_3_7 area for SCRAM_ARCH slc6_amd64_gcc630.

amaltaro_TaskChain_MC_Agent237_Val_241017_144942_8735
SL6 broken workflow

amaltaro_TaskChain_Prod_Agent237_Val_241017_144944_6207
An exception of category 'NoSecondaryFiles' occurred while
RootEmbeddedFileSequence no input files specified for secondary input source.

amaltaro_TC_Drop_Rules_Ext_Agent237_Val_241017_144918_1727
SL6 broken workflow

amaltaro_TC_PY3_Data_LumiList_Agent237_Val_241017_144926_546
An exception of category 'PluginLibraryLoadError' occurred while

amaltaro_TC_PY3_TTbarPU_Agent237_Val_241017_144950_8787
An exception of category 'NoSecondaryFiles' occurred while
RootEmbeddedFileSequence no input files specified for secondary input source.

We will have to work on it ASAP.