Closed belforte closed 8 years ago
Looked at this with @mmascher and had a long discussion about it. The reason is that the current TW code exits as soon as it finds one block with no usable locations (https://github.com/dmwm/CRABServer/blob/master/src/python/TaskWorker/Actions/DagmanCreator.py#L670), and there are similar lines below inside the same loop over job groups (https://github.com/dmwm/CRABServer/blob/master/src/python/TaskWorker/Actions/DagmanCreator.py#L627-L730).
Apparently it was put in as a feature, having in mind (at the time) only the case of a dataset with a full replica at a single site that is only temporarily blacklisted (as all sites but T0 are, in the end) plus some partial replicas. We decided (September 2015) that in this case it was better not to submit and have the user try again later (dataset replicas usually complete very quickly) than to submit over a partial dataset without the user being aware. There is ample evidence that a warning at crab status time is ignored, especially in the CRAB API era, where users run crab status inside a script and only check the job status.
We may want to reconsider that decision now that we have a common situation: a dataset which (currently) has some blocks at T0 only, yet users may want to process the fraction already replicated elsewhere.
Need to think about what to do, mostly "which actions will be best communicated to users".
I have come to a conclusion, at least personally:
CRAB should process the available blocks and print this warning:
Dataset processing will be incomplete because N/M blocks are only present at blacklisted site(s) [....]
where N/M and the list of sites [...] should be filled in with the proper values every time.
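The proposed behavior could be sketched roughly as below. This is only a minimal illustration of the logic, not the actual DagmanCreator code; the function name and the block/site data structures are hypothetical.

```python
# Hypothetical sketch of the proposed behavior: instead of aborting the whole
# submission when a block has no usable location, skip that block and collect
# a warning in the format discussed above. Names and data structures are
# illustrative, not the real TaskWorker code.

def select_blocks(blocks, blacklist):
    """blocks: dict mapping block name -> set of sites hosting it."""
    usable, skipped, bad_sites = [], [], set()
    for block, sites in blocks.items():
        locations = sites - blacklist
        if locations:
            usable.append(block)
        else:
            skipped.append(block)
            bad_sites |= sites
    warning = None
    if skipped:
        warning = ("Dataset processing will be incomplete because %d/%d blocks"
                   " are only present at blacklisted site(s) %s" %
                   (len(skipped), len(blocks), sorted(bad_sites)))
    return usable, warning

# Toy example: one block is hosted only at a blacklisted site (T0).
blocks = {
    "block1": {"T2_DE_KIT", "T0_CH_CERN"},
    "block2": {"T0_CH_CERN"},
    "block3": {"T2_DE_KIT"},
}
usable, warning = select_blocks(blocks, blacklist={"T0_CH_CERN"})
print(sorted(usable))  # -> ['block1', 'block3']
print(warning)         # -> warning mentioning 1/3 blocks at ['T0_CH_CERN']
```

Submission would then proceed over the usable blocks only, and the warning string would be attached to the task so it shows up at status time.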
From conversations last week I think this is what most people would also like; if not, please comment here.
@mmascher do you feel like doing the change, or do you want to delegate?
@jmarra13 or @belforte , do we have a way to test this? A dataset with blocks on different sites that we can blacklist?
Is this enough for you @mmascher? It seems that CRAB already does something similar before submission; I mean it runs a validation over the dataset (is it valid/invalid/production? where is the data hosted? and so on).
I took the same dataset found here https://github.com/dmwm/CRABServer/issues/5241#issuecomment-229445484
that could do. Summarizing, that's dataset /SinglePhoton/Run2016B-PromptReco-v2/RECO. But it is currently complete neither at CERN nor at KIT [1], so I am not sure there are blocks at T0 only (which is the use case to test). [1] https://cmsweb.cern.ch/das/request?input=site%20dataset%3D/SinglePhoton/Run2016B-PromptReco-v2/RECO&instance=prod/global&idx=0&limit=10
I tried a couple of datasets yesterday, but it looks like, after T0->T1 transfers were unplugged, this is now a rarer use case :-(
OTOH Marco can unblacklist T0 and blacklist KIT; then this dataset will do.
Thanks, that will do!
Here is the test, I covered 3 cases:
1 - The dataset was not available at the whitelisted site:
jmsilva@lxplus041 ~/workspace/crabValidation/scripts> for t in /afs/cern.ch/work/j/jmsilva/crabValidation/PartialDatasetTest/CMSSW_7_4_7/src/PartialDatasetTest-*/*; do crab status $t; done
CRAB project directory: /afs/cern.ch/work/j/jmsilva/crabValidation/PartialDatasetTest/CMSSW_7_4_7/src/PartialDatasetTest-1/crab_PartialDatasetTest-1-Analysis_Partial_DS_OneSite-L-T_O-T_P-T_IL-F
Task name: 160817_092228:jmsilva_crab_PartialDatasetTest-1-Analysis_Partial_DS_OneSite-L-T_O-T_P-T_IL-F
Task status: SUBMITFAILED
Failure message: The CRAB server backend refuses to send jobs to the Grid scheduler. No locations found for dataset '/Tau/Run2016B-PromptReco-v2/MINIAOD'. (or at least for the part of the dataset that passed the lumi-mask and/or run-range selection). Found 126 (out of 126) blocks present only at blacklisted sites. Global CRAB3 blacklist is set([u'T3_US_Rice', u'T3_US_Wisconsin', u'T2_TR_METU', u'T3_US_BU', u'T3_KR_KISTI', u'T2_PK_NCP', u'T3_CH_CERN_CAF', u'T2_RU_PNPI', u'T3_RU_FIAN', u'T3_US_MIT', u'T3_UK_London_UCL', u'T3_US_ANL', u'T3_US_UCD', u'T2_ES_IFCA', u'T3_CO_Uniandes', u'T3_US_Princeton_ARM', u'T3_TW_NTU_HEP', u'T2_MY_SIFIR', u'T3_ES_Oviedo', u'T3_US_NU', u'T3_US_NotreDame', u'T2_RU_SINP', u'T2_CH_CERN_AI', u'T2_RU_ITEP', u'T2_UK_London_Brunel', u'T3_US_UB', u'T3_IN_PUHEP', u'T2_CH_CSCS_HPC', u'T3_IT_Opportunistic', u'T3_US_NEU', u'T3_IT_Napoli', u'T2_EE_Estonia', u'T3_UK_GridPP_Cloud', u'T3_UK_ScotGrid_ECDF', u'T2_CH_CERN_HLT', u'T2_MY_UPM_BIRUNI', u'T2_CH_CERN_Wigner', u'T3_UK_London_RHUL', u'T2_TH_CUNSTDA', u'T3_US_Kansas', u'T3_US_Princeton_ICSE', u'T0_CH_CERN', u'T3_GR_IASA', u'T3_CN_PKU', u'T3_IR_IPM', u'T2_PL_Warsaw', u'T3_RU_MEPhI', u'T2_RU_INR', u'T3_US_JHU', u'T3_BY_NCPHEP', u'T3_US_FSU', u'T3_US_TACC', u'T3_KR_UOS', u'T3_CH_PSI', u'T3_GR_Demokritos']). User whitelist is set(['T2_BR_SPRACE']).
Log file is /afs/cern.ch/work/j/jmsilva/crabValidation/PartialDatasetTest/CMSSW_7_4_7/src/PartialDatasetTest-1/crab_PartialDatasetTest-1-Analysis_Partial_DS_OneSite-L-T_O-T_P-T_IL-F/crab.log
CRAB project directory: /afs/cern.ch/work/j/jmsilva/crabValidation/PartialDatasetTest/CMSSW_7_4_7/src/PartialDatasetTest-1/crab_PartialDatasetTest-1-Analysis_Partial_DS_OneSite-L-T_O-T_P-T_IL-F-DOC-T
Task name: 160817_092236:jmsilva_crab_PartialDatasetTest-1-Analysis_Partial_DS_OneSite-L-T_O-T_P-T_IL-F-DOC-T
Task status: SUBMITFAILED
Failure message: The CRAB server backend refuses to send jobs to the Grid scheduler. No locations found for dataset '/Tau/Run2016B-PromptReco-v2/MINIAOD'. (or at least for the part of the dataset that passed the lumi-mask and/or run-range selection). Found 126 (out of 126) blocks present only at blacklisted sites. Global CRAB3 blacklist is set([u'T3_US_Rice', u'T3_US_Wisconsin', u'T2_TR_METU', u'T3_US_BU', u'T3_KR_KISTI', u'T2_PK_NCP', u'T3_CH_CERN_CAF', u'T2_RU_PNPI', u'T3_RU_FIAN', u'T3_US_MIT', u'T3_UK_London_UCL', u'T3_US_ANL', u'T3_US_UCD', u'T2_ES_IFCA', u'T3_CO_Uniandes', u'T3_US_Princeton_ARM', u'T3_TW_NTU_HEP', u'T2_MY_SIFIR', u'T3_ES_Oviedo', u'T3_US_NU', u'T3_US_NotreDame', u'T2_RU_SINP', u'T2_CH_CERN_AI', u'T2_RU_ITEP', u'T2_UK_London_Brunel', u'T3_US_UB', u'T3_IN_PUHEP', u'T2_CH_CSCS_HPC', u'T3_IT_Opportunistic', u'T3_US_NEU', u'T3_IT_Napoli', u'T2_EE_Estonia', u'T3_UK_GridPP_Cloud', u'T3_UK_ScotGrid_ECDF', u'T2_CH_CERN_HLT', u'T2_MY_UPM_BIRUNI', u'T2_CH_CERN_Wigner', u'T3_UK_London_RHUL', u'T2_TH_CUNSTDA', u'T3_US_Kansas', u'T3_US_Princeton_ICSE', u'T0_CH_CERN', u'T3_GR_IASA', u'T3_CN_PKU', u'T3_IR_IPM', u'T2_PL_Warsaw', u'T3_RU_MEPhI', u'T2_RU_INR', u'T3_US_JHU', u'T3_BY_NCPHEP', u'T3_US_FSU', u'T3_US_TACC', u'T3_KR_UOS', u'T3_CH_PSI', u'T3_GR_Demokritos']). User whitelist is set(['T2_BR_SPRACE']).
Log file is /afs/cern.ch/work/j/jmsilva/crabValidation/PartialDatasetTest/CMSSW_7_4_7/src/PartialDatasetTest-1/crab_PartialDatasetTest-1-Analysis_Partial_DS_OneSite-L-T_O-T_P-T_IL-F-DOC-T/crab.log
2 - The dataset has only a few blocks at the whitelisted site:
CRAB project directory: /afs/cern.ch/work/j/jmsilva/crabValidation/PartialDatasetTest/CMSSW_7_4_7/src/PartialDatasetTest-2/crab_PartialDatasetTest-2-Analysis_Partial_DS_OneSite-L-T_O-T_P-T_IL-F
Task name: 160817_100749:jmsilva_crab_PartialDatasetTest-2-Analysis_Partial_DS_OneSite-L-T_O-T_P-T_IL-F
Grid scheduler: crab3@vocms0112.cern.ch
Task status: SUBMITTED
Dashboard monitoring URL: http://dashb-cms-job.cern.ch/dashboard/templates/task-analysis/#user=jmsilva&table=Mains&pattern=160817_100749%3Ajmsilva_crab_PartialDatasetTest-2-Analysis_Partial_DS_OneSite-L-T_O-T_P-T_IL-F
Jobs status: running 100.0% (1/1)
No publication information available yet
Log file is /afs/cern.ch/work/j/jmsilva/crabValidation/PartialDatasetTest/CMSSW_7_4_7/src/PartialDatasetTest-2/crab_PartialDatasetTest-2-Analysis_Partial_DS_OneSite-L-T_O-T_P-T_IL-F/crab.log
CRAB project directory: /afs/cern.ch/work/j/jmsilva/crabValidation/PartialDatasetTest/CMSSW_7_4_7/src/PartialDatasetTest-2/crab_PartialDatasetTest-2-Analysis_Partial_DS_OneSite-L-T_O-T_P-T_IL-F-DOC-T
Task name: 160817_100756:jmsilva_crab_PartialDatasetTest-2-Analysis_Partial_DS_OneSite-L-T_O-T_P-T_IL-F-DOC-T
Grid scheduler: crab3@vocms0106.cern.ch
Task status: SUBMITTED
Dashboard monitoring URL: http://dashb-cms-job.cern.ch/dashboard/templates/task-analysis/#user=jmsilva&table=Mains&pattern=160817_100756%3Ajmsilva_crab_PartialDatasetTest-2-Analysis_Partial_DS_OneSite-L-T_O-T_P-T_IL-F-DOC-T
Jobs status: running 100.0% (1/1)
No publication information available yet
Log file is /afs/cern.ch/work/j/jmsilva/crabValidation/PartialDatasetTest/CMSSW_7_4_7/src/PartialDatasetTest-2/crab_PartialDatasetTest-2-Analysis_Partial_DS_OneSite-L-T_O-T_P-T_IL-F-DOC-T/crab.log
3 - Task with a user input file whose files are split over 2 whitelisted sites:
CRAB project directory: /afs/cern.ch/work/j/jmsilva/crabValidation/PartialDatasetTest/CMSSW_7_4_7/src/PartialDatasetTest-2/crab_PartialDatasetTest-2-Analysis_Partial_DS_TwoSites-L-T_O-T_P-T_IL-F
Task name: 160817_102435:jmsilva_crab_PartialDatasetTest-2-Analysis_Partial_DS_TwoSites-L-T_O-T_P-T_IL-F
Grid scheduler: crab3@vocms0112.cern.ch
Task status: SUBMITTED
Dashboard monitoring URL: http://dashb-cms-job.cern.ch/dashboard/templates/task-analysis/#user=jmsilva&table=Mains&pattern=160817_102435%3Ajmsilva_crab_PartialDatasetTest-2-Analysis_Partial_DS_TwoSites-L-T_O-T_P-T_IL-F
Jobs status: idle 100.0% (5/5)
No publication information available yet
Log file is /afs/cern.ch/work/j/jmsilva/crabValidation/PartialDatasetTest/CMSSW_7_4_7/src/PartialDatasetTest-2/crab_PartialDatasetTest-2-Analysis_Partial_DS_TwoSites-L-T_O-T_P-T_IL-F/crab.log
CRAB project directory: /afs/cern.ch/work/j/jmsilva/crabValidation/PartialDatasetTest/CMSSW_7_4_7/src/PartialDatasetTest-2/crab_PartialDatasetTest-2-Analysis_Partial_DS_TwoSites-L-T_O-T_P-T_IL-F-DOC-T
Task name: 160817_102444:jmsilva_crab_PartialDatasetTest-2-Analysis_Partial_DS_TwoSites-L-T_O-T_P-T_IL-F-DOC-T
Grid scheduler: crab3@vocms0112.cern.ch
Task status: SUBMITTED
Dashboard monitoring URL: http://dashb-cms-job.cern.ch/dashboard/templates/task-analysis/#user=jmsilva&table=Mains&pattern=160817_102444%3Ajmsilva_crab_PartialDatasetTest-2-Analysis_Partial_DS_TwoSites-L-T_O-T_P-T_IL-F-DOC-T
Jobs status: idle 80.0% (4/5) running 20.0% (1/5)
No publication information available yet
Log file is /afs/cern.ch/work/j/jmsilva/crabValidation/PartialDatasetTest/CMSSW_7_4_7/src/PartialDatasetTest-2/crab_PartialDatasetTest-2-Analysis_Partial_DS_TwoSites-L-T_O-T_P-T_IL-F-DOC-T/crab.log
Hi @mmascher, I caught this error on the crabserver:
[18/Aug/2016:14:37:54] RESTSQL:qOXNoZfdFkyj release with rollback
[18/Aug/2016:14:37:54] RESTSQL:qOXNoZfdFkyj RELEASED cmsweb_analysis_preprod@devdb11 timeout=300 inuse=0 idle=2
[18/Aug/2016:14:38:02] SERVER REST ERROR WMCore.REST.Error.InvalidParameter 9ceae5da41fdced9c3ae7e4279bda9cd (Invalid input parameter)
[18/Aug/2016:14:38:02] Traceback (most recent call last):
[18/Aug/2016:14:38:02]   File "/data/srv/beHG1608b/sw/slc6_amd64_gcc493/cms/crabserver/3.3.1608.proxy_test/lib/python2.7/site-packages/WMCore/REST/Server.py", line 701, in default
[18/Aug/2016:14:38:02]     return self._call(RESTArgs(list(args), kwargs))
[18/Aug/2016:14:38:02]   File "/data/srv/beHG1608b/sw/slc6_amd64_gcc493/cms/crabserver/3.3.1608.proxy_test/lib/python2.7/site-packages/WMCore/REST/Server.py", line 772, in _call
[18/Aug/2016:14:38:02]     v(apiobj, request.method, api, param, safe)
[18/Aug/2016:14:38:02]   File "/data/srv/beHG1608b/sw/slc6_amd64_gcc493/cms/crabserver/3.3.1608.proxy_test/lib/python2.7/site-packages/CRABInterface/RESTTask.py", line 38, in validate
[18/Aug/2016:14:38:02]     validate_str("warning", param, safe, RX_TEXT_FAIL, optional=True)
[18/Aug/2016:14:38:02]   File "/data/srv/beHG1608b/sw/slc6_amd64_gcc493/cms/crabserver/3.3.1608.proxy_test/lib/python2.7/site-packages/WMCore/REST/Validation.py", line 121, in validate_str
[18/Aug/2016:14:38:02]     _validate_one(argname, param, safe, _check_str, optional, rx, custom_err)
[18/Aug/2016:14:38:02]   File "/data/srv/beHG1608b/sw/slc6_amd64_gcc493/cms/crabserver/3.3.1608.proxy_test/lib/python2.7/site-packages/WMCore/REST/Validation.py", line 84, in _validate_one
[18/Aug/2016:14:38:02]     safe.kwargs[argname] = checker(argname, val, *args)
[18/Aug/2016:14:38:02]   File "/data/srv/beHG1608b/sw/slc6_amd64_gcc493/cms/crabserver/3.3.1608.proxy_test/lib/python2.7/site-packages/WMCore/REST/Validation.py", line 38, in _check_str
[18/Aug/2016:14:38:02]     raise InvalidParameter(return_message("Incorrect '%s' parameter" % argname, custom_err))
[18/Aug/2016:14:38:02] InvalidParameter: InvalidParameter 9ceae5da41fdced9c3ae7e4279bda9cd [HTTP 400, APP 302, MSG 'Invalid input parameter', INFO "Incorrect 'warning' parameter", ERR None]
[18/Aug/2016:14:38:02] vocms0132.cern.ch 128.142.136.142 "POST /crabserver/preprod/task HTTP/1.1" 400 Bad Request [data: 20893 in 734 out 3036 us ] [auth: OK "/C=BR/O=ANSP/OU=ANSPGrid CA/OU=People/CN=Jadir Marra da Silva" "" ] [ref: "" "CRABClient/0.0.0" ]
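For context: the traceback shows the POST being rejected because the 'warning' parameter did not match the server-side regexp (RX_TEXT_FAIL) in WMCore's validate_str. A plausible cause is that the new, much longer warning text (carrying the full blacklist repr) either exceeds the allowed length or contains characters the pattern rejects. Below is a minimal illustration of this kind of validation; the pattern used here is a made-up stand-in, not the real RX_TEXT_FAIL.

```python
import re

# Hypothetical stand-in for WMCore's RX_TEXT_FAIL: a limited character set
# and a bounded length. The real pattern lives in WMCore/CRABInterface;
# this one exists only to illustrate the failure mode.
RX_TEXT_FAIL = re.compile(r"^[A-Za-z0-9 .,/'\[\]()_-]{1,2000}$")

class InvalidParameter(Exception):
    pass

def validate_str(argname, value, rx):
    # Mimics the shape of WMCore.REST.Validation.validate_str: reject any
    # value that does not fully match the expected pattern.
    if not rx.match(value):
        raise InvalidParameter("Incorrect '%s' parameter" % argname)
    return value

# A short warning like the old one passes validation.
short = "Found 126 (out of 126) blocks present only at blacklisted sites."
validate_str("warning", short, RX_TEXT_FAIL)

# A much longer warning (e.g. one embedding the whole site blacklist)
# exceeds the length bound and is rejected with the same error as the log.
too_long = "x" * 5000
try:
    validate_str("warning", too_long, RX_TEXT_FAIL)
except InvalidParameter as exc:
    print(exc)  # -> Incorrect 'warning' parameter
```

If this is indeed the cause, the fix would be on the server side (relaxing the 'warning' validation) or on the message side (shortening the site list in the warning text).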
There was such a bug long ago; it was then fixed, yet it seems to be back. Needs investigation. See: https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/1809.html