Closed perrotta closed 1 year ago
A new Issue was created by @perrotta Andrea Perrotta.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign dqm,externals
New categories assigned: dqm,externals
@jfernan2,@ahmad3213,@micsucmed,@iarspider,@rvenditti,@smuzaffar,@emanueleusai,@syuvivida,@aandvalenzuela,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks
I have reproduced the issue, also with CMSSW_12_6_X_2022-10-04-1100 (no idea why it didn't fail in the IBs). However I don't know how to fix it, we need to wait until @smuzaffar is back.
For the time being, I have just reproduced the error in CMSSW_12_3_X_2022-09-30-1100 (after changing the input dataset in https://github.com/cms-sw/cmssw/blob/master/DQM/Integration/python/config/unittestinputsource_cfi.py#L41 to avoid the xrootd error), but we have no idea what causes it. I tried running a couple of DQM clients without the unit test, and they work properly.
Could it be that the dataset /ExpressCosmics/Commissioning2021-Express-v1/FEVT was recently deleted and xrootd can no longer find the file? Note that we have cached these files in the ibeos area, but one needs to use protocol=ibeos to access them, e.g. the following works:
edmFileUtil --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=ibeos --events /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root
Another solution is to backport the SITECONFIG_PATH changes https://github.com/cms-sw/cmssw/pull/37278#issuecomment-1249494617 to production releases, e.g. the 12.x/11.x release cycles.
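As a rough illustration of the edmFileUtil invocation with protocol=ibeos quoted above, here is a minimal Python sketch that only assembles the argument list. The catalog path and option names are copied from the command in this thread; the helper names `catalog_url` and `edm_file_util_args` are hypothetical and not part of CMSSW.

```python
import shlex

# Catalog file as quoted in the thread; only the ?protocol=... suffix varies.
CATALOG_FILE = "/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml"


def catalog_url(protocol):
    """Return the --catalog value for the given protocol (hypothetical helper)."""
    return f"file:{CATALOG_FILE}?protocol={protocol}"


def edm_file_util_args(lfn, protocol="ibeos"):
    """Build the edmFileUtil argument list used in the thread (hypothetical helper)."""
    return ["edmFileUtil", "--catalog", catalog_url(protocol), "--events", lfn]


if __name__ == "__main__":
    lfn = ("/store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/"
           "000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root")
    # Print a shell-quoted command line; nothing is executed here.
    print(shlex.join(edm_file_util_args(lfn)))
```

Switching `protocol` to `"xrootd"` would give the catalog value that the unit test used before, which makes the difference between the two setups a single argument.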
@makortel, @nhduongvn, @stlammel during the Core SW meeting we decided to backport the https://github.com/cms-sw/cmssw/pull/37278 changes to older release cycles too. Do you see any issues with doing this? I am not sure if all sites are ready and already have the new data catalogs from rucio.
during Core SW meeting we decided to backport #37278 changes to older release cycles too. Do you see any issues doing this ?
Yes, that is the plan (see https://github.com/cms-sw/cmssw/pull/37278#issuecomment-1074259843).
Do you see any issues doing this ?
We need to be sure that the backports won't cause trouble in the old release cycles. I had earlier collected the list of fixes that need to be included in the backport in https://github.com/cms-sw/cmssw/pull/37278#issuecomment-1249494617, and this week a new issue with the subsite treatment in the site-local-config.xml was reported in https://cms-talk.web.cern.ch/t/crab-test-cmssw-12-6-x-invalid-site-local-config/15423/17. I understand that @nhduongvn will open a PR for the fix soon.
I am not sure if all sites are ready and already have new data catalogs from rucio
That was actually my precondition for signing #37278 that @stlammel confirmed in https://github.com/cms-sw/cmssw/pull/37278#issuecomment-1115299198 (although with 12_6_0_pre2 reality turned out to be more complicated).
So, there was a campaign earlier this year to get storage.json files in place for all sites. Two sites had held out, and the files were put in place when this was discovered several weeks ago, as Matti wrote. During the sub-site issue last week I found obsolete entries at two sites, and they were corrected. The SAM test to check SITECONF is ready and will go into production with the next token update. This should detect inconsistencies before users do. (I didn't regard this as high priority, as we don't have this for the current SITECONF files either, but their being active reveals issues promptly.) I would release CMSSW_12_6 and make sure everything is fine before the backport to other releases.
Hi, All,
This still needs attention, is it still the case that @nhduongvn is preparing a fix here?
Hi Sal, all, The fix was provided and merged: https://github.com/cms-sw/cmssw/pull/39727
Thanks @nhduongvn, but we still need backports to 12_5 and 12_4. @makortel, is there an update on that?
Otherwise, can we just move to a more recent file for the DQM checks and bypass this entirely to just use a more recent run that's still available? @cms-sw/dqm-l2 ?
I would release CMSSW_12_6, make sure everything is fine before the backport of other releases.
@stlammel we won't release 12_6 until December, we can't really leave the IBs broken for 2 months.
Hello Sal, @rappoccio, I am a bit confused: the old versions, including 12_4 and 12_5, should work fine without the backport. Only the 12_6 pre-releases are broken, and the next pre-release will fix this. Thanks,
Given the trouble we've had with https://github.com/cms-sw/cmssw/pull/37278 I'm not comfortable backporting it (and all the necessary fixes) to 12_4_X or 12_5_X until the data taking is over (to avoid any risk for Tier0).
That said, I think the unit tests would get fixed by just dropping the --catalog option to edmFileUtil, i.e. backporting just https://github.com/cms-sw/cmssw/pull/39266 . @smuzaffar The test machinery still sets CMS_PATH=/cvmfs/cms-ib.cern.ch, right? If that is the case, edmFileUtil will find the right storage.xml. I just tested that
CMS_PATH=/cvmfs/cms-ib.cern.ch edmFileUtil --events /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root
succeeds in CMSSW_12_5_X_2022-10-21-1100.
That said, I think the unit tests would get fixed by just dropping the --catalog option to edmFileUtil, i.e. backporting just https://github.com/cms-sw/cmssw/pull/39266 .
Based on my private test[^1], this won't be sufficient to fix the unit tests.
[^1]:
    cmsrel CMSSW_12_5_X_2022-10-21-1100
    cd CMSSW_12_5_X_2022-10-21-1100/src/
    cmsenv
    git cms-addpkg DQM/Integration
    git cherry-pick 9a056d437411de96fc23edd6948539c0fbe0d166
    scramv1 b -j 20
    cd DQM/Integration/python/clients/
    voms-proxy-init -voms cms
    cmsRun sistrip_dqm_sourceclient-live_cfg.py unitTest=True
Right, dropping the --catalog option does not work for 12.5 and earlier releases. One simple fix is to either use a file known to DAS (accessible via the xrootd redirectors) or use the ibeos protocol, i.e. use --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=ibeos
or use ibeos protocol i.e. use --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=ibeos
this indeed works. I have opened the following PRs:
Let me know if some other cycles could use an update.
I still don't understand why just dropping the --catalog would not work. In CMSSW_12_5_X_2022-10-21-1100 I get
# this is what the test used before
$ edmFileUtil -d --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=xrootd /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root
root://cms-xrd-global.cern.ch//store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root
# with explicit ibeos
$ edmFileUtil -d --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=ibeos /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root
root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root
# dropping --catalog, setting CMS_PATH
$ CMS_PATH=/cvmfs/cms-ib.cern.ch edmFileUtil -d /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root
root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root
The last two cases resolve to exactly the same PFN.
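To make the comparison concrete, here is a small Python sketch of the LFN-to-PFN mapping as it appears in the edmFileUtil output above. It is a simplification, assuming the mapping is a plain prefix substitution; the real rules live in the storage.xml catalog, and the names `PFN_PREFIX` and `lfn_to_pfn` are hypothetical.

```python
# Prefixes read off the edmFileUtil -d output quoted in this thread.
# Assumption: the catalog rule reduces to "prefix + LFN" for /store/ paths.
PFN_PREFIX = {
    "xrootd": "root://cms-xrd-global.cern.ch/",
    "ibeos": "root://eoscms.cern.ch//eos/cms/store/user/cmsbuild",
}


def lfn_to_pfn(lfn, protocol):
    """Map a logical file name to a physical one for the given protocol."""
    if not lfn.startswith("/store/"):
        raise ValueError(f"unexpected LFN: {lfn}")
    return PFN_PREFIX[protocol] + lfn


if __name__ == "__main__":
    lfn = ("/store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/"
           "000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root")
    for protocol in ("xrootd", "ibeos"):
        print(protocol, lfn_to_pfn(lfn, protocol))
```

Under this reading, dropping --catalog with CMS_PATH=/cvmfs/cms-ib.cern.ch simply selects the same rule as the explicit ibeos case, which is why the two PFNs coincide.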
Also running scram b use-ibeos runtests on DQM/Integration seems to work with dropping --catalog.
Anyway, given that https://github.com/cms-sw/cmssw/pull/39829 and https://github.com/cms-sw/cmssw/pull/39830 are already merged, there is probably no practical need to continue the discussion (except maybe why the merge of #39829 did not cause this issue to close).
Also running scram b use-ibeos runtests on DQM/Integration seems to work with dropping --catalog.
This didn't work for me, see https://github.com/cms-sw/cmssw/issues/39669#issuecomment-1287375943
Also running scram b use-ibeos runtests on DQM/Integration seems to work with dropping --catalog.
This didn't work for me, see #39669 (comment)
I guess because the recipe in https://github.com/cms-sw/cmssw/issues/39669#issuecomment-1287375943 did not include overriding the CMS_PATH (that I expect scram b use-ibeos runtests to do, among other things).
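For anyone reproducing a single client by hand, the CMS_PATH override discussed above could be sketched in Python as follows. This is only an illustration under the assumption that setting CMS_PATH is the relevant part; what scram b use-ibeos runtests actually does is not shown in this thread, and the helper names `ib_environment` and `edm_file_util_without_catalog` are hypothetical.

```python
import os

# Value taken from the commands quoted in this thread.
IB_CMS_PATH = "/cvmfs/cms-ib.cern.ch"


def ib_environment(base_env=None):
    """Return a copy of the environment with CMS_PATH pointing at the IB area."""
    env = dict(os.environ if base_env is None else base_env)
    env["CMS_PATH"] = IB_CMS_PATH
    return env


def edm_file_util_without_catalog(lfn):
    """Argument list for edmFileUtil -d with the --catalog option dropped."""
    return ["edmFileUtil", "-d", lfn]


# Inside a CMSSW environment one could then run, for example:
#   import subprocess
#   subprocess.run(edm_file_util_without_catalog(lfn), env=ib_environment(), check=True)
```

Copying the environment rather than modifying os.environ keeps the override local to the one command, mirroring the `CMS_PATH=... edmFileUtil ...` form used earlier in the thread.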
Hmm, yes, dropping --catalog with the correct CMS_PATH also worked for me... no idea why I had the impression that this was not working.
no idea why I had the impression that this was not working.
That's interesting, because when I first tried to drop --catalog (i.e. backporting just https://github.com/cms-sw/cmssw/pull/39266) I also had the distinct impression that scram b use-ibeos runtests wasn't working either; I then switched to single-client tests (as in the recipe of https://github.com/cms-sw/cmssw/issues/39669#issuecomment-1287375943) in order to make the tests run faster.
I am wondering if something else was changed in the meantime, such that scram b use-ibeos runtests now also runs OK.
At any rate, I think that https://github.com/cms-sw/cmssw/pull/39829 is a superior fix because, besides letting the unit tests run, it also allows a single client to be tested in unit test mode directly, which is what developers generally use.
Thanks a lot for the efforts here! I think we can now close the issue as the IBs are now correctly completing. Thanks everyone!
DQM/Integration unit tests are failing in large numbers in all releases but 12_6_X, in all cases apparently independently of the PRs merged in the meantime.
I observed it starting in: CMSSW_12_5_X_2022-10-04-1100, CMSSW_12_4_X_2022-10-03-2300, CMSSW_12_3_X_2022-09-30-1100, CMSSW_12_2_X_2022-10-03-2300.
No such issue (yet?) in the master release. In all cases no PR was merged for the IB in which it first appeared; in particular, we have not been merging anything in 12_2_X and 12_3_X for a while.
A typical log: