cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

DQM/Integration unit tests are failing in all releases but 12_6_X #39669

Closed perrotta closed 1 year ago

perrotta commented 2 years ago

DQM/Integration unit tests are failing in large numbers in all releases but 12_6_X, in all cases apparently independently of the PRs merged in the meantime.

I observed it starting in:
CMSSW_12_5_X_2022-10-04-1100
CMSSW_12_4_X_2022-10-03-2300
CMSSW_12_3_X_2022-09-30-1100
CMSSW_12_2_X_2022-10-03-2300

No such issue (yet?) in the master release. In all cases no PR had been merged for the IB in which the failure first appeared; in particular, we have not merged anything in 12_2_X and 12_3_X for a while.

A typical log:

edmFileUtil --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=xrootd --events /store/express/Commissioning2021/ExpressCosmics/FEVT/Express-v1/000/344/518/00000/8ae6d6f6-7859-4089-84dd-4a5d89deb5df.root | tail -n +9 | head -n -5 | awk '{ print $3 }'
Error in <TNetXNGFile::Open>: [ERROR] Server responded with an error: [3011] No servers are available to read the file.

----- Begin Fatal Exception 30-Sep-2022 12:04:01 CEST-----------------------
An exception of category 'ConfigFileReadError' occurred while
   [0] Processing the python configuration file named ./src/DQM/Integration/python/clients/beam_dqm_sourceclient-live_cfg.py
Exception Message:
 unknown python problem occurred.
IndexError: list index out of range

At:
  /cvmfs/cms-ib.cern.ch/nweek-02752/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_X_2022-09-30-1100/python/DQM/Integration/config/unittestinputsource_cfi.py(107): <module>
  <frozen importlib._bootstrap>(228): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(850): exec_module
  <frozen importlib._bootstrap>(695): _load_unlocked
  <frozen importlib._bootstrap>(986): _find_and_load_unlocked
  <frozen importlib._bootstrap>(1007): _find_and_load
  /cvmfs/cms-ib.cern.ch/nweek-02752/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_X_2022-09-30-1100/python/FWCore/ParameterSet/Config.py(722): load
  ./src/DQM/Integration/python/clients/beam_dqm_sourceclient-live_cfg.py(36): <module>

----- End Fatal Exception -------------------------------------------------
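The log above shows an xrootd open failure followed by an IndexError at line 107 of unittestinputsource_cfi.py. One plausible reading, which is an assumption since that line is not shown in this thread, is that the config indexes into a list parsed from the edmFileUtil output, and that list is empty when the file cannot be read. A minimal sketch of a guard for that situation follows; `first_event_field` and the sample text are hypothetical and are not the actual CMSSW code.

```python
def first_event_field(output: str, column: int = 2) -> str:
    """Return one whitespace-separated field from the first non-empty line.

    Hypothetical helper: it raises a descriptive error instead of a bare
    IndexError when the command produced no usable output.
    column=2 is the zero-based equivalent of awk's $3.
    """
    rows = [line.split() for line in output.splitlines() if line.strip()]
    if not rows:
        raise RuntimeError(
            "no event lines in command output; was the input file readable?"
        )
    if len(rows[0]) <= column:
        raise RuntimeError(f"unexpected line format: {rows[0]!r}")
    return rows[0][column]


# Made-up sample text standing in for an event listing (not real output).
sample = "1 1 12345 100\n1 1 12346 101\n"
result = first_event_field(sample)
```

With a guard like this, an unreadable input file would surface as a message about the missing event list rather than as "unknown python problem occurred".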
cmsbuild commented 2 years ago

A new Issue was created by @perrotta Andrea Perrotta.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

perrotta commented 2 years ago

assign dqm,externals

cmsbuild commented 2 years ago

New categories assigned: dqm,externals

@jfernan2,@ahmad3213,@micsucmed,@iarspider,@rvenditti,@smuzaffar,@emanueleusai,@syuvivida,@aandvalenzuela,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

iarspider commented 1 year ago

I have reproduced the issue, also with CMSSW_12_6_X_2022-10-04-1100 (no idea why it didn't fail in the IBs). However, I don't know how to fix it; we need to wait until @smuzaffar is back.

rvenditti commented 1 year ago

For the time being, I just reproduced the error in CMSSW_12_3_X_2022-09-30-1100 (after changing the input dataset in https://github.com/cms-sw/cmssw/blob/master/DQM/Integration/python/config/unittestinputsource_cfi.py#L41 to avoid the xrootd error), but we have no idea why it happens. I tried running a couple of DQM clients without the unit test, and they work properly.

smuzaffar commented 1 year ago

Could it be that dataset /ExpressCosmics/Commissioning2021-Express-v1/FEVT was recently deleted and xrootd can no longer find the file? Note that we have cached these files in the ibeos area, but one needs to use protocol=ibeos to access them; e.g. the following works:

edmFileUtil  --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=ibeos --events /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root

Another solution is to backport the SITECONFIG_PATH changes https://github.com/cms-sw/cmssw/pull/37278#issuecomment-1249494617 to production releases, e.g. the 12.x/11.x release cycles.
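The edmFileUtil command quoted in this comment selects the cached copy purely through the protocol query parameter of the catalog URL. A small sketch of building that argument list so the protocol becomes a single switch; `edm_file_util_args` and `STORAGE_XML` are hypothetical names, and only the `--catalog` and `--events` options already quoted in this thread are used.

```python
# Catalog path copied from the commands quoted in this thread.
STORAGE_XML = "/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml"

# LFN copied from the working example in this comment.
LFN = (
    "/store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/"
    "00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root"
)


def edm_file_util_args(lfn, protocol="ibeos"):
    """Build the edmFileUtil argument list for a given catalog protocol.

    Hypothetical helper; it only assembles strings and runs nothing.
    """
    catalog = f"file:{STORAGE_XML}?protocol={protocol}"
    return ["edmFileUtil", "--catalog", catalog, "--events", lfn]


args = edm_file_util_args(LFN)
```

Inside a CMSSW environment the list could be handed to `subprocess.run(args, check=True)`; outside of one it is only useful for inspecting the command line.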

smuzaffar commented 1 year ago

@makortel, @nhduongvn, @stlammel during the Core SW meeting we decided to backport https://github.com/cms-sw/cmssw/pull/37278 changes to older release cycles too. Do you see any issues doing this? I am not sure if all sites are ready and already have new data catalogs from Rucio.

makortel commented 1 year ago

during Core SW meeting we decided to backport #37278 changes to older release cycles too. Do you see any issues doing this ?

Yes, that is the plan (see https://github.com/cms-sw/cmssw/pull/37278#issuecomment-1074259843).

Do you see any issues doing this ?

We need to be sure that the backports won't cause trouble in the old release cycles. I had earlier collected the list of fixes that need to be included in the backport in https://github.com/cms-sw/cmssw/pull/37278#issuecomment-1249494617, and this week a new issue with the subsite treatment in site-local-config.xml was reported in https://cms-talk.web.cern.ch/t/crab-test-cmssw-12-6-x-invalid-site-local-config/15423/17. As I understand it, @nhduongvn will open a PR for the fix soon.

I am not sure if all sites are ready and already have new data catalogs from rucio

That was actually my precondition for signing #37278 that @stlammel confirmed in https://github.com/cms-sw/cmssw/pull/37278#issuecomment-1115299198 (although with 12_6_0_pre2 reality turned out to be more complicated).

stlammel commented 1 year ago

So, there was a campaign earlier this year to get storage.json files in place for all sites. Two sites had held out, and the files were put in place when this was discovered several weeks ago, as Matti wrote. During the sub-site issue last week I found obsolete entries at two sites, and they were corrected. The SAM test to check SITECONF is ready and will go into production with the next token update. This should detect inconsistencies before users do. (I didn't regard this as high priority, as we don't have this for the current SITECONF files either, but their being active reveals issues promptly.) I would release CMSSW_12_6, make sure everything is fine before the backport of other releases.

rappoccio commented 1 year ago

Hi, All,

This still needs attention. Is it still the case that @nhduongvn is preparing a fix here?

nhduongvn commented 1 year ago

Hi Sal, all, The fix was provided and merged: https://github.com/cms-sw/cmssw/pull/39727

rappoccio commented 1 year ago

Thanks @nhduongvn, but we still need backports to 12_5 and 12_4. @makortel is there some update there?

Otherwise, can we bypass this entirely by moving the DQM checks to a file from a more recent run that's still available? @cms-sw/dqm-l2?

rappoccio commented 1 year ago

I would release CMSSW_12_6, make sure everything is fine before the backport of other releases.

@stlammel we won't release 12_6 until December; we can't really leave the IBs broken for 2 months.

stlammel commented 1 year ago

Hello Sal, @rappoccio, I am a bit confused: the old versions, including 12_4 and 12_5, should work fine without the backport. Only the 12_6 pre-releases are broken and the next pre-release will fix this. Thanks,

makortel commented 1 year ago

Given the trouble we've had with https://github.com/cms-sw/cmssw/pull/37278 I'm not comfortable backporting it (and all the necessary fixes) to 12_4_X or 12_5_X until the data taking is over (to avoid any risk for Tier0).

That said, I think the unit tests would get fixed by just dropping the --catalog option to edmFileUtil, i.e. backporting just https://github.com/cms-sw/cmssw/pull/39266 . @smuzaffar The test machinery still sets CMS_PATH=/cvmfs/cms-ib.cern.ch, right? If that is the case, edmFileUtil will find the right storage.xml. I just tested that

CMS_PATH=/cvmfs/cms-ib.cern.ch edmFileUtil  --events /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root

succeeds in CMSSW_12_5_X_2022-10-21-1100.
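The test above sets CMS_PATH for that single command only. A sketch of doing the same from Python without touching the parent environment; `env_with_cms_path` is a hypothetical helper, and the commented-out call would only work inside a CMSSW environment where edmFileUtil is available.

```python
import os


def env_with_cms_path(cms_path="/cvmfs/cms-ib.cern.ch", base=None):
    """Return a copy of an environment mapping with CMS_PATH overridden.

    Hypothetical helper: `base` defaults to the current process
    environment, which is left unmodified.
    """
    env = dict(os.environ if base is None else base)
    env["CMS_PATH"] = cms_path
    return env


# A small explicit base environment keeps the example deterministic.
env = env_with_cms_path(base={"PATH": "/usr/bin"})

# Inside a CMSSW environment one could then run, for some LFN `lfn`:
# subprocess.run(["edmFileUtil", "--events", lfn], env=env, check=True)
```

Passing the mapping through `env=` mirrors the `VAR=value command` shell form, so a later command in the same session does not inherit the override.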

mmusich commented 1 year ago

That said, I think the unit tests would get fixed by just dropping the --catalog option to edmFileUtil, i.e. backporting just https://github.com/cms-sw/cmssw/pull/39266 .

Based on my private test[^1], this won't be sufficient to fix the unit tests.

[^1]:
cmsrel CMSSW_12_5_X_2022-10-21-1100
cd CMSSW_12_5_X_2022-10-21-1100/src/
cmsenv
git cms-addpkg DQM/Integration
git cherry-pick 9a056d437411de96fc23edd6948539c0fbe0d166
scramv1 b -j 20
cd DQM/Integration/python/clients/
voms-proxy-init -voms cms
cmsRun sistrip_dqm_sourceclient-live_cfg.py unitTest=True

smuzaffar commented 1 year ago

Right, dropping the --catalog option does not work for 12.5 and earlier releases. One simple fix is to either use a file known to DAS (accessible via xrootd redirectors) or use ibeos protocol i.e. use --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=ibeos

mmusich commented 1 year ago

or use ibeos protocol i.e. use --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=ibeos

this indeed works. I have opened the following PRs:

Let me know if some other cycles could use an update.

makortel commented 1 year ago

I still don't understand why just dropping the --catalog would not work. In CMSSW_12_5_X_2022-10-21-1100 I get

# this is what the test used before
$ edmFileUtil -d --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=xrootd /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root
root://cms-xrd-global.cern.ch//store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root

# with explicit ibeos
$ edmFileUtil -d --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=ibeos /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root
root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root

# dropping --catalog, setting CMS_PATH
$ CMS_PATH=/cvmfs/cms-ib.cern.ch edmFileUtil -d /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root
root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root

The last two cases resolve to exactly the same PFN.
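The three outputs above differ only in the prefix placed in front of the LFN. A sketch that reproduces that mapping as plain string manipulation; the prefixes are copied from the output quoted above, while `PREFIXES` and `lfn_to_pfn` are hypothetical names, and this is not how CMSSW resolves catalogs internally.

```python
# Prefixes as they appear in the edmFileUtil -d output quoted above.
PREFIXES = {
    "xrootd": "root://cms-xrd-global.cern.ch/",
    "ibeos": "root://eoscms.cern.ch//eos/cms/store/user/cmsbuild",
}

LFN = (
    "/store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/"
    "00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root"
)


def lfn_to_pfn(lfn, protocol):
    """Prepend the protocol-specific prefix to a logical file name."""
    return PREFIXES[protocol] + lfn


pfn_xrootd = lfn_to_pfn(LFN, "xrootd")
pfn_ibeos = lfn_to_pfn(LFN, "ibeos")
```

Comparing `pfn_ibeos` with the PFN printed when --catalog is dropped and CMS_PATH is set is a quick way to check that both routes end up at the same cached copy.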

Also running scram b use-ibeos runtests on DQM/Integration seems to work with dropping --catalog.

Anyway, given that https://github.com/cms-sw/cmssw/pull/39829 and https://github.com/cms-sw/cmssw/pull/39830 are already merged, there probably isn't a practical need to continue the discussion (except maybe why the merge of #39829 did not cause this issue to close).

mmusich commented 1 year ago

Also running scram b use-ibeos runtests on DQM/Integration seems to work with dropping --catalog.

This didn't work for me, see https://github.com/cms-sw/cmssw/issues/39669#issuecomment-1287375943

makortel commented 1 year ago

Also running scram b use-ibeos runtests on DQM/Integration seems to work with dropping --catalog.

This didn't work for me, see #39669 (comment)

I guess that is because the recipe in https://github.com/cms-sw/cmssw/issues/39669#issuecomment-1287375943 did not include overriding CMS_PATH (which I expect scram b use-ibeos runtests to do, among other things).

smuzaffar commented 1 year ago

Hmm, yes, dropping --catalog with the correct CMS_PATH also worked for me... no idea why I had the impression that this was not working.

mmusich commented 1 year ago

no idea why I had the impression that this was not working.

That's interesting, because when I first tried to drop --catalog (i.e. backporting just https://github.com/cms-sw/cmssw/pull/39266), I had the distinct impression that scram b use-ibeos runtests wasn't working either. I then switched to single-client tests (as in the recipe of https://github.com/cms-sw/cmssw/issues/39669#issuecomment-1287375943) to make the tests run faster. I am wondering whether something else changed in the meantime, such that scram b use-ibeos runtests now also runs OK. At any rate, I think that https://github.com/cms-sw/cmssw/pull/39829 is a superior fix because, besides letting the unit tests run, it also allows a single client to be tested in unit-test mode directly, which is what developers generally do.

rappoccio commented 1 year ago

Thanks a lot for the efforts here! I think we can now close the issue as the IBs are now correctly completing. Thanks everyone!