cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Online-DQM occasionally not picking up the correct IOVs ? #45714

Open missirol opened 4 weeks ago

missirol commented 4 weeks ago

There have been cases in recent weeks/months where strange discrepancies were observed in the online-DQM outputs at P5.

In both examples, the discrepancies disappeared after a new run was started.

At face value, both examples seem compatible with the cmsRun jobs in the online-DQM nodes not picking up the latest (and correct) IOVs, using instead older ones and thus leading to mismatches between real and emulated data in DQM outputs.
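To illustrate the hypothesis above, here is a minimal sketch in plain Python of how a job that holds an outdated list of IOVs would resolve a run to an older payload. It does not use any CMSSW or conditions-database API; the function `payload_for_run`, the run numbers, and the payload names are all made up for illustration, and the assumption that each IOV is valid from its first run until the next IOV begins is a simplification.

```python
from bisect import bisect_right


def payload_for_run(iovs, run):
    """Return the payload whose IOV covers `run`.

    `iovs` is a list of (first_run, payload) pairs sorted by first_run.
    Each IOV is assumed valid from its first_run until the next IOV starts.
    """
    starts = [first_run for first_run, _ in iovs]
    idx = bisect_right(starts, run) - 1
    if idx < 0:
        raise LookupError(f"no IOV covers run {run}")
    return iovs[idx][1]


# Hypothetical tag contents (run numbers and payload names are invented).
iovs_in_db = [(1000, "payload_A"), (2000, "payload_B"), (3000, "payload_C")]

# A job whose view of the tag predates the last appended IOV.
stale_snapshot = iovs_in_db[:-1]

run = 3005
fresh = payload_for_run(iovs_in_db, run)      # uses the latest IOV
stale = payload_for_run(stale_snapshot, run)  # falls back to the previous IOV
print(fresh, stale)
```

In this toy picture, the job with the stale view silently uses the previous payload rather than failing, which would show up only as a data/emulator mismatch downstream.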

I think it would be helpful if DQM and AlCa-DB could investigate what happened in these cases (O2O logs, etc.), with help from framework experts if needed.

If the issue is not specific to online-DQM, but generally related to the access to the conditions database, it could potentially affect the HLT jobs running online as well.

Possibly unrelated: a recent HLT crash that may have been caused by a failure to access the correct conditions (in that case, for the beamspot) is being discussed in #45555.

cmsbuild commented 4 weeks ago

A new Issue was created by @missirol.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

makortel commented 4 weeks ago

assign dqm, alca, db

cmsbuild commented 4 weeks ago

New categories assigned: dqm,alca,db

@rvenditti,@syuvivida,@tjavaid,@nothingface0,@antoniovagnerini,@francescobrivio,@saumyaphor4252,@saumyaphor4252,@perrotta,@perrotta,@consuegs,@consuegs you have been requested to review this Pull request/Issue and eventually sign? Thanks

missirol commented 3 weeks ago

@cms-sw/dqm-l2 @cms-sw/db-l2 @cms-sw/alca-l2

Will you try to address this issue ?

missirol commented 2 weeks ago

@cms-sw/dqm-l2 @cms-sw/db-l2 @cms-sw/alca-l2

I am still wondering whether there will be any follow-up. Is the issue not worth investigating? Or should more information be provided?

perrotta commented 2 weeks ago

> @cms-sw/dqm-l2 @cms-sw/db-l2 @cms-sw/alca-l2
>
> Still wondering if there will be some follow-up. Or the issue is not worth investigating ? Or should more info be provided ?

@missirol if the online-DQM jobs consume conditions from an older IOV, I think it is an issue rooted in the online-DQM jobs. If needed (and if I can), I can help with debugging, but @cms-sw/dqm-l2 should first pinpoint which jobs are affected, where the issue could come from, etc.

missirol commented 1 week ago

> if the online-DQM jobs consumes conditions from an older IOV I think is an issue rooted in the online-DQM jobs.

How can we be sure that this only affects the online-DQM [*]? Could it be that the online-DQM is just the first (and only?) place where such an issue would be spotted?

In the cases given in the description, was anything strange noticed on the DB side and/or in the O2O logs? (I understood from https://github.com/cms-sw/cmssw/issues/45555#issuecomment-2293642146 that O2O logs eventually get deleted, so it may now be too late to check.) @cms-sw/db-l2

[*] From the description

> If the issue is not specific to online-DQM, but generally related to the access to the conditions database, it could potentially affect the HLT jobs running online as well.