Closed by amaltaro 4 years ago
The rate has spiked again since Brunel went back to the production federation, see https://ggus.eu/index.php?mode=ticket_info&ticket_id=146123. Nevertheless, the rate is nonzero at other times. We need the special error code so we can catch this in the act, and we also need to ask all redirector operators in the chain for further logs to properly diagnose it.
Dear All, Reading through the whole discussion I wonder,
Thanks very much.
Hi @srimanob
@nsmith- Thanks very much for the information. When do you plan to circulate the invalidation list again? The December situation worries me a bit, in case some of the affected files are MiniAOD or NanoAOD that have already been used in analyses.
Do you mean when I plan to circulate a list of newly created output files affected by this issue? I will compile the list right after https://github.com/dmwm/WMCore/issues/9468 is deployed, since at that point no further cases should arise.
@nsmith- for the record, once a GUID fix gets deployed to the agents, it will only affect new workflow sandboxes. In other words, workflows already pulled to the agents will not have that fix in their runtime environment.
Hi @amaltaro @todor-ivanov, I have not seen any 8034 errors so far, despite again seeing some cases of mismatch starting in July. Do we know for sure that the feature in #9468 is active in current campaigns? (In particular for merge jobs, which are most at risk of creating messed-up files due to this issue.)
wrongfileAAA_timeline2020.pdf
One such example, a NanoAOD MC file with run # != 1: https://cmsweb.cern.ch/das/request?instance=prod/global&input=run+file%3D%2Fstore%2Fmc%2FRunIISummer16NanoAODv6%2FDYjetstoee_01234jets_Pt-0ToInf_13TeV-sherpa%2FNANOAODSIM%2FPUMoriond17_Nano25Oct2019_QCDEWNLO_correct_102X_mcRun2_asymptotic_v7-v1%2F30000%2FBBF7367F-ACD2-3947-A27D-24C451DAEDE4.root
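(Editorial aside: a minimal, unofficial sketch of how such files could be flagged from the command line, assuming dasgoclient is in PATH and a valid CMS grid proxy is available; the query mirrors the DAS web query linked above, and the regex parsing is a deliberate shortcut around dasgoclient's exact text output format.)

```python
#!/usr/bin/env python
"""Quick, unofficial check: does a given MC LFN report run numbers other
than 1 in DAS? Assumes dasgoclient in PATH and a valid grid proxy."""
import re
import subprocess


def run_numbers(lfn):
    """Return the run numbers DAS reports for the file ('run file=<LFN>')."""
    out = subprocess.check_output(
        ["dasgoclient", "-query", "run file=%s" % lfn]).decode()
    # Crude parsing: pull all integers out of the output, whatever its shape.
    return sorted({int(tok) for tok in re.findall(r"\d+", out)})


if __name__ == "__main__":
    # The example file quoted in the comment above.
    lfn = ("/store/mc/RunIISummer16NanoAODv6/"
           "DYjetstoee_01234jets_Pt-0ToInf_13TeV-sherpa/NANOAODSIM/"
           "PUMoriond17_Nano25Oct2019_QCDEWNLO_correct_102X_mcRun2_asymptotic_v7-v1/"
           "30000/BBF7367F-ACD2-3947-A27D-24C451DAEDE4.root")
    runs = run_numbers(lfn)
    if runs != [1]:
        print("Suspicious MC file, runs = %s: %s" % (runs, lfn))
```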
@nsmith- According to DAS, that file was produced with CMSSW_10_2_18, but enforceGUIDInFileName only got added in 10_2_20.
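(For reference, an illustrative sketch of how this protection is enabled in a CMSSW configuration; the parameter name comes from the discussion above, while the process name and placeholder file are made up. Double-check against the release actually in use.)

```python
# Illustrative only: enabling the enforceGUIDInFileName protection on a
# PoolSource in a CMSSW release that has it (>= 10_2_20 per the comment above).
import FWCore.ParameterSet.Config as cms

process = cms.Process("GUIDCHECK")  # hypothetical process name
process.source = cms.Source(
    "PoolSource",
    # Placeholder LFN; any input file would do.
    fileNames=cms.untracked.vstring("/store/mc/.../file.root"),
    # Make cmsRun refuse a file whose name does not match its internal GUID,
    # which is the symptom of the misdirected reads discussed in this thread.
    enforceGUIDInFileName=cms.untracked.bool(True),
)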
Given that there is nothing else expected from WMCore, allow me to close this issue. If any of you see a reason to keep it open for a bit longer, please just comment and reopen it. Thanks to everyone who helped us get this under better control!
Impact of the bug: WMAgent
Describe the bug: @vlimant reported via Slack that the workflow pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301 has duplicate lumi sections in its output, even though there are no other workflows (ACDC and the like) writing to the same output datasets. We need to investigate and find out whether:
a) the agent really assigned the same lumi section to multiple jobs, or
b) it was another problem (maybe something wrong with a merge); and
c) find the jobs that produced those lumis and check their logs and FJRs (see the sketch at the end of this report).
How to reproduce it: No clue!
Expected behavior: No duplicate lumi sections should ever be created!
Additional context and error message: Report from Unified: https://cms-unified.web.cern.ch/cms-unified/showlog/?search=task_HIG-RunIISummer19UL17wmLHEGEN-00493&module=checkor&limit=50&size=1000
Dima's page: https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_HIG-RunIISummer19UL17wmLHEGEN-00493
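(Editorial aside, referenced in item c) above: a rough, unofficial sketch for spotting lumi sections that appear in more than one file of an output dataset. It assumes dasgoclient in PATH and a valid grid proxy; the digit-scraping parsing is deliberately crude because dasgoclient's exact text output format is not guaranteed here, and the dataset name in the usage line is a placeholder for one of the real output datasets of the workflow above.)

```python
#!/usr/bin/env python
"""Rough sketch (not a WMCore/Unified tool): find lumi sections claimed by
more than one file of a dataset, to chase case (a) in the report above."""
import re
import subprocess
from collections import defaultdict


def das(query):
    """Run a dasgoclient query and return its raw text output."""
    return subprocess.check_output(["dasgoclient", "-query", query]).decode()


def duplicate_lumis(dataset):
    """Map lumi number -> set of files claiming it, keeping duplicates only.
    Run numbers are ignored, which is fine for MC where run should be 1."""
    claims = defaultdict(set)
    for lfn in das("file dataset=%s" % dataset).split():
        for lumi in re.findall(r"\d+", das("lumi file=%s" % lfn)):
            claims[int(lumi)].add(lfn)
    return {lumi: files for lumi, files in claims.items() if len(files) > 1}


if __name__ == "__main__":
    # Placeholder dataset name: substitute a real output dataset of the
    # pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493 workflow.
    for lumi, files in sorted(duplicate_lumis("/A/B/NANOAODSIM").items()):
        print(lumi, sorted(files))
```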