dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Single workflow with duplicated lumi sections #9432

Closed amaltaro closed 4 years ago

amaltaro commented 5 years ago

Impact of the bug WMAgent

Describe the bug @vlimant reported via Slack that the workflow pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301 has duplicate lumi sections, even though there are no other workflows (ACDC and the like) writing to the same output datasets. We need to investigate and find out whether:

  a) the agent really assigned the same lumi section to multiple jobs;
  b) or it was another problem (maybe something wrong with a merge);
  c) then find the jobs that produced those lumis and check the logs and the FJR.
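As a starting point for step (c), the check itself is straightforward: build an inverse map from lumi section to output files and flag any lumi that appears more than once. The sketch below is illustrative only; the function name and the input mapping (file -> list of (run, lumi) pairs, as one might extract from DBS or from FrameworkJobReports) are assumptions, not WMCore code.

```python
# Hypothetical sketch: flag (run, lumi) pairs that appear in more than
# one output file of the same dataset. Input format is an assumption.
from collections import defaultdict

def find_duplicate_lumis(file_lumis):
    """Return {(run, lumi): [files...]} for lumis present in more than one file."""
    seen = defaultdict(list)
    for fname, lumis in file_lumis.items():
        for run_lumi in lumis:
            seen[run_lumi].append(fname)
    return {rl: files for rl, files in seen.items() if len(files) > 1}

# Example with made-up file names and lumi numbers:
file_lumis = {
    "fileA.root": [(1, 10), (1, 11)],
    "fileB.root": [(1, 11), (1, 12)],  # lumi (1, 11) appears twice
}
print(find_duplicate_lumis(file_lumis))
# {(1, 11): ['fileA.root', 'fileB.root']}
```

Cross-checking the flagged files against the job logs and FJRs would then show whether the agent assigned the same lumi twice or a merge went wrong.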

How to reproduce it no clue!

Expected behavior No duplicate lumi sections should ever be created!

Additional context and error message Report from Unified: https://cms-unified.web.cern.ch/cms-unified/showlog/?search=task_HIG-RunIISummer19UL17wmLHEGEN-00493&module=checkor&limit=50&size=1000

Dimas page: https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_HIG-RunIISummer19UL17wmLHEGEN-00493

nsmith- commented 4 years ago

The rate has spiked again since Brunel went back to production federation, see https://ggus.eu/index.php?mode=ticket_info&ticket_id=146123 Nevertheless the rate is nonzero at other times. We need the special error code so we can catch this in the act, and also need to ask all redirector operators in the chain for further logs to properly diagnose.

srimanob commented 4 years ago

Dear All, Reading through the whole discussion I wonder,

  1. Is the first set of mismatched files (from Dec 9) already invalidated?
  2. As @nsmith- and @vlimant comment, do you mean we are starting to face the issue again?

Thanks very much.

nsmith- commented 4 years ago

Hi @srimanob

  1. Yes, as announced in https://hypernews.cern.ch/HyperNews/CMS/get/comp-ops/4792.html
  2. Yes, file mismatch

srimanob commented 4 years ago

@nsmith- Thanks very much for the information. When do you plan to circulate the list of invalidations again? The Dec situation worries me a bit, since some of the files may be MiniAOD or NanoAOD that have already been used in analyses.

nsmith- commented 4 years ago

Do you mean when to circulate a list of newly-created output files with this issue? I will compile the list right after https://github.com/dmwm/WMCore/issues/9468 is deployed since at that point no further cases should arise.

amaltaro commented 4 years ago

@nsmith- for the record, once a GUID fix gets deployed to the agents, it will only affect new workflow sandboxes. In other words, workflows already pulled to the agents will not have that fix in their runtime environment.

nsmith- commented 4 years ago

Hi @amaltaro @todor-ivanov I have not seen any 8034 error so far, despite again seeing some cases of mismatch starting in July. Do we know for sure that the feature in #9468 is active in current campaigns? (In particular for merge jobs, which are most at risk of creating corrupted files due to this issue.) wrongfileAAA_timeline2020.pdf One such example: a NanoAOD MC file with run # != 1 https://cmsweb.cern.ch/das/request?instance=prod/global&input=run+file%3D%2Fstore%2Fmc%2FRunIISummer16NanoAODv6%2FDYjetstoee_01234jets_Pt-0ToInf_13TeV-sherpa%2FNANOAODSIM%2FPUMoriond17_Nano25Oct2019_QCDEWNLO_correct_102X_mcRun2_asymptotic_v7-v1%2F30000%2FBBF7367F-ACD2-3947-A27D-24C451DAEDE4.root

makortel commented 4 years ago

@nsmith- According to DAS, that file was produced with CMSSW_10_2_18, but enforceGUIDInFileName was only added in 10_2_20.
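For reference, the guard discussed here is an untracked PoolSource parameter that makes cmsRun fail if a file's internal GUID does not match its file name. The fragment below is a minimal sketch, assuming a standard cmsRun configuration; the process name and the LFN are placeholders, not taken from this workflow.

```python
# Sketch of a cmsRun configuration fragment enabling the GUID check
# (available from CMSSW_10_2_20 onwards, per the comment above).
import FWCore.ParameterSet.Config as cms

process = cms.Process("GUIDCHECK")  # illustrative process name
process.source = cms.Source("PoolSource",
    fileNames=cms.untracked.vstring(
        "/store/mc/SomeCampaign/SomeDataset/NANOAODSIM/file.root"  # placeholder LFN
    ),
    # Fail the job if the file's internal GUID does not match its file
    # name, catching the served-wrong-file mismatches discussed here.
    enforceGUIDInFileName=cms.untracked.bool(True),
)
```

A job reading a mismatched file then aborts early instead of silently producing merged output with wrong content.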

amaltaro commented 4 years ago

Given that nothing else is expected from WMCore, allow me to close this issue. If any of you see a reason to keep it open for a bit longer, please just comment and reopen it. Thanks to everyone who helped get this under control!