Open sunilUIET opened 7 months ago
cms-bot internal usage
A new Issue was created by @sunilUIET sunil bansal.
@antoniovilela, @makortel, @smuzaffar, @sextonkennedy, @Dr15Jones, @rappoccio can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign xpog
New categories assigned: xpog
@vlimant,@hqucms you have been requested to review this Pull request/Issue and eventually sign? Thanks
Can someone please pull out two files that fail to merge and lead to this failure?
I will let someone from P&R comment on whether we can get such a list @z4027163
There was an upgrade in dCache and some FNAL files were lost. FNAL is trying to verify what was damaged; meanwhile, I can speed this up by invalidating the replicas that you find.
We have an example currently in the production system: task_TOP-Run3Summer22wmLHEGS-00042. The full list of unmerged files can be found in this error report: https://cms-unified.web.cern.ch/cms-unified/report/cmsunified_task_TOP-Run3Summer22wmLHEGS-00042__v1_T_240129_012657_9610 (after the sentence "1562 Files in no block for TOP-Run3Summer22NanoAODv12-00020_0MergeNANOEDMAODSIMoutput").
Here are some examples:
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/002d042a-632e-4ebb-a018-0d5ef283ce6b.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/005c3d94-b6b9-462f-b9f4-d3551e4d4b71.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/006e8d8a-3c73-4125-bfdf-efde00c7b4ca.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/0112efa2-de05-41ea-b4de-683d6a860caf.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/0151db4c-458d-4f79-bf2c-f66eb5fc9642.root @ T2_US_MIT
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/01996c93-8d74-4972-86c7-69bb6692ec22.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/01bc4538-21cd-4fcf-8bec-2a43be7a0f13.root @ T1_US_FNAL_Disk
@sunilUIET @vlimant FYI
I tested the merge process in 13_0_13 using

python3 $CMSSW_RELEASE_BASE/src/Configuration/DataProcessing/test/RunMerge.py --output-file out.root --mergeNANO --input-file /store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/84cdf48b-eb83-44e5-8127-e10a940c6ae2.root,/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/85668cee-7f91-496a-8cbe-7aa9c3c75fdb.root,/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/857aff8a-2164-430f-a311-3ea4237842aa.root

then

cmsRun -j FrameworkJobReport.xml RunMergeCfg.py

and could indeed reproduce the crash.
One file has 35 reweighting weights and the other 34; hence the unmergeability. Is there a way to get the MINI and AOD files that led to the creation of the two files above?
The number of weights comes from a regexp over the LHE information: https://github.com/cms-sw/cmssw/blob/CMSSW_13_0_X/PhysicsTools/NanoAOD/plugins/GenWeightsTableProducer.cc#L905C36-L905C52 ; @cms-sw/generators-l2 how can the number of weights vary from one job to another?
@vlimant we discussed this with PdmV too; we will now have a look at how this happened. A priori I have no idea, as this was not the case before. Will report back asap.
@menglu21 FYI.
Is it possible that MG fails to compute one of the weight sets and therefore does not include it for one specific job? That would lead to a file, from that job, with only a subset of the weights; files that cannot be merged with others later on because of the size difference.
@bbilin @menglu21 do you have any news on the issue? The number of affected WFs is increasing, so we need to understand it as soon as possible to fix it.
Thanks
Hi all, please let me know if you still need those files. We would like to announce this WF if those files are not needed anymore.
@sunilUIET : please provide a list of the samples that exhibit this failure.
@vlimant here is the list (from a few weeks back) provided by P&R. @z4027163 can add if he has a more complete list.
Hi all, I see two ways forward: a.) If we can get the seeds from the original wmLHE request of the buggy nanos, we can check locally and see why this happens. b.) We extend runcmsgrid.sh by a line that checks the number of weights against our expectation; if the comparison fails, we abort the job. From the failing jobs it should then be easy to recover the seed info from the logs. Let me cc other mg5 people @sihyunjeon @Cvico @DickyChant. @srimanob Do you know if a.) is possible? Anyone else who knows?
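Option b.) could be implemented as a small helper that runcmsgrid.sh invokes on the final LHE file. The sketch below is only an illustration under assumptions: the helper names are invented, and the expected count is a placeholder (the real number depends on the request, e.g. 36 for this tWZ sample).

```python
import gzip
import re

def count_weights(lhe_path):
    """Count <weight id=...> entries in the LHE header block."""
    # Works on plain or gzipped LHE files; only the part before </header> is scanned.
    opener = gzip.open if lhe_path.endswith(".gz") else open
    with opener(lhe_path, "rt") as f:
        header = f.read().split("</header>")[0]
    return len(re.findall(r'<weight\s+id=', header))

def check_weights(lhe_path, expected):
    """Return True if the header carries the expected number of weights."""
    found = count_weights(lhe_path)
    if found != expected:
        print("weight count mismatch: found %d, expected %d" % (found, expected))
        return False
    return True
```

runcmsgrid.sh would then abort on a failed check, so the seed of any affected job stays visible in the failing job's log.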
I think a.) seems more important, because I just opened an error log which seems to suggest that the error happens at the NanoAOD merging step. Do we expect this is due to some missing weights?
Yes. A weight entry is missing but it is not clear where this is coming from. So ideally we get the seed that is used for runcmsgrid.sh in the wmLHE step for that specific nano so we can locally reproduce.
Exactly
A minor question: do we expect this to be reproducible at NanoGEN level as well, in case we lose the seed and have to start over?
I would guess so. But I think with a small modification of runcmsgrid.sh as proposed above we can also catch it, if indeed we cannot recover seeds of current workflows.
Hope we don't need either way!
Hi, I discussed with @hqucms and checked the corresponding MiniAOD dataset.
So for those files, if we run the standard NanoV12 sequence from CMSSW_13_0_13, we can already pick out some files that seem to be good (giving 35 weights, e.g. /store/mc/Run3Summer22MiniAODv4/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/MINIAODSIM/130X_mcRun3_2022_realistic_v5-v4/2820000/0ec39554-e050-4fdf-96dd-b143efb9cdd2.root) and some that seem to be bad (giving 34 weights, e.g. /store/mc/Run3Summer22MiniAODv4/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/MINIAODSIM/130X_mcRun3_2022_realistic_v5-v4/2820000/90f2a55d-a0cc-43b8-be1f-f51594bdc950.root).
Taking these two files as examples, if we retrieve the MiniAOD files and check the LHE run info header:
# With FWLite (inside a CMSSW environment)
from DataFormats.FWLite import Runs, Handle
runs = Runs("file:miniaod.root")  # placeholder: the MiniAOD file to inspect
lhehandle = Handle("LHERunInfoProduct")
for run in runs:
    run.getByLabel("externalLHEProducer", lhehandle)
    lheruninfo = lhehandle.product()  # here you get the strings that form the `XML` LHE header
The relevant output (i.e. the part with the reweighting weights) is:

Good file:
<weightgroup name="mg_reweighting" weight_name_strategy="includeIdInWeightName">
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo"/>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_m1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_m1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 15 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 15 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_m1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 13 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_m1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_m1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 23 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_1p_nlo">set param_card dim62f 23 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_m1p_ctg_0p_nlo">set param_card dim62f 19 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_1p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_m1p_nlo">set param_card dim62f 24 -1.0 # orig: 1e-05
</weight>
</weightgroup>
Bad file:
<weightgroup name="mg_reweighting" weight_name_strategy="includeIdInWeightName">
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo"/>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_m1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_m1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 15 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 15 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_m1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 13 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_m1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_m1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 23 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo"/>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_1p_nlo">set param_card dim62f 23 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_m1p_ctg_0p_nlo">set param_card dim62f 19 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_1p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_m1p_nlo">set param_card dim62f 24 -1.0 # orig: 1e-05
</weight>
</weightgroup>
Let me pick out the one line that differs:
Good: <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05 set param_card dim62f 23 1.0 # orig: 1e-05 </weight>
Bad: <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo"/>
It is obvious that both are valid XML syntax; unfortunately, our GEN weight Nano parser only accepts the good one :)
We need to understand why this happens (@agrohsje and @sihyunjeon please correct me, but I don't think this happens without reweighting launch names...?). I might need to look at the madgraph source code to understand...
What is also obvious: the very first line in both files is likewise not parsable by our Nano weight parser :) Therefore, if you count the weights in the header, there are 36 entries, while we end up with either 35 or 34 after running Nano. If we look at the card, I think it should indeed have 36 weights.
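To illustrate the parsing discrepancy, here is a minimal sketch (not the actual GenWeightsTableProducer code; its real pattern lives in the C++ plugin linked above, and the shortened weight ids below are invented for brevity): a pattern that requires an explicit open/close pair drops the self-closing form, while a real XML parser accepts both.

```python
import re
import xml.etree.ElementTree as ET

# Two header lines modeled on the good/bad files above (ids shortened)
good = '<weight id="ctw_1p_ctp_1p">set param_card dim62f 19 1.0 # orig: 1e-05</weight>'
bad = '<weight id="ctw_1p_ctp_1p"/>'

# Naive pattern requiring an open/close pair (illustrative only)
pair_only = re.compile(r'<weight id="([^"]+)">(.*?)</weight>', re.DOTALL)

def regex_ids(text):
    """Weight ids found by the pair-only pattern."""
    return [m.group(1) for m in pair_only.finditer(text)]

def xml_id(text):
    """Weight id as seen by a real XML parser; handles <w>...</w> and <w/>."""
    return ET.fromstring(text).get("id")
```

The pair-only pattern silently drops the self-closing line, which is exactly how one file ends up with 34 parsed weights and another with 35.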
Cannot reproduce the MG5 header with the same random seed and nevents from the bad file...
Hey, interesting finding. Just based on quick scanning, maybe this part https://github.com/mg5amcnlo/mg5amcnlo/blob/LTS/models/check_param_card.py#L572-L575 is not working as expected?
For the first lines that we are dropping: at least in madgraph internals this works as encoded; since the parameters are the same as the original values set in the customization card, it prints out nothing. But what is weird is the difference between the good and bad file...
On a separate note, though, that tWZ sample shouldn't have been submitted in the first place, since madspin+reweight was already found to be wonky IIRC (will open an issue on this in https://github.com/cms-sw/genproductions).
Actually the issue itself is more tricky than this, as the relevant bits show that it is supposed to always produce the <weight> </weight> syntax. (v265 has slightly different code content, but what is done there is similar; one can easily check by untarring the gridpack and looking at this file in the mg5 base dir.)
I think the other VHH sample is also affected, which doesn't have anything to do with the madspin+reweighting issue.
To me, the quicker (and uglier) solution is to fix the regex pattern we've been using (I don't know if this is really a fix, because from the madgraph source code one would never expect there could be another possible output syntax).
The better long-term solution is to leverage an existing XML parser instead of reinventing the wheel (like what we did for LHEInterface, and Kenneth's PR on refactoring the gen weight table, if I remember correctly?).
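For the quick regex route, a pattern that accepts both the paired and the self-closing form could look roughly like the sketch below; this is only an illustration, not the actual plugin code or the refactoring PR.

```python
import re

# Matches both <weight id="...">body</weight> and <weight id="..."/>
both_forms = re.compile(r'<weight\s+id="([^"]+)"\s*(?:/>|>(.*?)</weight>)', re.DOTALL)

def weight_ids(header):
    """All weight ids in a header fragment, regardless of tag form."""
    return [m.group(1) for m in both_forms.finditer(header)]
```

The non-greedy body group keeps each match confined to one weight entry even when entries are concatenated.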
For which you clearly see that it is supposed to always produce the <weight> </weight> syntax.
So somewhere this </weight> is getting dropped, producing /> instead, which I don't understand...
I think the other VHH sample is also affected, which doesn't have anything to do with the madspin+reweighting issue.
Yes, that's why I said it's a "separate note".
A lot of useful and confusing info in this thread. Let me catch up: 1.) You connect mini and nano: did you find the names of the mini input files in the logs of the corrupted nano? Do you have a link? 2.) How did you recover the seed of the wmLHE step? 3.) Do we still have the logs of the wmLHE step? We can fix the regex, but I am really worried that the same code executed on different machines produces different output.
(1): I chatted with @hqucms and we both just thought about rerunning on the published MiniAODs (the published MiniAOD dataset has ~1M events, while the corresponding nano is just 10k, so we believed there are buggy files, and luckily there are some). I just ran condor jobs with the standard nano sequence, checked the merge compatibility of the resulting nano files, and picked out the MiniAODs that give good and bad nano output lol
(2): The seed and number of events I got from the header! Madgraph stores the run_card in the header of the LHE files.
(3): Unfortunately no, and I cannot reproduce anything, it seems... But I might be omitting something... I do have the feeling that I saw a similar error again, but once I modified the mgbasedir code to verify my hypothesis about the functional part, the error disappeared...
Hmmm @DickyChant were you able to find other buggy cases? I am wondering if the bug always affects the same weight block ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo in this tWZ sample.
Something I have in mind but didn't want to check at 3am one day, 1.5 yr before graduation :)
But I think I can test it out today...
Fair enough. We need to understand how to get the info from the system. But OK, if re-launching privately worked, that is good (for this case). This inability to reproduce is really hitting us. We had the same in the past. If we keep failing, we should reach out to O&C. They should point us to the relevant people so we can discuss.
Can we have the production log of one of them?
The issue should appear as early as the wmLHE part, and I guess we don't gain much.
Yes, I mean: is it possible to get the log of the full chain? E.g., in https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_TOP-Run3Summer22wmLHEGS-00042__v1_T_240129_012657_9610/8001/TOP-Run3Summer22wmLHEGS-00042_0/82bedd7f-b31c-4955-989a-0c85a4445380-615-0-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log we can see the history of the LHE step.
My previous comment seems not to have been sent: I think what is needed is the wmLHE step, and I am afraid we cannot gain much from having those logs, since there is not enough printout.
I am not sure. I mean, we are looking for something that is not expected, so maybe there is something in the logs of wmLHE.
Hi all,
FYI, all the logs of this example TOP WF can be found under EOS. (Note these will be gone after the WF is announced.)
/eos/cms/store/logs/prod/recent/PRODUCTION/cmsunified_task_TOP-Run3Summer22wmLHEGS-00042__v1_T_240129_012657_9610/
The wmLHE log is under:
/eos/cms/store/logs/prod/recent/PRODUCTION/cmsunified_task_TOP-Run3Summer22wmLHEGS-00042__v1_T_240129_012657_9610/TOP-Run3Summer22wmLHEGS-00042_0
Best, Zhangqier Wang P&R Team
Sorry if that was not clear. It is not about where to find the logs. The point I need to know: if I know that the problem appears in this log: /eos/cms/store/logs/prod/recent/PRODUCTION/cmsunified_task_TOP-Run3Summer22wmLHEGS-00042__v1_T_240129_012657_9610/TOP-Run3Summer22NanoAODv12-00020_0MergeNANOEDMAODSIMoutput/cmsgwms-submit4.fnal.gov-1217965-0-log.tar.gz, how can I get the log of the corresponding file in wmLHE? Is the connection stored somewhere so I can go back in time? I cannot browse all logs.
This production seems buggy: cmsgwms-submit4.fnal.gov-1173286-0-log.tar.gz. See the log after untarring, job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log: "launch --rwgt_name=ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p Command "reweight /srv/job/WMTaskSpace/cmsRun1/lheevent/process/Events/cmsgrid/events.lhe.gz_0.lhe -from_cards --multicore=create" interrupted with error: KeyError : 'id (15,) is not in dim64f2l", but I am not sure they are the same issue. I'm testing with the same random seed to see whether this can be reproduced. @z4027163 is it possible to find the corresponding root file in DAS for job cmsgwms-submit4.fnal.gov-1173286-0-log.tar.gz?
Thanks @menglu21 for taking the time. Really great. @z4027163 if there is a way to find the logs for a single sample all the way from wmLHEGEN up to NANO, it would still be great to know.
Thanks for the progress! Please keep the steam on towards figuring out the issue.
@agrohsje I don't think there is a direct way to check it. I talked with Meng offline; the only option I can think of is to get the run/lumi/event numbers from the LHEGen log and proceed from there, e.g. checking the corresponding files in DBS.
Hi all,
I was talking to Josh Bendavid, who encountered a similar error. It seems this happens when one of the processes (the process that adds the weights) fails: the file is simply not updated, instead of the wmLHE job failing. He said he resolved it by checking the exit code in the generator and failing the job if the exit code is not 0, by adding "set -e" in the script.
I am wondering if you can confirm that it is the same problem.
Best, Zhangqier Wang P&R team
Hi
Do we know if this is a problem on the physics side (reweight values not being computed properly) or the technical side (like adding the reweight values to the LHE files taking too long, so it gives up and just moves on without adding the weight)?
I don't know, but presumably having the wmLHE jobs fail directly when this occurs would make debugging and reproducing much much easier either way.
I think it is likely a technical issue. Couldn't we make use of the existing MiniAODs to perform some checks? It just needs printing things out... I am away in the US this week for a conference, so I won't be able to perform a check until next week. The check could simply be: verify whether the numbers from that "failing point" are reasonable, e.g. numerically identical to another point nearby, and also check whether all buggy files are due to one particular point.
Nevertheless, just to remind you guys: we actually didn't have these LHEReweightingWeights in the UL production for a long while lol
I agree with Josh. We should add this as an additional check in runcmsgrid.sh, like we do with the XML checks.
Checking GEN-Run3Summer22EEwmLHEGS-00563: is this really the same error, or does it just happen to have the same error code 8001?
Asking because I am looking at the error report page. For this request there seems to be no NANO issue but rather a GS issue, and the detailed log tells me it is purely file I/O...
Can the production side double check if I am wrong?
Hi,
Recently, we have been observing failures at the Nano step in many MC production WFs, such as:
An exception of category 'LogicError' occurred while [0] Calling InputSource::readRun_ Exception Message: Trying to merge LHEScaleSumw with LHEScaleSumw failed the compatibility test.
The failure is random, and the failure percentage varies across WFs, sometimes more than 10-20%. Example WFs are:
https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_TOP-Run3Summer22wmLHEGS-00027
https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_HIG-Run3Summer22EEwmLHEGS-00330