cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Failures at Nano LHEScaleSumw merge failed compatibility #43784

Open. sunilUIET opened this issue 7 months ago

sunilUIET commented 7 months ago

Hi,

Recently, we have been observing failures at Nano step with many MC Production WFs as


An exception of category 'LogicError' occurred while [0] Calling InputSource::readRun_ Exception Message: Trying to merge LHEScaleSumw with LHEScaleSumw failed the compatibility test.


The failures are random and the failure rate varies across WFs, sometimes exceeding 10-20%. Example WFs:

https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_TOP-Run3Summer22wmLHEGS-00027
https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_HIG-Run3Summer22EEwmLHEGS-00330
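For intuition, the failing check can be sketched as follows. This is a minimal illustration of the idea, not CMSSW's actual implementation: NanoAOD keeps the per-run sums of the LHE scale weights as a vector, and two runs can only be merged if the vectors have the same length, since merging adds them element by element.

```python
# Minimal illustration (not CMSSW code): per-run weight-sum vectors can only
# be merged when they have the same length.

def merge_lhe_scale_sumw(a, b):
    """Element-wise sum of two per-run weight-sum vectors."""
    if len(a) != len(b):
        raise RuntimeError("Trying to merge LHEScaleSumw with LHEScaleSumw "
                           "failed the compatibility test.")
    return [x + y for x, y in zip(a, b)]

print(merge_lhe_scale_sumw([1.0] * 35, [2.0] * 35)[:3])  # merge succeeds
try:
    merge_lhe_scale_sumw([1.0] * 35, [2.0] * 34)  # 35 vs 34 weights, as seen here
except RuntimeError as err:
    print(err)
```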

cmsbuild commented 7 months ago

cms-bot internal usage

cmsbuild commented 7 months ago

A new Issue was created by @sunilUIET sunil bansal.

@antoniovilela, @makortel, @smuzaffar, @sextonkennedy, @Dr15Jones, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 7 months ago

assign xpog

cmsbuild commented 7 months ago

New categories assigned: xpog

@vlimant,@hqucms you have been requested to review this Pull request/Issue and eventually sign? Thanks

vlimant commented 7 months ago

could someone please pull out two files whose merge leads to this failure?

sunilUIET commented 7 months ago

I will let someone from PnR comment on whether we can get such a list @z4027163

amanrique1 commented 7 months ago

There was an upgrade in dCache and some FNAL files were lost. FNAL is trying to verify what was damaged; meanwhile, I can speed this up by invalidating the replicas that you find.

z4027163 commented 7 months ago

We have a current example in the production system: task_TOP-Run3Summer22wmLHEGS-00042. The full list of unmerged files can be found in this error report: https://cms-unified.web.cern.ch/cms-unified/report/cmsunified_task_TOP-Run3Summer22wmLHEGS-00042__v1_T_240129_012657_9610 (after the sentence "1562 Files in no block for TOP-Run3Summer22NanoAODv12-00020_0MergeNANOEDMAODSIMoutput").

Here are some examples:

/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/002d042a-632e-4ebb-a018-0d5ef283ce6b.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/005c3d94-b6b9-462f-b9f4-d3551e4d4b71.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/006e8d8a-3c73-4125-bfdf-efde00c7b4ca.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/0112efa2-de05-41ea-b4de-683d6a860caf.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/0151db4c-458d-4f79-bf2c-f66eb5fc9642.root @ T2_US_MIT
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/01996c93-8d74-4972-86c7-69bb6692ec22.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/01bc4538-21cd-4fcf-8bec-2a43be7a0f13.root @ T1_US_FNAL_Disk

@sunilUIET @vlimant FYI
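For working through the full report, the "<LFN> @ <site>" entries above can be grouped by hosting site with a small helper. This is an illustrative sketch, not a production tool, and the paths in it are shortened stand-ins.

```python
# Illustrative helper to group "<LFN> @ <site>" report entries by site;
# the paths below are shortened stand-ins for the real LFNs.
import re
from collections import defaultdict

report = """
/store/unmerged/.../002d042a-632e-4ebb-a018-0d5ef283ce6b.root @ T1_US_FNAL_Disk
/store/unmerged/.../0151db4c-458d-4f79-bf2c-f66eb5fc9642.root @ T2_US_MIT
"""

by_site = defaultdict(list)
for lfn, site in re.findall(r"(\S+\.root)\s*@\s*(\S+)", report):
    by_site[site].append(lfn)

print({site: len(lfns) for site, lfns in by_site.items()})
# {'T1_US_FNAL_Disk': 1, 'T2_US_MIT': 1}
```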

vlimant commented 7 months ago

I tested the merge process in 13_0_13 using

python3 $CMSSW_RELEASE_BASE/src/Configuration/DataProcessing/test/RunMerge.py --output-file out.root --mergeNANO --input-file /store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/84cdf48b-eb83-44e5-8127-e10a940c6ae2.root,/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/85668cee-7f91-496a-8cbe-7aa9c3c75fdb.root,/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/857aff8a-2164-430f-a311-3ea4237842aa.root

followed by

cmsRun -j FrameworkJobReport.xml RunMergeCfg.py

and could indeed reproduce the crash.

vlimant commented 7 months ago

One file has 35 reweighting weights and the other 34; hence the unmergeability. Is there a way to get the MINI and AOD files that led to the creation of the two files above?

The number of weights comes from a regexp over the LHE information: https://github.com/cms-sw/cmssw/blob/CMSSW_13_0_X/PhysicsTools/NanoAOD/plugins/GenWeightsTableProducer.cc#L905C36-L905C52. @cms-sw/generators-l2 how can the number of weights vary from one job to the other?

bbilin commented 7 months ago

@vlimant we discussed this with PdmV too; we will now have a look at how this happened. A priori I have no idea, as this was not the case before. Will report back ASAP.

@menglu21 FYI.

vlimant commented 7 months ago

is it possible that MG fails to compute one of the weight sets and therefore does not include it for one specific job? That would lead to a file, from that job, with only a subset of the weights; such files cannot be merged with the others later on because of the size difference.

sunilUIET commented 6 months ago

@bbilin @menglu21 do you have any news on the issue? The number of affected WFs is increasing, so we need to understand it as soon as possible in order to fix it.

Thanks

z4027163 commented 6 months ago

Hi all, please let me know if you still need those files. We would like to announce this WF if those files are no longer needed.

vlimant commented 6 months ago

@sunilUIET : please provide a list of the samples that exhibit this failure.

sunilUIET commented 6 months ago

@vlimant here is the list (from a few weeks back) provided by PnR. @z4027163 can add if he has a more complete list

agrohsje commented 6 months ago

Hi all, I see two ways forward: a.) If we can get the seeds from the original wmLHE request of the buggy nanos, we can check locally and see why this happens. b.) We extend runcmsgrid.sh with a line that checks the number of weights against our expectation; if the comparison fails, we abort the job. From the failing jobs it should be easy to recover the seed info from the logs. Let me cc other mg5 people @sihyunjeon @cvico @dickychant. @srimanob Do you know if a.) is possible? Anyone else who knows?
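Option b.) could be sketched roughly as below. This is a Python sketch of the logic only, under stated assumptions: the real guard would be a few lines of shell in runcmsgrid.sh, and the header snippet plus the expected count are illustrative, not taken from any actual request.

```python
# Sketch of the option-b.) guard: count <weight ...> entries in the LHE
# header and abort the job if the count differs from the expectation.
# Header text and expected count are illustrative.
import re

def count_lhe_weights(header_text):
    # "<weight\b" matches both <weight id="..."/> and <weight id="...">,
    # but not <weightgroup>.
    return len(re.findall(r"<weight\b", header_text))

def check_weights(header_text, expected):
    n = count_lhe_weights(header_text)
    if n != expected:
        raise SystemExit("weight count %d != expected %d, aborting job" % (n, expected))
    return n

header = '''<weightgroup name="mg_reweighting">
<weight id="a"/>
<weight id="b">set param_card dim62f 22 1.0</weight>
</weightgroup>'''

print(check_weights(header, expected=2))  # 2
```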

DickyChant commented 6 months ago

Hi all, I see two ways forward: a.) If we can get the seeds from the original wmLHE request of the buggy nanos, we can check locally and see why this happens. b.) We extend runcmsgrid.sh with a line that checks the number of weights against our expectation; if the comparison fails, we abort the job. From the failing jobs it should be easy to recover the seed info from the logs. Let me cc other mg5 people @sihyunjeon @Cvico @DickyChant. @srimanob Do you know if a.) is possible? Anyone else who knows?

I think a.) seems more important, because I just opened an error log 1 which seems to suggest that the error happens at the NanoAOD merging step. Do we expect this is due to some missing weights?

agrohsje commented 6 months ago

Yes. A weight entry is missing but it is not clear where this is coming from. So ideally we get the seed that is used for runcmsgrid.sh in the wmLHE step for that specific nano so we can locally reproduce.

DickyChant commented 6 months ago

Yes. A weight entry is missing but it is not clear where this is coming from. So ideally we get the seed that is used for runcmsgrid.sh in the wmLHE step for that specific nano so we can locally reproduce.

Exactly

A minor question: do we expect this to be reproducible also at NanoGEN level in case we lose the seed and have to start over?

agrohsje commented 6 months ago

I would guess so. But I think with a small modification of runcmsgrid.sh as proposed above we can also catch it, if indeed we cannot recover seeds of current workflows.

DickyChant commented 6 months ago

I would guess so. But I think with a small modification of runcmsgrid.sh as proposed above we can also catch it, if indeed we cannot recover seeds of current workflows.

Hope we don't need either of them!

DickyChant commented 6 months ago

Hi I discussed with @hqucms and checked the corresponding MiniAOD dataset.

So for those files, if we run the standard NanoAODv12 sequence from CMSSW_13_0_13, we can already pick up some files that seem to be good (giving 35 weights, e.g. /store/mc/Run3Summer22MiniAODv4/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/MINIAODSIM/130X_mcRun3_2022_realistic_v5-v4/2820000/0ec39554-e050-4fdf-96dd-b143efb9cdd2.root) and some files that seem to be bad (giving 34 weights, e.g. /store/mc/Run3Summer22MiniAODv4/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/MINIAODSIM/130X_mcRun3_2022_realistic_v5-v4/2820000/90f2a55d-a0cc-43b8-be1f-f51594bdc950.root).

Taking these two files as examples, we can retrieve the MiniAOD files and check the LHE run info header:

# With FWLite (the input file name below is illustrative)
from DataFormats.FWLite import Runs, Handle

runs = Runs("file:miniaod.root")
lhehandle = Handle("LHERunInfoProduct")
for run in runs:
    run.getByLabel("externalLHEProducer", lhehandle)
    lheruninfo = lhehandle.product()  # its headers are the strings that form the `XML` LHE header

The relevant output (i.e. the part with the reweighting weights) is:

  1. For good file:
    <weightgroup name="mg_reweighting" weight_name_strategy="includeIdInWeightName">
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo"/>
    <weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 23 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 24 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_m1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 -1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_1p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
    set param_card dim62f 22 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_1p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
    set param_card dim62f 22 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
    set param_card dim62f 22 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
    set param_card dim62f 23 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
    set param_card dim62f 22 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
    set param_card dim62f 24 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_m1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 -1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_1p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
    set param_card dim62f 15 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
    set param_card dim62f 15 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
    set param_card dim62f 23 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
    set param_card dim62f 19 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
    set param_card dim62f 24 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_m1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 -1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
    set param_card dim62f 13 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
    set param_card dim62f 23 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
    set param_card dim62f 19 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
    set param_card dim62f 24 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_m1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 -1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
    set param_card dim62f 23 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
    set param_card dim62f 19 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
    set param_card dim62f 24 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_m1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 23 -1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
    set param_card dim62f 23 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_1p_nlo">set param_card dim62f 23 1.0 # orig: 1e-05
    set param_card dim62f 24 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_m1p_ctg_0p_nlo">set param_card dim62f 19 -1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_1p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
    set param_card dim62f 24 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_m1p_nlo">set param_card dim62f 24 -1.0 # orig: 1e-05
    </weight>
    </weightgroup>
  2. For bad file:
    <weightgroup name="mg_reweighting" weight_name_strategy="includeIdInWeightName">
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo"/>
    <weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 23 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 24 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_m1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 -1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_1p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
    set param_card dim62f 22 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_1p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
    set param_card dim62f 22 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
    set param_card dim62f 22 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
    set param_card dim62f 23 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
    set param_card dim62f 22 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
    set param_card dim62f 24 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_m1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 -1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_1p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
    set param_card dim62f 15 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
    set param_card dim62f 15 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
    set param_card dim62f 23 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
    set param_card dim62f 19 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
    set param_card dim62f 24 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_m1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 -1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
    set param_card dim62f 13 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
    set param_card dim62f 23 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
    set param_card dim62f 19 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
    set param_card dim62f 24 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_m1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 -1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
    set param_card dim62f 23 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
    set param_card dim62f 19 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
    set param_card dim62f 24 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_m1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 23 -1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo"/>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_1p_nlo">set param_card dim62f 23 1.0 # orig: 1e-05
    set param_card dim62f 24 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_m1p_ctg_0p_nlo">set param_card dim62f 19 -1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_1p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
    set param_card dim62f 24 1.0 # orig: 1e-05
    </weight>
    <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_m1p_nlo">set param_card dim62f 24 -1.0 # orig: 1e-05
    </weight>
    </weightgroup>

Let me pick out the one line that differs:

  3. good: <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05 set param_card dim62f 23 1.0 # orig: 1e-05 </weight>
  4. bad: <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo"/>

It is obvious that both are valid XML syntax; unfortunately, our GEN weight Nano parser only accepts the first (paired) form :)

Relevant code: https://github.com/cms-sw/cmssw/blob/b4572d430a07a0a38f665556c54b7e87379065db/PhysicsTools/NanoAOD/plugins/GenWeightsTableProducer.cc#L595

We need to understand why this happens (@agrohsje and @sihyunjeon please correct me, but I don't think this happens without reweighting launch names...?). I might need to look at the madgraph source code to understand...

What is also obvious: the very first weight line from both files is likewise not parsable by our Nano weight parser :) Therefore, if you count the weights there are 36 values, while we end up with either 35 or 34 entries after running NANO on them.

If we look at the card, I at least think it should have 36 weights.
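The parser difference can be demonstrated with a standalone sketch. The regex below is illustrative of the pattern-matching approach, not the exact expression in GenWeightsTableProducer.cc: a pattern that requires a paired <weight id="...">...</weight> silently drops self-closing tags, while a real XML parser accepts both forms.

```python
# Regex-based vs XML-based parsing of a weightgroup block; the regex here
# is illustrative, not the one used in GenWeightsTableProducer.cc.
import re
import xml.etree.ElementTree as ET

weightgroup = '''<weightgroup name="mg_reweighting" weight_name_strategy="includeIdInWeightName">
<weight id="good">set param_card dim62f 19 1.0 # orig: 1e-05</weight>
<weight id="bad"/>
</weightgroup>'''

# Regex-style parsing: only the paired <weight>...</weight> form is found.
regex_ids = re.findall(r'<weight id="([^"]+)">.*?</weight>', weightgroup, re.S)

# XML parsing: both forms are valid XML and both are found.
xml_ids = [w.get("id") for w in ET.fromstring(weightgroup).iter("weight")]

print(regex_ids)  # ['good']  -- the self-closing weight is silently dropped
print(xml_ids)    # ['good', 'bad']
```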

DickyChant commented 6 months ago

Cannot reproduce the MG5 header with the same random seed and nevents from the bad file....

sihyunjeon commented 6 months ago

hey, interesting finding. Just based on a quick scan, maybe this part https://github.com/mg5amcnlo/mg5amcnlo/blob/LTS/models/check_param_card.py#L572-L575 is not working as expected?

For the first lines that we are dropping, at least within madgraph it works as encoded: since the parameters are the same as the original values set in the customization card, it prints out nothing. But what is weird is the difference between the good and bad file...

On a separate note, though, that tWZ sample shouldn't have been submitted in the first place, since madspin+reweight was already found to be wonky IIRC (will open an issue on this in https://github.com/cms-sw/genproductions)

DickyChant commented 6 months ago


Actually the issue itself is more tricky than this as the relevant bits are:

https://github.com/mg5amcnlo/mg5amcnlo/blob/59b4b9c1238978f39a32b8bc83244328187704b6/madgraph/interface/reweight_interface.py#L870C1-L883C80

There you clearly see that it is supposed to always produce the paired <weight> </weight> syntax. (v265 has slightly different code content, but what is done there is similar; one can easily check by untarring the gridpack and inspecting this file in the MG5 base directory.)

I think the other VHH sample is also affected, which doesn't have anything to do with the madspin+reweighting issue.

To me, the quicker (and uglier) solution is to fix the regex pattern we've been using (I don't know whether this really counts as a fix, because from the madgraph source code one would never expect another possible output syntax).

The better long-term solution is to leverage an existing XML parser instead of reinventing the wheel (like what we did for LHEInterface, and Kenneth's PR on refactoring the gen weight table, if I remember correctly?).

sihyunjeon commented 6 months ago

For which you clearly see that it is supposed to be always producing <weight> </weight> syntax.

So somewhere this closing </weight> is getting dropped and turned into a self-closing />, which I don't understand...

I think the other VHH sample is also influenced which doesn't have anything todo with the madspin+reweighting issue.

Yes that's why i said it's a "separate note"

agrohsje commented 6 months ago

A lot of useful and confusing info in that thread. Let me catch up:
1.) You connect mini and nano: did you find the names of the mini input files in the logs of the corrupted nano? Do you have a link?
2.) How did you recover the seed of the wmLHE step?
3.) Do we still have the logs of the wmLHE step?
We can fix the regex, but I am really worried that the same code executed on different machines produces different output.

DickyChant commented 6 months ago

A lot of useful and confusing info in that thread. Let me catch up: 1.) You connect mini and nano: Did you find the name of the mini input files in the logs of the corrupted nano? Do you have a link? 2.) How did you recover the seed of the wmLHE step? 3.) Do we still have the logs of the wmLHE step? We can fix the regex but I am really worried that the same code executed on different machines produces different output.

(1): I chatted with @hqucms and we both just thought of running over the published MiniAODs (the published MiniAOD dataset has ~1M events, while the corresponding nano is just 10k, so we believed there are buggy files, and luckily there are some). I just ran condor jobs with the standard nano sequence, checked the merge compatibility of the resulting nano files, and picked up the MiniAODs that give good and bad nano output lol
(2): The seed and number of events I got are from the header! A madgraph run stores the run_card in the header of the LHE files.
(3): Unfortunately no, and it seems I cannot reproduce anything... But I might be omitting something... I do have the feeling that I saw a similar error again, but once I modified the mgbasedir code to verify my hypothesis on the functional part, the error disappeared...
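Recovering the seed this way relies on MadGraph writing the run_card into the LHE header. A sketch of the extraction, where the run_card fragment below is illustrative of the usual "<value> = <name>" format:

```python
# Extract integer settings (seed, nevents) from a run_card block as stored
# in the LHE header; the fragment below is an illustrative stand-in.
import re

run_card = """
 10000 = nevents ! Number of unweighted events requested
 12345 = iseed   ! rnd seed (0=assigned automatically=default))
"""

def run_card_value(text, name):
    """Extract an integer setting such as ' 12345 = iseed' from a run_card block."""
    m = re.search(r"^\s*(\d+)\s*=\s*%s\b" % re.escape(name), text, re.M)
    return int(m.group(1)) if m else None

print(run_card_value(run_card, "iseed"), run_card_value(run_card, "nevents"))  # 12345 10000
```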

sihyunjeon commented 6 months ago

hmmm @DickyChant were you able to find other buggy cases? I am wondering if the bug always affects the same weight block, ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo, in this tWZ sample

DickyChant commented 6 months ago

hmmm @DickyChant were you able to find other buggy cases? i am wondering if the bug always affects the same weight block ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo in this twz sample

Sth I have in mind but didn't want to check at 3am one day which is 1.5 yr before graduation :)

DickyChant commented 6 months ago

hmmm @DickyChant were you able to find other buggy cases? i am wondering if the bug always affects the same weight block ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo in this twz sample

Sth I have in mind but didn't want to check at 3am one day which is 1.5 yr before graduation :)

But i think i can test it out today...

agrohsje commented 6 months ago

Fair enough. We need to understand how to get the info from the system. But OK, if re-launching privately worked, that is good (for this case). This inability to reproduce is really hurting us; we had the same in the past. If we keep failing, we should reach out to O&C. They should point us to the relevant people so we can discuss.

menglu21 commented 6 months ago

We have an example current in the production system: task_TOP-Run3Summer22wmLHEGS-00042 The full list of unmerged files can be found in this error report: https://cms-unified.web.cern.ch/cms-unified/report/cmsunified_task_TOP-Run3Summer22wmLHEGS-00042__v1_T_240129_012657_9610 (after the sentence "1562 Files in no block for TOP-Run3Summer22NanoAODv12-00020_0MergeNANOEDMAODSIMoutput".

Here are some examples:

/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/002d042a-632e-4ebb-a018-0d5ef283ce6b.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/005c3d94-b6b9-462f-b9f4-d3551e4d4b71.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/006e8d8a-3c73-4125-bfdf-efde00c7b4ca.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/0112efa2-de05-41ea-b4de-683d6a860caf.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/0151db4c-458d-4f79-bf2c-f66eb5fc9642.root @ T2_US_MIT
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/01996c93-8d74-4972-86c7-69bb6692ec22.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/01bc4538-21cd-4fcf-8bec-2a43be7a0f13.root @ T1_US_FNAL_Disk

@sunilUIET @vlimant FYI

can we have the production log of one of them

DickyChant commented 6 months ago


can we have the production log of one of them

The issue should appear as early as the wmLHE part, and I guess we don't gain much

menglu21 commented 6 months ago

yes, I mean: is it possible to get the log of the full chain, e.g. in https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_TOP-Run3Summer22wmLHEGS-00042__v1_T_240129_012657_9610/8001/TOP-Run3Summer22wmLHEGS-00042_0/82bedd7f-b31c-4955-989a-0c85a4445380-615-0-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log? There we can see the history of the LHE step.

DickyChant commented 6 months ago

> We have an example current in the production system: task_TOP-Run3Summer22wmLHEGS-00042 [...]
>
> can we have the production log of one of them

My previous comment seems not to have been sent: I think what is needed is the wmLHE step, and I am afraid we cannot gain much from having those logs, since there is not enough printout in them.

agrohsje commented 6 months ago

I am not sure. I mean, we are looking for something that is not expected, so maybe there is something in the logs of wmLHE.

z4027163 commented 6 months ago

Hi all, FYI, all the logs of this example TOP WF can be found under EOS. (Note: these will be gone after the WF is announced.) /eos/cms/store/logs/prod/recent/PRODUCTION/cmsunified_task_TOP-Run3Summer22wmLHEGS-00042__v1_T_240129_012657_9610/

The wmLHE log is under: /eos/cms/store/logs/prod/recent/PRODUCTION/cmsunified_task_TOP-Run3Summer22wmLHEGS-00042__v1_T_240129_012657_9610/TOP-Run3Summer22wmLHEGS-00042_0

Best, Zhangqier Wang P&R Team

agrohsje commented 6 months ago

Sorry if that was not clear. It is not about where to find the logs. The point I need to know: if I know that the problem appears in this log: /eos/cms/store/logs/prod/recent/PRODUCTION/cmsunified_task_TOP-Run3Summer22wmLHEGS-00042__v1_T_240129_012657_9610/TOP-Run3Summer22NanoAODv12-00020_0MergeNANOEDMAODSIMoutput/cmsgwms-submit4.fnal.gov-1217965-0-log.tar.gz, how can I get the log of the corresponding file in wmLHE? Is the connection stored somewhere so I can go back in time? I cannot browse all the logs.

menglu21 commented 6 months ago

This production seems buggy: see cmsgwms-submit4.fnal.gov-1173286-0-log.tar.gz. After untarring, the log job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log contains: launch --rwgt_name=ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p Command "reweight /srv/job/WMTaskSpace/cmsRun1/lheevent/process/Events/cmsgrid/events.lhe.gz_0.lhe -from_cards --multicore=create" interrupted with error: KeyError : 'id (15,) is not in dim64f2l", but I am not sure they are the same issue. I'm testing with the same random seed to see whether this can be reproduced. @z4027163 is it possible to find the corresponding ROOT file in DAS for job cmsgwms-submit4.fnal.gov-1173286-0-log.tar.gz?

agrohsje commented 6 months ago

Thanks @menglu21 for taking the time. Really great. @z4027163 if there is a way to find the logs for a single sample all the way from wmLHEGEN up to NANO it would still be great to know.

vlimant commented 5 months ago

Thanks for the progress! Please keep the steam on towards figuring out the issue.

z4027163 commented 5 months ago

@agrohsje I don't think there is a direct way to check it. I talked with Meng offline; the only option I can think of is to get the run/lumi/event number from the LHEGen log and proceed from there, e.g. checking the corresponding files in DBS.

z4027163 commented 5 months ago

Hi all,

I was talking to Josh Bendavid, who encountered a similar error. It seems this happens when one of the processes (the process that adds the weights) fails: the file is simply not updated, instead of the wmLHE job failing. He said he resolved it by checking the exit code in the generator and failing the job if the exit code is not 0, by adding "set -e" in the script.

I am wondering if you can confirm that it is the same problem.

Best, Zhangqier Wang P&R team
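The fail-fast pattern described above (abort the whole job as soon as a sub-step exits non-zero, instead of silently continuing with a stale file) can be sketched in Python; this is an analogy only, not the actual production script, where the one-line bash fix is `set -e`. The function and step names here are hypothetical.

```python
import subprocess
import sys


def run_step(name, cmd):
    """Run one production sub-step and abort the whole job on failure.

    Without a check like this, a failing reweighting step would leave the
    old LHE file in place and the job would keep going, producing output
    that can later fail the merge compatibility test.
    """
    result = subprocess.run(cmd, shell=True)
    if result.returncode != 0:
        # Equivalent effect to `set -e` in a bash driver script.
        sys.exit(f"step '{name}' failed with exit code {result.returncode}")
    return result.returncode


if __name__ == "__main__":
    # Hypothetical steps; a real script would invoke the generator tools here.
    run_step("lhe_generation", "true")
    run_step("reweighting", "true")
    print("all steps succeeded")
```

With this wrapper, a non-zero exit from the reweighting command terminates the job immediately, so the failure surfaces at the wmLHE step rather than much later at the Nano merge.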

sihyunjeon commented 5 months ago

Hi

Do we know if this is a problem on the physics side (reweight values not being computed properly) or the technical side (e.g. adding the reweight value to the LHE files takes too long, so it gives up and moves on without adding the weight)?

bendavid commented 5 months ago

I don't know, but presumably having the wmLHE jobs fail directly when this occurs would make debugging and reproducing much much easier either way.

DickyChant commented 5 months ago

I think it is likely a technical issue:

  1. Not all of the files fail, which suggests the physics might be fine.
  2. If you look at Meng's report, it clearly states that reweighting with one particular point is buggy, which agrees with what Josh pointed out.

Could we make use of the existing MiniAODs to perform some checks? It just needs some printouts. I am away in the US this week for a conference, so I won't be able to run a check until next week.

The check could simply be whether the numbers from that "failing point" are reasonable, e.g. numerically identical to another nearby point, and whether all buggy files are due to one particular point.

Nevertheless, just a reminder that we actually didn't have these LHEReweightingWeights in the UL production for a long while.
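The sanity check sketched in the comment above could look something like this. The arrays here are synthetic stand-ins for per-event values that would in practice be read from the LHEReweightingWeight branch of the produced files; the specific tolerance and the idea that a suspect point should be close to a neighbouring point are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-ins for per-event reweighting weights at two nearby
# points in parameter space; in practice these would be read from the
# LHEReweightingWeight branch of the MiniAOD/NanoAOD files.
weights_suspect = np.array([1.02, 0.98, 1.01, 0.0, 1.03])    # suspect point
weights_neighbor = np.array([1.01, 0.99, 1.00, 0.97, 1.02])  # nearby point

# Ratio of suspect to neighbour weights, guarding against division by zero.
ratio = np.divide(weights_suspect, weights_neighbor,
                  out=np.zeros_like(weights_suspect),
                  where=weights_neighbor != 0)

# Flag events where the suspect point is wildly inconsistent with its
# neighbour (here: exactly zero, or differing by more than 50%).
bad = (weights_suspect == 0) | (np.abs(ratio - 1) > 0.5)
print(f"{bad.sum()} / {len(bad)} events look buggy at the suspect point")
```

Run over all buggy files, a check like this would also show whether every failure traces back to the same reweighting point.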

agrohsje commented 5 months ago

I agree with Josh. We should add this as an additional check in runcmsgrid.sh, like we do with the XML checks.

DickyChant commented 5 months ago

Checking GEN-Run3Summer22EEwmLHEGS-00563: is this really the same error, or does it just happen to have the same error code 8001?

Asking because I am looking at the error report page. For this request there seems to be no NANO issue but rather a GS issue, and the detailed log shows it is purely file I/O...

Can the production side double check whether I am wrong?