LHE weights not recognized by GenWeightsTableProducer

matt-komm commented 4 years ago

Hi,

I am trying to produce nanoaod from a private 2016 miniaodv3 signal sample produced with MadGraph2.6.5. The https://github.com/cms-sw/cmssw/blob/master/PhysicsTools/NanoAOD/plugins/GenWeightsTableProducer.cc does not seem to pick up the scale and reweighting weights correctly. In particular the following regex' fail:

std::regex weightgroupRwgt("<weightgroup\\s+(?:name|type)=\"(.*)\"\\s*>"); matches <weightgroup name="mg_reweighting" weight_name_strategy="includeIdInWeightName"> but the outer most quotes are being captured resulting in mg_reweighting" weight_name_strategy="includeIdInWeightName being interpreted as groupname. A temporary fix I found is to use <weightgroup\\s+(?:name|type)=\"(\\S+)\".*> instead.
std::regex scalewmg26x("<weight\\s+(?:.*\\s+)?id=\"(\\d+)\"\\s*(?:lhapdf=\\d+|dyn=\\s*-?\\d+)?\\s*((?:[mM][uU][rR]|renscfact)=\"(\\S+)\"\\s+(?:[mM][uU][Ff]|facscfact)=\"(\\S+)\")(\\s+.*)?</weight>"); does not match <weight MUF="0.5" MUR="0.5" PDF="325300" id="2"> MUR=0.5 MUF=0.5 </weight> since at least the id and MUF, MUR attributes are in a different order.

I am curious why the xml is parsed here through a set of regex'. This seems to lead to some inflexibility if e.g. attributes are in a different order. Is there a reason not to use a common xml library, e.g. expat or xerces? I could imagine that the LHE header does not follow xml syntax precisely.

Here is one example file: root://gfe02.grid.hep.ph.ic.ac.uk/pnfs/hep.ph.ic.ac.uk/data/cms/store/user/mkomm/HNL/miniaod16v3_200517/HNL_dirac_all_ctau1p0e00_massHNL6p0_Vall6p496e-03/miniaod16v3_200517/200517_004822/0000/HNL2016_140.root

Cheers

lcorcodilos commented 4 years ago

Hi @matt-komm, I'm wondering if you were ever in contact with someone about this? I have centrally produced signal samples that have entirely empty LHEPdfWeight branches and I suspect this is the reason why. I actually found this issue through this [1] HN thread. I've asked there if anyone has been in contact with developers but wanted to ping here as well.

Thanks!

[1] - https://hypernews.cern.ch/HyperNews/CMS/get/generators/4710/1/3/1/1.html

clelange commented 4 years ago

Adding a few people to make sure they're following, since this seems to affect quite a few analyses and I haven't seen any action being taken.

Could you please suggest a way forward?

@agrohsje @qliphy @peruzzim @fgolf

matt-komm commented 4 years ago

BTW, I tried to fix this myself but it turns out that the LHE header generated by MG is not properly formatted. Hence automatic parsing is not really possible. I am wondering if this is a known issue. In the meantime I am manually saving the weights I am interested in for analysis ...

lcorcodilos commented 4 years ago

Thanks for sharing. Would you mind sending how you did this? I'm not seeing any progress from experts (please feel free to chime in if there's a private conversation happening :-) ) so I'd like to look into this for myself.

matt-komm commented 4 years ago

@lcorcodilos I just wrote a simple plugin https://github.com/LLPDNNX/LLPReco/blob/master/NANOProducer/plugins/LHEWeightsProducer.cc which can then be instructed to store various weights if you know their ID, e.g. https://github.com/LLPDNNX/LLPReco/blob/master/NANOProducer/test/testGenWeight.py#L141-L178 This is more of a brute force, intermediate solution where one needs to manually inspect the LHE header to figure out which ID correspond to what variation for each sample ...

arizzi commented 4 years ago

Hi guys, please have a look at this, as it may fix your problem with the different MG version: https://github.com/arizzi/cmssw/commit/34f6e8feb1c6c0bd52add4a02ae2b7f8a309d816

cheers, Andrea

On Thu, Jun 25, 2020 at 1:13 PM Matthias Komm notifications@github.com wrote:

@lcorcodilos https://github.com/lcorcodilos I just wrote a simple plugin https://github.com/LLPDNNX/LLPReco/blob/master/NANOProducer/plugins/LHEWeightsProducer.cc which can then be instructed to store various weights if you know their ID, e.g. https://github.com/LLPDNNX/LLPReco/blob/master/NANOProducer/test/testGenWeight.py#L141-L178 This is more of a brute force, intermediate solution where one needs to manually inspect the LHE header to figure out which ID correspond to what variation for each sample ...

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cms-nanoAOD/cmssw/issues/520#issuecomment-649474265, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABGKFFSFLCC2HNGDVHWSDN3RYMWJ3ANCNFSM4NHUVXOA .

arizzi commented 4 years ago

the reason for not using regular XML parser is because generators produce(d) invalid XML (e.g. opening a tag before the previous one was closed)

ph2632 commented 4 years ago

Dear all; May I kindly ping here to check whether any progress has been made fixing this? Thanks

agrohsje commented 4 years ago

Who is fixing this in the conversion from Mini to NANO-AOD? If properly configured, MG5.265 doesn't show any problems. We discussed the problem also w./ authors. For Powheg, xml check was already long time ago added to the scripts, should extend the MG5 tools in that direction.

peruzzim commented 4 years ago

It's not clear to me how to fix it centrally, if the generator content is in a broken format. I understand this is the case, it's not just adding another regexp (which is, by the way, a model we are trying to abandon). Last time I checked, lines were missing or placed in a wrong order, which makes it impossible to guess what the right thing should be without sample-specific information. The fact that the issue does not appear with the latest version of the generator does not fix the existing samples - what's the recommendation for using those?

kdlong commented 4 years ago

The only solution I see is to preprocess the header for this specific issue (for the first weight, which is now written into it's own group, the closing tag is written in the wrong place) and force it back to valid (or mostly valid...) xml before parsing it. An implementation by @dteague for this is here: https://github.com/kdlong/cmssw/blob/ImprovedParsing/GeneratorInterface/Core/src/LHEWeightHelper.cc#L103-L126

Indeed I would strongly prefer to force the generator to produce valid xml in future versions.

agrohsje commented 4 years ago

Thanks Kenneth. @peruzzim: Can this be adapted for NANO-AOD for the time being? Redoing samples isn't justified. I started modifying our scripts so that we catch possible xml bugs in new mg5 versions: https://github.com/cms-sw/genproductions/pull/2673

vargasa commented 1 year ago

From NanoAODv9 I see the branches the branches LHEScaleWeight and LHEPdfWeight with what it seems to be reasonable content. Can someone confirm If this issue was fixed?

vlimant commented 1 year ago

we'll have a related discussion next week : https://indico.cern.ch/event/1167738/

vlimant commented 1 year ago

please follow up in https://github.com/cms-sw/cmssw/pull/32167

cms-nanoAOD / cmssw

LHE weights not recognized by GenWeightsTableProducer #520