coffea.lookup_tools.evaluator for CMS BTV weights does not have a jet flavor dependence

CoffeaTeam / coffea

Basic tools and wrappers for enabling not-too-alien syntax when running columnar Collider HEP analysis.

https://coffeateam.github.io/coffea/

BSD 3-Clause "New" or "Revised" License


coffea.lookup_tools.evaluator for CMS BTV weights does not have a jet flavor dependence #205

Closed jrueb closed 3 years ago

jrueb commented 4 years ago

Describe the bug

The coffea evaluator for CMS BTV weights depends on pt, eta and discriminator value, while it should also depend on jet flavor.

To Reproduce

One can use the following:

import coffea.lookup_tools

extractor = coffea.lookup_tools.extractor()
# "* * <file>" imports every weight set in the file under its original name
wgts = "* * DeepCSV_102XSF_WP_V1.csv"
extractor.add_weight_sets([wgts])
extractor.finalize()
evaluator = extractor.make_evaluator()
print(evaluator["DeepCSV_0_mujets_central_0"])

It prints the string "3 dimensional histogram", leaving no room for a jet flavor dependence.

nsmith- commented 4 years ago

It's in the key name: {taggerName}_{workingPoint}_{sfTechnique}_{systematic}_{jetType}, where workingPoint is an enum of loose, medium, tight, and jetType is 0=b, 1=c, 2=udsg. I might ask @dnoonan08 to confirm these are the indices.
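A minimal sketch of how such a key could be assembled from its parts, assuming the index conventions described above (which are still to be confirmed in this thread). The helper `btag_key` and the two dictionaries are hypothetical and not part of coffea.

```python
# Hypothetical helper, not part of coffea: assemble an evaluator key from its parts.
# The index conventions below are the ones assumed in this thread (unconfirmed).
WORKING_POINTS = {"loose": 0, "medium": 1, "tight": 2}
JET_TYPES = {"b": 0, "c": 1, "udsg": 2}


def btag_key(tagger, working_point, technique, systematic, jet_type):
    """Return '{taggerName}_{workingPoint}_{sfTechnique}_{systematic}_{jetType}'."""
    wp = WORKING_POINTS[working_point]
    flav = JET_TYPES[jet_type]
    return f"{tagger}_{wp}_{technique}_{systematic}_{flav}"


# Reproduces the key used in the original report.
print(btag_key("DeepCSV", "loose", "mujets", "central", "b"))
```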

nsmith- commented 4 years ago

That said, clearly one wants to vectorize the evaluation over the jet hadronFlavor column, so perhaps this should be changed. For now, one can evaluate each flavor separately and use in-place masked assignment to collate the results. Note that if these are jagged arrays, it's a bit more complicated. Here's an example:

# Evaluate the scale factor for every jet under each flavor hypothesis (b, c, light)
bJetSF = evaluator['btag%iDeepCSV_1_comb_central_0'%year](tightJets.eta, tightJets.pt, tightJets.btag)
bJetSF_c = evaluator['btag%iDeepCSV_1_comb_central_1'%year](tightJets.eta, tightJets.pt, tightJets.btag)
bJetSF_udsg = evaluator['btag%iDeepCSV_1_incl_central_2'%year](tightJets.eta, tightJets.pt, tightJets.btag)

# Start from the b-jet values and overwrite the c (hadFlav==4) and light (hadFlav==0) entries in the flat content
bJetSF.content[(tightJets.hadFlav==4).content] = bJetSF_c[tightJets.hadFlav==4].content
bJetSF.content[(tightJets.hadFlav==0).content] = bJetSF_udsg[tightJets.hadFlav==0].content
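For flat (non-jagged) arrays, the same collation can be sketched with plain NumPy boolean masks. The arrays below are made-up stand-ins for the three evaluator outputs, not real scale factors.

```python
import numpy as np

# Toy per-jet hadron flavour: 5 = b, 4 = c, 0 = light (values are illustrative).
had_flav = np.array([5, 4, 0, 5, 0])

# Stand-ins for the three evaluator outputs, each evaluated for every jet.
sf_b = np.full(had_flav.shape, 0.95)
sf_c = np.full(had_flav.shape, 0.90)
sf_light = np.full(had_flav.shape, 1.10)

# Start from the b-jet values and overwrite the c and light entries in place.
sf = sf_b.copy()
sf[had_flav == 4] = sf_c[had_flav == 4]
sf[had_flav == 0] = sf_light[had_flav == 0]

print(sf)
```

Copying `sf_b` first keeps the original evaluator output untouched, at the cost of one extra array.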

jrueb commented 4 years ago

> It's in the key name: {taggerName}_{workingPoint}_{sfTechnique}_{systematic}_{jetType}

Besides the flavor evaluation not being vectorized, I also think it is very suboptimal to concatenate all the information into one long string. It only works if you know exactly what you're looking for. If that's not the case, you're required to construct a workaround with regular expressions or something similar.

For example, I cannot be sure what taggerName is, especially after #207. Some CSV files contain working points while other versions of the same scale factor CSV file don't, so I don't always know what workingPoint is. The same can hold true for sfTechnique.

Additionally, if one wants to use multiple working points or systematics, one will be forced to format a new string for every combination.

I think it would be very beneficial if one could use each key part individually, check whether it is present and access it. I think it could be solved with a multidimensional index, tuple indexing or simply more specific methods.
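As a rough illustration of the tuple-indexing idea, the key strings could be split into their parts once and stored in a dictionary. This is only a sketch: `split_key` is a hypothetical helper, the key list is made up, and it assumes that all five parts are always present and that only taggerName may contain underscores (a systematic name with an underscore would break it).

```python
def split_key(key):
    """Split '{taggerName}_{workingPoint}_{sfTechnique}_{systematic}_{jetType}'.

    Splitting from the right lets taggerName itself contain underscores.
    Assumes the other four parts never contain one.
    """
    tagger, wp, technique, systematic, jet_type = key.rsplit("_", 4)
    return tagger, int(wp), technique, systematic, int(jet_type)


# Made-up list of key names, following the pattern quoted in this thread.
keys = [
    "DeepCSV_0_mujets_central_0",
    "DeepCSV_1_comb_central_0",
    "DeepCSV_1_comb_up_1",
    "DeepCSV_1_incl_central_2",
]

# Tuple-indexed lookup: (tagger, wp, technique, systematic, jet_type) -> key string.
index = {split_key(k): k for k in keys}

# Each part can now be inspected on its own, e.g. which working points exist.
working_points = sorted({parts[1] for parts in index})
print(working_points)
```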

lgray commented 4 years ago

OK - since lookup tools is supposed to be incredibly generic, it sounds like this probably needs a layer similar to jetmet_tools for the JECs and such. It sounds like what needs to be kept around more is

It would help to have a use case and standard workflow to try to better understand what's useful. I haven't had to use the b-tag SFs myself so I don't know the most effective practices.

It may not be possible to fix vectorization very easily. Of course, some sugar over the jet flavor indices can be added, assuming people will always have a column of jet types hanging around. It looks like, for a good fraction of the b-tag SFs, the function is repeated for all the variations with some additional offset or minor change. However, I'm not sure that can be generalized to all b-tagging scale factors. If they'd stick to a single functional form it'd help, but alas...

Feel free to contribute improvements if you want this to move faster.

nsmith- commented 3 years ago

@jrueb does this tool satisfy your needs? https://coffeateam.github.io/coffea/api/coffea.btag_tools.BTagScaleFactor.html (in particular the eval function). If so, I think we can close this issue.

jrueb commented 3 years ago

Sorry for the late reply. BTagScaleFactor looks really good. I have a small request though. Would it be possible to have the systematic parameter of eval and __call__ split up by jet flavor, so that, for example, it becomes possible to set the correction to "up" only for light flavor jets? In my analysis I have to treat systematics from light jets independently.

nsmith- commented 3 years ago

@lgray that's exactly the division I think we should maintain going forward: lookup_tools is generic, and then the object-specific tools that are a bit easier to use go in btag_tools, jetmet_tools, etc.

@jrueb would doing something like this work for you?

import awkward as ak

sf = btag_sf.eval("central", events.Jet.hadronFlavour, abs(events.Jet.eta), events.Jet.pt)
sf_up = btag_sf.eval("up", events.Jet.hadronFlavour, abs(events.Jet.eta), events.Jet.pt)
# Take the "up" variation for light jets (hadronFlavour < 4) and the central value otherwise
sf_up_light = ak.where(events.Jet.hadronFlavour<4, sf_up, sf)

The nice thing is that you run the (somewhat expensive) SF evaluation only once per systematic and separate the flavors afterwards. A note on memory performance: it is better to use ak.where(events.Jet.hadronFlavour<4, sf_up, sf) directly in downstream formulas, to save space on temporary arrays.
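The same selection can be sketched on flat toy arrays, with `numpy.where` standing in for `ak.where`. The numbers are invented for illustration, and the heavy-flavor counterpart `sf_up_heavy` is an extra not taken from the snippet above.

```python
import numpy as np

# Toy inputs: hadron flavour per jet (5 = b, 4 = c, 0 = light) and two
# already-evaluated scale factor arrays (values are illustrative only).
had_flav = np.array([5, 4, 0, 0, 5])
sf_central = np.array([0.95, 0.90, 1.10, 1.05, 0.97])
sf_up = np.array([0.98, 0.95, 1.20, 1.15, 1.00])

is_light = had_flav < 4

# "up" only for light jets, central for b and c jets.
sf_up_light = np.where(is_light, sf_up, sf_central)

# The complementary variation: "up" only for b and c jets.
sf_up_heavy = np.where(is_light, sf_central, sf_up)

print(sf_up_light)
print(sf_up_heavy)
```

Both variations reuse the same two evaluated arrays, so no further scale factor evaluation is needed.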

jrueb commented 3 years ago

@nsmith- That looks good. Thank you!