CoffeaTeam / coffea

Basic tools and wrappers for enabling not-too-alien syntax when running columnar Collider HEP analysis.
https://coffeateam.github.io/coffea/
BSD 3-Clause "New" or "Revised" License
128 stars 126 forks source link

Inconsistent behaviour of coffea.nanoevents.methods.nanoaod.JetArray compared with vector.backends.awkward.MomentumArray4D #1162

Closed green-cabbage closed 1 month ago

green-cabbage commented 1 month ago

Hello, I have been quick testing my skimmed jet kinematic outputs for my analysis workflows from the previous coffea version to the latest one, which replaces the internal vector to the hep vector package. However, when calculating the rapidity property of the vector, I observe some discrepancies:

import vector
import awkward as ak
import dask_awkward as dak
from coffea.nanoevents import NanoEventsFactory, NanoAODSchema

file_dict = {"root://eos.cms.rcac.purdue.edu:1094///store/mc/RunIISummer20UL18NanoAODv9/DYJetsToLL_M-100to200_TuneCP5_13TeV-amcatnloFXFX-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/40000/AA6F89B0-EDAA-3942-A3BB-BC3709722EB4.root": {"object_path": "Events", "steps": [[0, 10], [10, 20]], "uuid": "b16b5dea-fbcd-11ed-bae7-a2a0b8bcbeef"}}

events = NanoEventsFactory.from_root(
    file_dict,
    schemaclass=NanoAODSchema,
).events()
jet1 = events.Jet[:,0]

jet1_4D_vec = ak.zip({"x":jet1.x, "y":jet1.y, "z":jet1.z, "E":jet1.E}, with_name="Momentum4D")
print(ak.isclose(jet1_4D_vec.rapidity.compute(), jet1.rapidity.compute()))

This code prints [False, True, True, True, True, False, ..., True, True, True, True, False, True], which is surprising to me since I expected jet1_4D_vec.rapidity and jet1.rapidity to be equivalent. Perhaps the tolerance was too tight, so I instead tried this code: print(ak.sum(ak.isclose(jet1_4D_vec.rapidity.compute(), jet1.rapidity.compute(), atol=1e-5)))

Which returned a value of 17 (where total test sample was 20 events), meaning that 3 events went beyond the loosened tolerances. Given this information, I think it's quite likely that somewhere the coffea implementation of hep vector has gone wrong. Please let me know if you have any questions!

lgray commented 1 month ago

Yes there are more redefinitions of r and rho in some of the derived classes. You should try editing your local version of coffea to remove them and see when things become consistent. It will help me fix it more quickly.

green-cabbage commented 1 month ago

I have been taking another good look, and I noticed this warning statement:

[/depot/cms/kernels/root632/lib/python3.12/site-packages/coffea/nanoevents/methods/candidate.py:11](https://cms.geddes.rcac.purdue.edu/depot/cms/kernels/root632/lib/python3.12/site-packages/coffea/nanoevents/methods/candidate.py#line=10): FutureWarning: In version 2025.1.0 (target date: 2024-12-31 11:59:59-06:00), this will be an error.
To raise these warnings as errors (and get stack traces to find out where they're called), run
    import warnings
    warnings.filterwarnings("error", module="coffea.*")
after the first `import coffea` or use `@pytest.mark.filterwarnings("error:::coffea.*")` in pytest.
Issue: coffea.nanoevents.methods.vector will be removed and replaced with scikit-hep vector. Nanoevents schemas internal to coffea will be migrated. Otherwise please consider using that package!.
  from coffea.nanoevents.methods import vector

Whenever I called NanoEventsFactory.from_root(). I took a look at src/coffea/nanoevents/methods/candidate.py, and it looks like in that file it still explicitly calls on the old vector method. Unfortunately, I haven't had the chance to comment these out and see if it solves my workflow yet. I don't have write permission on my python env, so I gotta make a separate conda env every time I wanna prototype, which is a lengthy process since conda is slow at making new envs)

lgray commented 1 month ago

That's because right now vector is wrapped by the old coffea vector, which is now basically a shim class. There are a few methods that didn't get deleted that are probably screwing up the math.

To solve your conda env slowness issues, you should use mamba instead (conda install mamba -c conda-forge, and then you and drop-in replace conda <whatever> with mamba <whatever>.

Similarly you can create a conda environment git clone coffea and then pip install -e '.[dev]' and work with the code in real time, no rebuild necessary.

green-cabbage commented 1 month ago

Thanks for the suggestion; I have gone to do just that, and I was wondering if I could ask some questions after playing around a bit.

Where exactly is JetArray defined in the nanoaod.py file? I searched the whole repo and I couldn't find it, which is quite strange to me. Also, this may be a noob question, but what does noqa: F821 mean in the comments of the linked code? It's repeated many times in other parts of the src code.

Furthermore, what's the logic behind the these set of assignments? This pattern exists in other classes such as Muon and Electron, so I don't think it's a bug, but also the setup is switched abit in CandidateArray and PtEtaPhiECandidateArray , where it's the MomentumClass which is given the "Array" assignment this time (in JetArray, the MomentumClass is assigned the LorentzVectorArray.

Thanks in advance!

lgray commented 1 month ago

It's created when you use the @mixin decorator.