CoffeaTeam / coffea

Basic tools and wrappers for enabling not-too-alien syntax when running columnar Collider HEP analysis.
https://coffeateam.github.io/coffea/
BSD 3-Clause "New" or "Revised" License
132 stars 126 forks source link

PHYSLITE schema and EnergyPerSampling branch #1074

Open alexander-held opened 7 months ago

alexander-held commented 7 months ago

Describe what you want to do I am trying to read a specific branch in PHYSLITE files, Jets.EnergyPerSampling, and am seeing that the reading of this branch triggers the reading of another branch when using the PHSYLITE schema. I would like to understand whether this is intended. The same additional branch is not getting read with BaseSchema.

reproducer:

import awkward as ak
import dask
import dask_awkward as dak
import coffea
import uproot
from coffea.nanoevents import NanoEventsFactory, PHYSLITESchema, BaseSchema
import warnings
warnings.filterwarnings("ignore")

print(f"awkward: {ak.__version__}")
print(f"dask-awkward: {dak.__version__}")
print(f"uproot: {uproot.__version__}")
print(f"coffea: {coffea.__version__}")

# download test file:
# ! curl -sLO https://cernbox.cern.ch/remote.php/dav/public-files/wJGWzAyirlWE6QV/DAOD_PHYSLITE.34858087._000001.pool.root.1
# fname = "DAOD_PHYSLITE.34858087._000001.pool.root.1"

# alternative test file at UChicago (ATLAS internal unfortunately)
fname = "/data/alheld/200gbps-atlas/mc23_13p6TeV.601229.PhPy8EG_A14_ttbar_hdamp258p75_SingleLep.deriv.DAOD_PHYSLITE.e8514_s4162_r14622_p6026/DAOD_PHYSLITE.37223155._000001.pool.root.1"

treename = "CollectionTree"

# with PHYSLITE schema
events = NanoEventsFactory.from_root({fname: treename}, schemaclass=PHYSLITESchema).events()
task = events.Jets.EnergyPerSampling
print(dak.necessary_columns(task))  # AnalysisJetsAuxDyn.pt is also needed to read this successfully
_ = task.compute()  # this works on the ATLAS-internal file, not on the public one

# with base schema
events = NanoEventsFactory.from_root({fname: treename}, schemaclass=BaseSchema).events()
task = events["AnalysisJetsAuxDyn.EnergyPerSampling"]
print(dak.necessary_columns(task))  # now only the AnalysisJetsAuxDyn.EnergyPerSampling branch is needed
_ = task.compute()  # this works on the ATLAS-internal file, not on the public one

which results in

awkward: 2.6.2
dask-awkward: 2024.3.0
uproot: 5.3.2
coffea: 2024.3.0
{'from-uproot-6b3682eded84ae2d1139bd3ef492e65d': frozenset({'AnalysisJetsAuxDyn.pt', 'AnalysisJetsAuxDyn.EnergyPerSampling'})}
{'from-uproot-17eac0900f1b57bfb72e934ecf720903': frozenset({'AnalysisJetsAuxDyn.EnergyPerSampling'})}

Note the difference in the required branches.

This reproducer unfortunately relies on an ATLAS-internal file sitting at the UChicago AF behind an ATLAS login. We also have a public PHYSLITE file available which can be used to reproduce the same dak.necessary_columns behavior (see the commented out lines in the script), however it will crash at task graph execution time for reasons I do not understand. Perhaps it is an earlier iteration of PHYSLITE and no longer supported by the current schema version or the current version of uproot.

cc @nikoladze as expert for this schema

Explain what documentation is missing This is admittedly a very technical question, might go beyond something that is all that useful in documentation but I'd just like to understand if behavior is as intended.

ekourlit commented 7 months ago

FYI a fresh de-bugged ATLAS public file (MC PHYSLITE) for testing purposes can be downloaded by:
curl -sLO https://cernbox.cern.ch/remote.php/dav/public-files/BPIO76iUaeYuhaF/DAOD_PHYSLITE.37233417._000052.pool.root.1

nikoladze commented 7 months ago

I think the explanation here is related to what is happening in #1073

I'm trying to explain it from the perspective of how i remember it from pre-dask times:

The actual "schema" is what went into the form argument of ak.from_buffers. There you have one key in some aribitrary mapping that returns plain numpy arrays for all the underlying buffers. That means for ListOffsetArray you need one key for the offsets and one key for the content. I'm not fully aware how everything works with dask, but i think the magic happens somewhere in _map_schema_uproot where everything is translated to be usable for uproot.dask. I adapted the PHYSLITE schema from how the NanoAOD schema is structured. In NanoAOD the root files are written with plain arrays where you have one branch with the sizes of each array. So in this case this is an obvious choice to load this array for getting the offsets. In PHYSLITE we have to choose an arbitrary branch of that collection. I think at the moment the first branch that will be passed to zip_forms will be used for offsets:

https://github.com/CoffeaTeam/coffea/blob/750b96dc53ad8b80480181e67e0f898ccf2b6d4d/src/coffea/nanoevents/schemas/base.py#L49-L55

Unfortunately it often happens that this is one of these expensive-to-read double-jagged branches. I had experimented modifying the schema such that it tries to avoid using a double-jagged branch (like here, where it ends up being AnalysisJetsAuxDyn.EnergyPerSampling) for offsets:

https://github.com/CoffeaTeam/coffea/compare/master...nikoladze:coffea:dev-avoid-doublejagged-physlite

... but never properly tested it so i didn't merge it. Now reading the code i also notice zip_forms has an offsets argument. That may be a better way instead of reordering the branches. Tagging @kyungeonchoi if he want's to have a look at this at some point.

nikoladze commented 7 months ago

Ok, now i notice i didn't read @alexander-held's description carefully enough. He actually wants to read the the double-jagged branch EnergyPerSampling, but gets pt read in addition (because this is being used for the offsets). In less technical terms:

We have a long list of branches for the Jets, iterating through them they appear in this order

AnalysisJetsAuxDyn.pt
AnalysisJetsAuxDyn.eta
AnalysisJetsAuxDyn.phi
...
AnalysisJetsAuxDyn.EnergyPerSampling
...

The PHYSLITE schema will make one common form out of this where all these branches share a common offset array and the first branch (AnalysisJetsAuxDyn.pt) is chosen to be used for the offsets. Since the form is created before you access a branch you will end up reading this branch in any case.

lgray commented 7 months ago

We should be able to not read the additional data and only the offsets with the way that ak.from_buffers works these days. Should follow that up.

matthewfeickert commented 2 months ago

@nikoladze is being a hero and working on an AwkwardForth solution with @jpivarski to avoid overly hardcoding things.