Open alexander-held opened 7 months ago
FYI a fresh de-bugged ATLAS public file (MC PHYSLITE) for testing purposes can be downloaded by:
curl -sLO https://cernbox.cern.ch/remote.php/dav/public-files/BPIO76iUaeYuhaF/DAOD_PHYSLITE.37233417._000052.pool.root.1
I think the explanation here is related to what is happening in #1073
I'm trying to explain it from the perspective of how I remember it from pre-dask times:
The actual "schema" is what went into the form argument of ak.from_buffers. There you have one key per underlying buffer in some arbitrary mapping that returns plain numpy arrays. That means for a ListOffsetArray you need one key for the offsets and one key for the content. I'm not fully aware how everything works with dask, but I think the magic happens somewhere in _map_schema_uproot, where everything is translated to be usable for uproot.dask.
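As a rough illustration of the "one key for the offsets, one key for the content" idea, here is a toy model in plain Python. It does not call ak.from_buffers; the key names and the helper rebuild_lists are made up for this sketch.

```python
# Toy model of a buffer mapping: one key per underlying buffer.
# The key names are hypothetical, not the ones coffea or awkward generate.
container = {
    "jets-offsets": [0, 2, 2, 5],                     # list boundaries per event
    "jets-content": [10.0, 20.0, 30.0, 40.0, 50.0],   # flat per-jet values
}


def rebuild_lists(container, offsets_key, content_key):
    """Slice the flat content buffer into per-event lists using the offsets."""
    offsets = container[offsets_key]
    content = container[content_key]
    return [content[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]


events = rebuild_lists(container, "jets-offsets", "jets-content")
print(events)  # [[10.0, 20.0], [], [30.0, 40.0, 50.0]]
```

The point is only that the list structure cannot be rebuilt without the offsets buffer, so whatever provides that buffer has to be read.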
I adapted the PHYSLITE schema from how the NanoAOD schema is structured. In NanoAOD the ROOT files are written with plain arrays, with one branch holding the sizes of each array, so loading that branch is the obvious choice for getting the offsets. In PHYSLITE we have to choose an arbitrary branch of the collection. I think at the moment the first branch passed to zip_forms is used for offsets:
Unfortunately, it often happens that this is one of the expensive-to-read double-jagged branches. I had experimented with modifying the schema so that it tries to avoid using a double-jagged branch for offsets (like here, where it ends up being AnalysisJetsAuxDyn.EnergyPerSampling):
... but I never properly tested it, so I didn't merge it. Now, reading the code, I also notice that zip_forms has an offsets argument. That may be a better way than reordering the branches. Tagging @kyungeonchoi in case he wants to have a look at this at some point.
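To make the two ways of getting offsets concrete, here is a small sketch in plain Python. It is not the coffea implementation and does not use zip_forms; counts_to_offsets and pick_offsets_branch are hypothetical helpers, and the nesting depths in the example are assumptions for illustration.

```python
from itertools import accumulate


def counts_to_offsets(counts):
    """NanoAOD-style: a per-event counts branch gives offsets via a cumulative sum."""
    return [0] + list(accumulate(counts))


def pick_offsets_branch(branch_depths):
    """PHYSLITE-style, hypothetical: prefer the first singly-jagged branch
    (depth 1) as the offsets source, falling back to the first branch."""
    names = list(branch_depths)
    for name in names:
        if branch_depths[name] == 1:
            return name
    return names[0]


# Assumed nesting depth per branch: 1 = list per event, 2 = list of lists per event.
branches = {
    "AnalysisJetsAuxDyn.EnergyPerSampling": 2,
    "AnalysisJetsAuxDyn.pt": 1,
    "AnalysisJetsAuxDyn.eta": 1,
}

offsets = counts_to_offsets([2, 0, 3])
chosen = pick_offsets_branch(branches)
print(offsets)  # [0, 2, 2, 5]
print(chosen)   # AnalysisJetsAuxDyn.pt
```

Selecting the offsets source explicitly like this, rather than relying on branch order, is roughly what an offsets argument would allow.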
OK, now I notice I didn't read @alexander-held's description carefully enough. He actually wants to read the double-jagged branch EnergyPerSampling, but gets pt read in addition (because it is being used for the offsets). In less technical terms:
We have a long list of branches for the Jets. Iterating through them, they appear in this order:
AnalysisJetsAuxDyn.pt
AnalysisJetsAuxDyn.eta
AnalysisJetsAuxDyn.phi
...
AnalysisJetsAuxDyn.EnergyPerSampling
...
The PHYSLITE schema will make one common form out of this, where all these branches share a common offsets array and the first branch (AnalysisJetsAuxDyn.pt) is chosen to provide the offsets. Since the form is created before you access a branch, you will end up reading this branch in any case.
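The effect described above can be mimicked with a toy function. This is only a model of the behavior, not what dak.necessary_columns computes internally; required_branches is a hypothetical name.

```python
def required_branches(requested, collection_branches):
    """Toy model: the collection shares the offsets of its first branch,
    so that branch is needed in addition to whatever was requested."""
    offsets_source = collection_branches[0]
    needed = [offsets_source]
    for name in requested:
        if name not in needed:
            needed.append(name)
    return needed


jets = [
    "AnalysisJetsAuxDyn.pt",
    "AnalysisJetsAuxDyn.eta",
    "AnalysisJetsAuxDyn.phi",
    "AnalysisJetsAuxDyn.EnergyPerSampling",
]

needed = required_branches(["AnalysisJetsAuxDyn.EnergyPerSampling"], jets)
print(needed)  # ['AnalysisJetsAuxDyn.pt', 'AnalysisJetsAuxDyn.EnergyPerSampling']
```

Requesting only EnergyPerSampling still lists pt, which matches the extra branch seen in the issue.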
With the way ak.from_buffers works these days, we should be able to read only the offsets and not the additional data.
Should follow that up.
@nikoladze is being a hero and working on an AwkwardForth solution with @jpivarski to avoid overly hardcoding things.
Describe what you want to do
I am trying to read a specific branch in PHYSLITE files, Jets.EnergyPerSampling, and am seeing that reading this branch triggers the reading of another branch when using the PHYSLITE schema. I would like to understand whether this is intended. The same additional branch is not read with BaseSchema.
Reproducer:
which results in
Note the difference in the required branches.
This reproducer unfortunately relies on an ATLAS-internal file sitting at the UChicago AF behind an ATLAS login. We also have a public PHYSLITE file available which can be used to reproduce the same dak.necessary_columns behavior (see the commented-out lines in the script); however, it will crash at task graph execution time for reasons I do not understand. Perhaps it is an earlier iteration of PHYSLITE that is no longer supported by the current schema version or the current version of uproot.
cc @nikoladze as expert for this schema
Explain what documentation is missing
This is admittedly a very technical question that might go beyond what is useful in documentation, but I'd just like to understand whether the behavior is as intended.