alexander-held opened 7 months ago
Some more examples that might be useful when debugging:

- `AnalysisPhotonsAuxDyn.ptcone20_CloseByCorr` in root://192.170.240.143:1094//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data18_13TeV/1f/87/DAOD_PHYSLITE.37020635._000031.pool.root.1
- `AnalysisPhotonsAuxDyn.topoetcone40_CloseByCorr` in root://192.170.240.143:1094//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data18_13TeV/1f/87/DAOD_PHYSLITE.37020635._000031.pool.root.1
Don't know if this is relevant here, but we've seen the same issue (with other files) when the number of entries in the tree is not the same as the actual length of the first dimension of a branch.
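For what it's worth, one way to spot that kind of mismatch with uproot could look like the sketch below (file and branch names are placeholders, not one of the actual affected files):

```python
import uproot

# placeholders: any file/tree/branch suspected of the mismatch described above
tree = uproot.open({"suspect_file.root": "CollectionTree"})
branch = tree["AnalysisPhotonsAuxDyn.pt"]

# the number of tree entries should match the outer length of the branch as read
print(tree.num_entries, len(branch.array()))
```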
It looks like these are actually buggy samples, in the sense that different fields of the same collection don't have the same length in an event. I've seen this before, and it has been explained to me that this can happen due to a mechanism in Athena that attempts to backfill branches only created later in the event loop with empty vectors. When, due to a bug in the code, a branch is then forgotten somewhere, this mechanism can lead to incorrect length-0 vectors for certain branches.
Checking if this is happening in one of the example files:
```python
import uproot
import awkward as ak
import numpy as np


def check_collection(tree, collection_name, ref_name):
    # read all branches belonging to the collection plus a reference branch
    keys = [k for k in tree.keys() if k.startswith(collection_name)]
    arrays = tree.arrays(keys)
    ref = tree[ref_name].array()
    for array, field in zip(ak.unzip(arrays), arrays.fields):
        # skip branches that uproot reads as records (they have sub-fields)
        if array.fields:
            continue
        # strip the tree/collection prefix to get a short field name for printing
        if "/" in field:
            field = field.split("/")[1]
        field = field.split(".", maxsplit=1)[1]
        # compare the per-event list lengths against the reference branch
        different_num = ak.num(ref) != ak.num(array)
        if ak.any(different_num):
            print(
                f"Different number of entries for {field}: "
                f"{ak.num(array)[different_num]} vs {ak.num(ref)[different_num]} in ref, "
                f"at entries {np.where(different_num)[0].tolist()}"
            )


treename = "CollectionTree"
fname = "root://192.170.240.143:1094//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data18_13TeV/1f/87/DAOD_PHYSLITE.37020635._000031.pool.root.1"
tree = uproot.open({fname: treename})
check_collection(tree, "AnalysisPhotonsAuxDyn", "AnalysisPhotonsAuxDyn.pt")
```
gives
```
Different number of entries for topoetcone20_CloseByCorr: [0] vs [2] in ref, at entries [58318]
Different number of entries for ptcone20_CloseByCorr: [0] vs [2] in ref, at entries [58318]
Different number of entries for topoetcone40_CloseByCorr: [0] vs [2] in ref, at entries [58318]
```
So this will eventually have to be fixed upstream. Of course we can't easily fix it in already-produced PHYSLITE files. Currently I don't have great ideas for a workaround, since we can't zip arrays with different-length lists. We could fill the empty lists with None values (using masked arrays) or arbitrary values like -999 or NaN, but that would need to happen at the point where the arrays are read. Maybe one could put something into the coffea nanoevents transforms, but it would make everything a bit ugly, since every form-key evaluation would then also need to process the offset array of a reference branch to figure out to which length to actually fill the lists ...
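For illustration, here is a standalone sketch of that padding idea; the function name and toy arrays are made up, and this is not part of any existing transform:

```python
import awkward as ak


def pad_to_reference(array, ref):
    """Fill wrong-length (e.g. empty) lists with None so they match a reference branch."""
    # target number of elements per event, taken from the reference branch
    target = ak.num(ref)
    # pad every list up to the longest target length with None ...
    padded = ak.pad_none(array, ak.max(target), axis=1)
    # ... then trim each list back to the per-event target length
    return padded[ak.local_index(padded, axis=1) < target]


ref = ak.Array([[1.0, 2.0], [3.0]])
buggy = ak.Array([[], [4.0]])  # wrong length-0 list in the first event
fixed = pad_to_reference(buggy, ref)
# fixed is [[None, None], [4.0]] and can now be zipped with ref
```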
Hello, @alexander-held suggested I post here in case it can be useful, since I'm seeing similar behaviour (i.e. sporadic errors of the form below that are not entirely reproducible). I'm running on an internal ATLAS file format (not PHYSLITE) and don't see any issues running something like @nikoladze's script on it. I'm happy to share any additional details or files of course.
Script to reproduce the error (might need to run `python debug_forms.py` 5-10 times to get the error for sure): https://gitlab.cern.ch/atlas-physics/HDBS/DiHiggs/bbbb/bbbbarista/-/blob/boosted_dev-srettie/debug_forms.py
Schema used and preprocessing file for completeness:
End of error stack trace:
```
TypeError: size of array (27102) is less than size of form (59922)

This error occurred while calling

    ak.from_buffers(
        RecordForm-instance
        10998
        {'/data/eventNumber%2C%21load': array([2053010529, 2052965881, 205410...
        behavior = {'Systematic': <class 'coffea.nanoevents.methods.base.Syst...
        buffer_key = partial-instance
    )
```
@sebastien-rettie it seems you are trying to zip together the `recojet_antikt10UFO` and the `recojet_antikt4PFlow` branches - probably they should be separate. I tried the following:
```python
from coffea.nanoevents import NanoEventsFactory
from schema import NtupleSchema

events = NanoEventsFactory.from_root(
    {"user.caiyi.40860313._002582.output-tree.root": "AnalysisMiniTree"},
    schemaclass=NtupleSchema,
).events()
events.compute()
```
This raises a similar exception - if I go into the debugger and step up until I hit `from_buffers`, I can inspect the global form you created:
```python
import pprint

pprint.pprint(form.to_dict())
```
There one can see:
```
[...]
{'class': 'ListOffsetArray',
 'content': {'class': 'RecordArray',
             'contents': [
[...]
```
So this is a zipped collection whose record contents include both fields starting with `recojet_antikt4PFlow` and fields starting with `recojet_antikt10UFO`, which of course have different sizes.
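As a toy illustration (stand-in arrays, not the actual branches), zipping two jagged arrays whose per-event list lengths differ fails because awkward cannot broadcast the nested lists against each other:

```python
import awkward as ak

# stand-ins for two jet collections ending up in the same record
antikt4_pt = ak.Array([[10.0, 20.0, 30.0], [40.0]])
antikt10_pt = ak.Array([[100.0], [200.0, 300.0]])

# broadcasting the nested lists fails since their lengths differ event by event
ak.zip({"antikt4_pt": antikt4_pt, "antikt10_pt": antikt10_pt})
# raises a broadcasting error (the lists cannot be matched element by element)
```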
Hi @nikoladze, thanks a lot for the follow-up, that makes sense! In that case I guess I need to update the schema to group the two jet collections into separate arrays, is that right? Would you have an example of how to do this by any chance?
Since this is unrelated to the issue reported here, maybe we can continue the discussion in your GitLab repo; I took the liberty of opening an issue for that.
> It looks like these are actually buggy samples, in the sense that different fields of the same collection don't have the same length in an event. I've seen this before, and it has been explained to me that this can happen due to a mechanism in Athena that attempts to backfill branches only created later in the event loop with empty vectors. ... So this will eventually have to be fixed upstream.
@nikoladze Okay, I think this is enough to say that this isn't a `coffea` problem but a problem with the PHYSLITE files themselves, and that this issue can be closed once we have an upstream issue linked here to track it. If you know where this should get opened (Athena?), would you mind opening an issue, so that all the ATLAS people can make some noise in the relevant meetings to get this addressed?
Yes, this is a known bug. It will essentially never be fully "fixed" in Athena; the best we can do is detect that it happened and then fix the specific instance of the problem inside Athena. I thought we already ran some tests during derivation production that would flag this. Essentially nobody considers this an OK or healthy xAOD file, not just those of us working on columnar analysis.
I'd say open up a ticket in the AMG JIRA for the component "Derivation Framework" (maybe add "Columnar Analysis" as a second component): https://its.cern.ch/jira/projects/ATLASG/issues
**Describe the bug**
I've run into another bug that seems PHYSLITE schema related and occurs somewhat infrequently. Unfortunately I am not aware of suitable public files at the moment to reproduce, so I will point to the information relevant for finding them within ATLAS. If needed we can hopefully find a mechanism to share a relevant file. cc @nikoladze as PHYSLITE schema expert.
**To Reproduce**
The read fails when using the schema; it succeeds with plain uproot. The trace ends in a `TypeError`, with the full trace attached below.
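A minimal sketch of the two read paths (the file name is a placeholder, and I'm assuming the schema in question is coffea's `PHYSLITESchema`):

```python
import uproot
from coffea.nanoevents import NanoEventsFactory, PHYSLITESchema

fname = "DAOD_PHYSLITE.example.pool.root.1"  # placeholder for one of the affected files

# plain uproot read: succeeds
tree = uproot.open({fname: "CollectionTree"})
photons = tree.arrays(filter_name="AnalysisPhotonsAuxDyn.*")

# schema-based read: this is where the TypeError reportedly appears
events = NanoEventsFactory.from_root(
    {fname: "CollectionTree"}, schemaclass=PHYSLITESchema
).events()
# (with dask-based coffea releases the failure may only surface on events.compute())
```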
Examples of files to test with, all are in the container:

With plain uproot, both files work fine. I've also run over many other files and have seen similar `TypeError` exceptions. I have not looked into their origin or tracked down whether the root cause may be similar.

**Expected behavior**
Successful branch reading.
**Output**

**Desktop (please complete the following information):**

**Additional context**
n/a