CoffeaTeam / coffea

Basic tools and wrappers for enabling not-too-alien syntax when running columnar Collider HEP analysis.
https://coffea-hep.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Size of array is less than size of form with PHYSLITE schema #1083

Open alexander-held opened 7 months ago

alexander-held commented 7 months ago

Describe the bug

I've run into another bug that seems to be related to the PHYSLITE schema and occurs somewhat infrequently. Unfortunately, I am not aware of suitable public files to reproduce it at the moment, so I will point to the information needed to find them within ATLAS. If needed, we can hopefully find a mechanism to share a relevant file. cc @nikoladze as PHYSLITE schema expert.

To Reproduce

from coffea.nanoevents import NanoEventsFactory, PHYSLITESchema
import uproot

fname = "DAOD_PHYSLITE.37021106._000087.pool.root.1"
treename = "CollectionTree"

# with PHYSLITE schema
events = NanoEventsFactory.from_root({fname: treename}, schemaclass=PHYSLITESchema).events()
events.Muons.topoetcone20_CloseByCorr.compute()  # this fails

# plain uproot
f = uproot.open({fname: treename})
f["AnalysisMuonsAuxDyn.topoetcone20_CloseByCorr"].array()  # this succeeds

The read fails when using the schema but succeeds with plain uproot. The trace ends in

TypeError: size of array (61994) is less than size of form (61999)

This error occurred while calling

    ak.from_buffers(
        RecordForm-instance
        172020
        {'/data/xTrigDecisionAux.%2FxTrigDecisionAux.smk%2C%21load': <awkward...
        behavior = {'Systematic': <class 'coffea.nanoevents.methods.base.Syst...
        buffer_key = partial-instance
    )

with the full trace attached below.

Examples of files to test with, all are in the

data18_13TeV-data18_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp18_v01_p6026

container:

DAOD_PHYSLITE.37020635._000006.pool.root.1  # no crash in coffea with this file
DAOD_PHYSLITE.37021106._000087.pool.root.1  # leads to crash in coffea

With plain uproot, both files work fine. I've also run over many other files and have seen similar TypeError exceptions. I have not looked into their origin or tracked down whether the root cause is similar.

Expected behavior

Successful branch reading.

Output

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File /venv/lib/python3.9/site-packages/awkward/_dispatch.py:39, in dispatch()
     38 with OperationErrorContext(name, args, kwargs):
---> 39     gen_or_result = func(*args, **kwargs)
     40     if isgenerator(gen_or_result):

File /venv/lib/python3.9/site-packages/awkward/operations/ak_from_buffers.py:103, in from_buffers()
     39 """
     40 Args:
     41     form (#ak.forms.Form or str/dict equivalent): The form of the Awkward
   (...)
    101 See #ak.to_buffers for examples.
    102 """
--> 103 return _impl(
    104     form,
    105     length,
    106     container,
    107     buffer_key,
    108     backend,
    109     byteorder,
    110     highlevel,
    111     behavior,
    112     attrs,
    113     allow_noncanonical_form,
    114 )

File /venv/lib/python3.9/site-packages/awkward/operations/ak_from_buffers.py:149, in _impl()
    147 getkey = regularize_buffer_key(buffer_key)
--> 149 out = _reconstitute(form, length, container, getkey, backend, byteorder, simplify)
    151 return wrap_layout(out, highlevel=highlevel, attrs=attrs, behavior=behavior)

File /venv/lib/python3.9/site-packages/awkward/operations/ak_from_buffers.py:403, in _reconstitute()
    402 elif isinstance(form, ak.forms.RecordForm):
--> 403     contents = [
    404         _reconstitute(
    405             content, length, container, getkey, backend, byteorder, simplify
    406         )
    407         for content in form.contents
    408     ]
    409     return ak.contents.RecordArray(
    410         contents,
    411         None if form.is_tuple else form.fields,
    412         length,
    413         parameters=form._parameters,
    414     )

File /venv/lib/python3.9/site-packages/awkward/operations/ak_from_buffers.py:404, in <listcomp>()
    402 elif isinstance(form, ak.forms.RecordForm):
    403     contents = [
--> 404         _reconstitute(
    405             content, length, container, getkey, backend, byteorder, simplify
    406         )
    407         for content in form.contents
    408     ]
    409     return ak.contents.RecordArray(
    410         contents,
    411         None if form.is_tuple else form.fields,
    412         length,
    413         parameters=form._parameters,
    414     )

File /venv/lib/python3.9/site-packages/awkward/operations/ak_from_buffers.py:381, in _reconstitute()
    380     next_length = 0 if len(offsets) == 1 else offsets[-1]
--> 381 content = _reconstitute(
    382     form.content, next_length, container, getkey, backend, byteorder, simplify
    383 )
    384 return ak.contents.ListOffsetArray(
    385     ak.index.Index(offsets),
    386     content,
    387     parameters=form._parameters,
    388 )

File /venv/lib/python3.9/site-packages/awkward/operations/ak_from_buffers.py:403, in _reconstitute()
    402 elif isinstance(form, ak.forms.RecordForm):
--> 403     contents = [
    404         _reconstitute(
    405             content, length, container, getkey, backend, byteorder, simplify
    406         )
    407         for content in form.contents
    408     ]
    409     return ak.contents.RecordArray(
    410         contents,
    411         None if form.is_tuple else form.fields,
    412         length,
    413         parameters=form._parameters,
    414     )

File /venv/lib/python3.9/site-packages/awkward/operations/ak_from_buffers.py:404, in <listcomp>()
    402 elif isinstance(form, ak.forms.RecordForm):
    403     contents = [
--> 404         _reconstitute(
    405             content, length, container, getkey, backend, byteorder, simplify
    406         )
    407         for content in form.contents
    408     ]
    409     return ak.contents.RecordArray(
    410         contents,
    411         None if form.is_tuple else form.fields,
    412         length,
    413         parameters=form._parameters,
    414     )

File /venv/lib/python3.9/site-packages/awkward/operations/ak_from_buffers.py:197, in _reconstitute()
    196 real_length = length * math.prod(form.inner_shape)
--> 197 data = _from_buffer(
    198     backend.nplike,
    199     raw_array,
    200     dtype=dtype,
    201     count=real_length,
    202     byteorder=byteorder,
    203 )
    204 if form.inner_shape != ():

File /venv/lib/python3.9/site-packages/awkward/operations/ak_from_buffers.py:174, in _from_buffer()
    173 if array.size < count:
--> 174     raise TypeError(
    175         f"size of array ({array.size}) is less than size of form ({count})"
    176     )
    178 return array[:count]

TypeError: size of array (61994) is less than size of form (61999)

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[73], line 7
      5 # with PHYSLITE schema
      6 events = NanoEventsFactory.from_root({fname: treename}, schemaclass=PHYSLITESchema).events()
----> 7 events.Muons.topoetcone20_CloseByCorr.compute()

File /venv/lib/python3.9/site-packages/dask/base.py:375, in DaskMethodsMixin.compute(self, **kwargs)
    351 def compute(self, **kwargs):
    352     """Compute this dask collection
    353 
    354     This turns a lazy Dask collection into its in-memory equivalent.
   (...)
    373     dask.compute
    374     """
--> 375     (result,) = compute(self, traverse=False, **kwargs)
    376     return result

File /venv/lib/python3.9/site-packages/dask/base.py:661, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    658     postcomputes.append(x.__dask_postcompute__())
    660 with shorten_traceback():
--> 661     results = schedule(dsk, keys, **kwargs)
    663 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File /venv/lib/python3.9/site-packages/uproot/_dask.py:1343, in __call__()
   1329     except self.allowed_exceptions as err:
   1330         return (
   1331             self.mock_empty(backend="cpu"),
   1332             _report_failure(
   (...)
   1340             ),
   1341         )
-> 1343 result, _ = self._call_impl(
   1344     file_path, object_path, i_step_or_start, n_steps_or_stop, is_chunk
   1345 )
   1346 return result

File /venv/lib/python3.9/site-packages/uproot/_dask.py:1296, in _call_impl()
   1290     start, stop = min((i_step_or_start * events_per_step), num_entries), min(
   1291         (i_step_or_start + 1) * events_per_step, num_entries
   1292     )
   1294 assert start <= stop
-> 1296 return self.read_tree(
   1297     ttree,
   1298     start,
   1299     stop,
   1300 )

File /venv/lib/python3.9/site-packages/uproot/_dask.py:1017, in read_tree()
   1009     # Otherwise, introduce a placeholder
   1010     else:
   1011         container[buffer_key] = awkward.typetracer.PlaceholderArray(
   1012             nplike=nplike,
   1013             shape=(awkward.typetracer.unknown_length,),
   1014             dtype=dtype,
   1015         )
-> 1017 out = awkward.from_buffers(
   1018     self.expected_form,
   1019     stop - start,
   1020     container,
   1021     behavior=self.form_mapping_info.behavior,
   1022     buffer_key=self.form_mapping_info.buffer_key,
   1023 )
   1024 assert tree.source  # we must be reading something here
   1025 return out, tree.source.performance_counters

File /venv/lib/python3.9/site-packages/awkward/_dispatch.py:70, in dispatch()
     65     else:
     66         raise AssertionError(
     67             "high-level functions should only implement a single yield statement"
     68         )
---> 70 return gen_or_result

File /venv/lib/python3.9/site-packages/awkward/_errors.py:85, in __exit__()
     78 try:
     79     # Handle caught exception
     80     if (
     81         exception_type is not None
     82         and issubclass(exception_type, Exception)
     83         and self.primary() is self
     84     ):
---> 85         self.handle_exception(exception_type, exception_value)
     86 finally:
     87     # Step out of the way so that another ErrorContext can become primary.
     88     if self.primary() is self:

File /venv/lib/python3.9/site-packages/awkward/_errors.py:95, in handle_exception()
     93     self.decorate_exception(cls, exception)
     94 else:
---> 95     raise self.decorate_exception(cls, exception)

TypeError: size of array (61994) is less than size of form (61999)

This error occurred while calling

    ak.from_buffers(
        RecordForm-instance
        172020
        {'/data/xTrigDecisionAux.%2FxTrigDecisionAux.smk%2C%21load': <awkward...
        behavior = {'Systematic': <class 'coffea.nanoevents.methods.base.Syst...
        buffer_key = partial-instance
    )

Desktop (please complete the following information):

awkward: 2.6.2
dask-awkward: 2024.3.0
uproot: 5.3.2
coffea: 2024.3.0

Additional context

n/a

alexander-held commented 6 months ago

Some more examples that might be useful when debugging: AnalysisPhotonsAuxDyn.ptcone20_CloseByCorr and AnalysisPhotonsAuxDyn.topoetcone40_CloseByCorr, both in root://192.170.240.143:1094//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data18_13TeV/1f/87/DAOD_PHYSLITE.37020635._000031.pool.root.1.

ponyisi commented 5 months ago

Don't know if this is relevant here, but we've seen the same issue (with other files) when the number of entries in the tree differs from the actual length of the first dimension of a branch.

nikoladze commented 2 months ago

It looks like these are actually buggy samples, in the sense that different fields of the same collection don't have the same length within an event. I've seen this before, and it has been explained to me that this can happen due to a mechanism in Athena that attempts to backfill branches that are only created later in the event loop with empty vectors. If, due to a bug in the code, a branch is not filled somewhere, this mechanism can produce wrong length-0 vectors for certain branches.
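To illustrate how a single wrongly empty vector turns into the `size of array ... is less than size of form` error, here is a minimal sketch in plain Python. It assumes, as the `_reconstitute` and `_from_buffer` frames in the trace suggest, that every field of a zipped collection is read against one shared offsets buffer and must provide at least `offsets[-1]` flat values. The names `required_length` and `check_field` and the toy numbers are hypothetical; this is a simplified model, not awkward's implementation.

```python
# Simplified model of a zipped jagged collection: one shared offsets list
# plus one flat content buffer per field (an assumption for illustration).

def required_length(offsets):
    """Number of flat values the shared offsets expect from each field."""
    return 0 if len(offsets) == 1 else offsets[-1]


def check_field(offsets, content):
    """Raise a TypeError like the one in the trace if content is too short."""
    count = required_length(offsets)
    if len(content) < count:
        raise TypeError(
            f"size of array ({len(content)}) is less than size of form ({count})"
        )
    return content[:count]


# Three events with 2, 0 and 3 objects according to the reference field.
offsets = [0, 2, 2, 5]
pt = [10.0, 20.0, 30.0, 40.0, 50.0]  # consistent: 5 flat values
iso = [0.1, 0.2]                     # last event wrongly stored as empty: 2 flat values

check_field(offsets, pt)  # passes
try:
    check_field(offsets, iso)
except TypeError as err:
    message = str(err)
    print(message)
```

Reading each branch on its own, as plain uproot does, never compares the flat buffer with another field's offsets, which would be consistent with the plain read succeeding on the same file.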

Checking if this is happening in one of the example files:

import uproot
import awkward as ak
import numpy as np

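def check_collection(tree, collection_name, ref_name):
    """Print every field of a collection whose per-event counts differ from a reference branch."""
    keys = [k for k in tree.keys() if k.startswith(collection_name)]
    arrays = tree.arrays(keys)
    ref = tree[ref_name].array()
    for array, field in zip(ak.unzip(arrays), arrays.fields):
        # skip branches that are read as records rather than plain lists of values
        if array.fields:
            continue
        # reduce the branch name to the bare field name, e.g. "pt"
        if "/" in field:
            field = field.split("/")[1]
        field = field.split(".", maxsplit=1)[1]
        # compare per-event object counts against the reference branch
        different_num = ak.num(ref) != ak.num(array)
        if ak.any(different_num):
            print(
                f"Different number of entries for {field}: "
                f"{ak.num(array)[different_num]} vs {ak.num(ref)[different_num]} in ref, "
                f"at entries {np.where(different_num)[0].tolist()}"
            )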

treename = "CollectionTree"
fname = "root://192.170.240.143:1094//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/data18_13TeV/1f/87/DAOD_PHYSLITE.37020635._000031.pool.root.1"

tree = uproot.open({fname: treename})
check_collection(tree, "AnalysisPhotonsAuxDyn", "AnalysisPhotonsAuxDyn.pt")

gives

Different number of entries for topoetcone20_CloseByCorr: [0] vs [2] in ref, at entries [58318]
Different number of entries for ptcone20_CloseByCorr: [0] vs [2] in ref, at entries [58318]
Different number of entries for topoetcone40_CloseByCorr: [0] vs [2] in ref, at entries [58318]

So this will eventually have to be fixed upstream. Of course, we can't easily fix it in PHYSLITE files that have already been produced. Currently I don't have great ideas for a workaround, since we can't zip arrays whose lists have different lengths. We could fill the empty lists with None values (using masked arrays) or arbitrary values like -999 or NaN, but that would need to happen at the level where the arrays are read. Maybe one could put in something using the coffea nanoevents transforms, but it would make everything a bit ugly, since every form key evaluation would then also need to process the offset array of a reference branch to figure out to what length the lists should actually be filled ...
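The padding idea can be sketched on plain nested lists. This is only an illustration of the logic under stated assumptions, not a coffea or awkward implementation: `pad_to_reference` is a hypothetical helper, the per-event counts are assumed to come from a reference branch such as `pt`, and only wrongly empty lists are padded, because for any other mismatch there is no way to tell which objects are missing.

```python
def pad_to_reference(ref_counts, values, fill=None):
    """Pad wrongly empty per-event lists up to the reference counts."""
    padded = []
    for i, (n_ref, event_values) in enumerate(zip(ref_counts, values)):
        if len(event_values) == n_ref:
            # consistent event: keep the values as they are
            padded.append(list(event_values))
        elif len(event_values) == 0:
            # wrongly empty list: fill with placeholders (None, -999, NaN, ...)
            padded.append([fill] * n_ref)
        else:
            raise ValueError(
                f"event {i}: {len(event_values)} values vs {n_ref} in reference"
            )
    return padded


ref_counts = [1, 2, 0]   # per-event counts from the reference branch
iso = [[0.5], [], []]    # second event wrongly empty, third genuinely empty
fixed = pad_to_reference(ref_counts, iso)
print(fixed)  # [[0.5], [None, None], []]
```

The sketch also shows the cost mentioned above: every field needs the counts of a reference branch before it can be padded.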

sebastien-rettie commented 2 months ago

Hello, @alexander-held suggested I post here in case it can be useful, since I'm seeing similar behaviour (i.e. sporadic errors of the form below that are not entirely reproducible). I'm running on an internal ATLAS file format (not PHYSLITE) and don't see any issues running something like @nikoladze's script on it. I'm happy to share any additional details or files of course.

Script to reproduce the error (you might need to run python debug_forms.py 5-10 times to get the error for sure): https://gitlab.cern.ch/atlas-physics/HDBS/DiHiggs/bbbb/bbbbarista/-/blob/boosted_dev-srettie/debug_forms.py

Schema used and preprocessing file for completeness:

End of error stack trace:

TypeError: size of array (27102) is less than size of form (59922)

This error occurred while calling

    ak.from_buffers(
        RecordForm-instance
        10998
        {'/data/eventNumber%2C%21load': array([2053010529, 2052965881, 205410...
        behavior = {'Systematic': <class 'coffea.nanoevents.methods.base.Syst...
        buffer_key = partial-instance
    )

nikoladze commented 2 months ago

@sebastien-rettie it seems you are trying to zip together the recojet_antikt10UFO and the recojet_antikt4PFlow branches; they should probably be separate. I tried the following:

from coffea.nanoevents import NanoEventsFactory
from schema import NtupleSchema

events = NanoEventsFactory.from_root({"user.caiyi.40860313._002582.output-tree.root": "AnalysisMiniTree"}, schemaclass=NtupleSchema).events()

events.compute()

This raises a similar exception. If I go into the debugger and step up until I hit from_buffers, I can inspect the global form you created:

import pprint
pprint.pprint(form.to_dict())

There one can see:

[...]
              {'class': 'ListOffsetArray',
               'content': {'class': 'RecordArray',
                           'contents': [
[...]

So this is a zipped collection whose contents include fields starting with both recojet_antikt4PFlow and recojet_antikt10UFO, which of course have different sizes.
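One way to avoid mixing the two collections is to group branch names by collection prefix before anything is zipped, so that each prefix ends up in its own record. The sketch below shows only that grouping step in plain Python. `group_branches` is a hypothetical helper, the field suffixes (`_pt`, `_eta`, `_m`) are made up for illustration, and how the groups are then wired into a coffea schema is not shown here.

```python
from collections import defaultdict


def group_branches(branch_names, prefixes):
    """Map each collection prefix to the branches that belong to it."""
    groups = defaultdict(list)
    # try longer prefixes first so overlapping prefixes resolve to the most specific one
    ordered = sorted(prefixes, key=len, reverse=True)
    for name in branch_names:
        for prefix in ordered:
            if name.startswith(prefix):
                groups[prefix].append(name)
                break
        else:
            # no prefix matched: treat as an event-level branch
            groups[""].append(name)
    return dict(groups)


branches = [
    "eventNumber",
    "recojet_antikt4PFlow_pt",   # hypothetical field suffixes
    "recojet_antikt4PFlow_eta",
    "recojet_antikt10UFO_pt",
    "recojet_antikt10UFO_m",
]
groups = group_branches(branches, ["recojet_antikt4PFlow", "recojet_antikt10UFO"])
for prefix, names in groups.items():
    print(repr(prefix), names)
```

Each non-empty prefix then corresponds to one jagged collection with its own offsets, so fields with different per-event counts are never zipped together.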

sebastien-rettie commented 2 months ago

Hi @nikoladze, thanks a lot for the follow-up, that makes sense! In that case I guess I need to update the schema to group the two jet collections into separate arrays, is that right? Would you have an example of how to do this by any chance?

nikoladze commented 2 months ago

> Hi @nikoladze, thanks a lot for the follow-up, that makes sense! In that case I guess I need to update the schema to group the two jet collections into separate arrays, is that right? Would you have an example of how to do this by any chance?

Since this is unrelated to the issue reported here, maybe we can continue the discussion in your GitLab repo. I took the liberty of opening an issue for that.

matthewfeickert commented 2 months ago

> It looks like these are actually buggy samples in the sense that different fields of the same collection don't have the same length in an event. I've seen this before and it has been explained to me this can happen due to a mechanism in athena that attempts to backfill branches only created later in the event loop with empty vectors. ... So this will eventually have to be fixed upstream.

@nikoladze Okay, I think this is enough to say that this isn't a coffea problem but is a problem with the PHYSLITE files themselves and that this Issue can get closed after we have an upstream issue linked here to track. If you know where this should get opened (Athena?) would you mind opening up an Issue, and then all the ATLAS people can make some noise in the relevant meetings to get this addressed?

nilserik78 commented 2 months ago

Yes, this is a known bug. It will essentially never be fully "fixed" in Athena; the best we can do is detect that it happened and then fix the specific instance of this problem inside Athena. And I thought we already ran some tests during derivation production that would flag this. Essentially nobody considers this an OK or healthy xAOD file, not just those of us working on columnar analysis.

I'd say open up a ticket in the AMG JIRA for the component "Derivation Framework" (maybe add "Columnar Analysis" as a second component): https://its.cern.ch/jira/projects/ATLASG/issues