CoffeaTeam / coffea

Basic tools and wrappers for enabling not-too-alien syntax when running columnar Collider HEP analysis.
https://coffeateam.github.io/coffea/
BSD 3-Clause "New" or "Revised" License
128 stars 126 forks source link

nanoevents/methods/physlite needs to distinguish between `dak.Array` and `dak.Array._meta` #1075

Closed jpivarski closed 3 weeks ago

jpivarski commented 4 months ago

This is following up from scikit-hep/awkward#3077.

In

https://github.com/CoffeaTeam/coffea/blob/750b96dc53ad8b80480181e67e0f898ccf2b6d4d/src/coffea/nanoevents/methods/physlite.py#L93-L122

load_column can have type dask_awkward.Array past the first if-guard if the event_index is not also dask_awkward.Array. Then some things are done with it that shouldn't be done with a dask_awkward.Array.

The first is ak.backend, a top-level Awkward function that hasn't had a dask-awkward overload (until PR dask-contrib/dask-awkward#498) because it would always return the same thing, "typetracer". But useless functions for the sake of conformity are worthwhile, so I'm in favor of adding dak.backend.

The next is ak.typetracer.touch_data, which is not a high-level function—not the sort of thing that we should be overriding with dask-awkward. It's a public function so that libraries that depend on Awkward (like Coffea) can use them, but it's not intended for data analysts, and that's where dask-awkward's coverage of the Awkward interface should stop. To avoid a confusing downstream error message, I've added PR scikit-hep/awkward#3079 to make Awkward complain about passing a dak.Array into ak.typetracer.touch_data, which means that the issue still has to be fixed.

The Coffea code should detect whether load_column is a dak.Array, and if so, pass load_column._meta into these hidden-but-public Awkward functions. Where, exactly, you want to do that depends on the logic of this function; I'm not going to suggest a change. I noticed that on line 115, you're again assuming that load_column is not a dak.Array (which doesn't have a layout), so someone knowledgeable about it, such as @nikoladze (the original author), should take a look.

matthewfeickert commented 4 months ago

I noticed that on line 115, you're again assuming that load_column is not a dak.Array (which doesn't have a layout), so someone knowledgeable about it, such as @nikoladze (the original author), should take a look.

Thanks @jpivarski! This is great. I think realistically though that @nikoladze is low on time to contribute patches, so I think this is going to be a joint effort from probably me, with @ekourlit, @alexander-held, and @kyungeonchoi giving advice (maybe @kratsg too). ATLAS still needs to find someone from the ATLAS Analysis Model Group (AMG) to be able to maintain the PHYSLITE code, but for the timelines that this needs to be dealt with, I'm going to take responsibility for this. Though I am not a good candidate to be that maintainer given my other project responsibilities.

nsmith- commented 1 month ago

This is just a note that https://github.com/CoffeaTeam/coffea/blob/e06c4b84d0a641ab569ae7c16fecc39fe74c9743/src/coffea/nanoevents/methods/base.py#L208-L241 are the relevant section of the NanoEvents code for NanoAOD cross-references that were ported to dask-awkward. This may be useful for @SamKelson who is looking into this.