dask-contrib / dask-awkward

Native Dask collection for awkward arrays, and the library to use it.
https://dask-awkward.readthedocs.io
BSD 3-Clause "New" or "Revised" License
61 stars 19 forks source link

wip: fix axis=0 concatenation #457

Open agoose77 opened 10 months ago

agoose77 commented 10 months ago

Fixes #456

It's clear from #456 that the complexity around axis=0 concatenation makes it hard to reason about!

The bug is caused by the loss of report objects associated with secondary arrays in the concatenate call; right now we take the first array (because all arrays are coerced to the same form) as the "meta".

To fix this, we need to build a "parent" meta such that parent.touch_data("key-1") invokes child1.touch_data("key-1-child"), child2.touch_data("key-2-child"), and so on.

I'll try and clean this up, probably with an Awkward fix, in the near future.

codecov-commenter commented 10 months ago

Codecov Report

Attention: 17 lines in your changes are missing coverage. Please review.

Comparison is base (d3b6208) 93.10% compared to head (c62796d) 92.72%.

Files Patch % Lines
src/dask_awkward/lib/operations.py 73.01% 17 Missing :warning:

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #457 +/- ## ========================================== - Coverage 93.10% 92.72% -0.38% ========================================== Files 23 23 Lines 3293 3355 +62 ========================================== + Hits 3066 3111 +45 - Misses 227 244 +17 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

jpivarski commented 8 months ago

Is this affected, or even made easier, by scikit-hep/awkward#3043 (and subsequent changes to dask-awkward to defensively copy reports instead of changing them in place throughout a DAG)?

Once the reports become effectively immutable from dask-awkward's point of view—that is, each DAG node is associated with an unchanging report—then we won't need to have a hierarchical structure, nesting reports, since we could make the output of dak.concatenate propagate all the reports from its inputs.

I came to this from #456, since @gordonwatts hit the same issue again and I was checking up on it.

agoose77 commented 8 months ago

I think this will become slightly simpler, but much of the logic will remain. The feature required by result = ak.concatenate((part1, part2, ..., partN)) is that touching result.pt.data also touches part1.pt.data, part2.pt.data, ..., partN.pt.data. I think we'll need some kind of hierarchy. We could encode that as a hierarchical report or with a parentage attribute in the report.