dask-contrib / dask-awkward

Native Dask collection for awkward arrays, and the library to use it.
https://dask-awkward.readthedocs.io
BSD 3-Clause "New" or "Revised" License

feature request: dask_awkward.bundle, a conditional-touching awkward.zip function #536

Open NJManganelli opened 3 months ago

NJManganelli commented 3 months ago

As discussed in a meeting a few weeks back between @ncsmith, @jpivarski, @lgray, Peter F., etc., Nick S. proposed dask_awkward.bundle as a variation of awkward.zip that does not hard-touch all columns. This would allow users to build traceable forms, and could help simplify or eliminate some of the machinery currently employed to build schemas in e.g. coffea. It would be eminently useful in column-joining, where we want to build a typetracer that has multiple inputs per row (with the schema applied over the union of input fields), run it through an analysis, dispatch only the minimally necessary columns for transformation and joining through external services, and then deliver those columns for the actual computation.

To keep things uniform in awkward, it may also be nice to have an awkward.bundle that dispatches to dask_awkward.bundle for dask arrays or calls awkward.zip for eager arrays (acting as a synonym).

See: https://github.com/CoffeaTeam/coffea/issues/1163
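
To make the intended workflow concrete, here is a hypothetical usage sketch. `dak.bundle` does not exist yet, so its name and signature, along with the field names, are assumptions for illustration only:

```python
import dask_awkward as dak

# Lazy, typetracer-backed collection read from columnar storage.
events = dak.from_parquet("events.parquet")

# Hypothetical: combine fields into one record *without* hard-touching them,
# so the typetracer report only flags the columns the analysis actually uses.
muons = dak.bundle({"pt": events.Muon_pt, "eta": events.Muon_eta})

# Downstream analysis touches only `pt`; ideally only Muon_pt would be
# reported as a necessary column (e.g. via dak.report_necessary_columns,
# exact name depending on the dask-awkward version).
result = muons.pt.sum()
```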

agoose77 commented 3 months ago

Somewhere we had a discussion about ak.zip and the way in which it touches. The practical reason that zip touches is that it broadcasts each column against the others, which may change the type of the array (new dimensions, regular to variable-length, etc.).
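
For a concrete (eager, real-API) illustration of why broadcasting forces every input to be inspected:

```python
import awkward as ak

pt = ak.Array([[1.1, 2.2], [3.3]])   # jagged, per-particle
weight = ak.Array([10.0, 20.0])      # flat, per-event

# ak.zip broadcasts the per-event field into each inner list, so the
# result's type (and whether zipping succeeds at all) depends on the
# structure of *every* input:
pairs = ak.zip({"pt": pt, "weight": weight})
print(pairs.type)       # 2 * var * {pt: float64, weight: float64}
print(pairs.to_list())
# [[{'pt': 1.1, 'weight': 10.0}, {'pt': 2.2, 'weight': 10.0}],
#  [{'pt': 3.3, 'weight': 20.0}]]
```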

The statement that I increasingly find helpful for framing these conversations is that typetracer doesn't guarantee that f(X) succeeds with type[Y]. Instead, it says that f(X) either succeeds with type[Y] or fails.

Given these statements, if we want a non-touching zip, what we actually need is a non-broadcasting zip. In turn, this requires that we kick the "are these columns compatible?" question to runtime and fail there. We also need to be able to describe the structure of the result, which means that if any input shapes are unknown, the result's shape must be unknown too (such that if any of the columns of the record are omitted, the shape is unchanged).

What's not so nice about this function is that it allows code that would fail in an eager path to succeed when run in a delayed fashion, because the error-causing branches may be pruned by optimisation.
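
For context, a non-broadcasting combination is already expressible at the layout level. Below is a minimal eager sketch, assuming the inputs share the same outer length and that deeper compatibility checks are deferred to the point of use; it is an illustration of the idea, not a proposed implementation:

```python
import awkward as ak

def bundle_no_broadcast(fields: dict) -> ak.Array:
    """Wrap same-length layouts in a single RecordArray without broadcasting.

    Unlike ak.zip, the record sits at the top level and no depth alignment is
    attempted, so shape compatibility below the outer dimension is the
    caller's problem (the check is kicked to runtime/use site).
    """
    layouts = [ak.to_layout(v) for v in fields.values()]
    lengths = {len(layout) for layout in layouts}
    if len(lengths) != 1:
        raise ValueError(f"outer lengths differ: {sorted(lengths)}")
    record = ak.contents.RecordArray(layouts, list(fields.keys()))
    return ak.Array(record)

pt = ak.Array([[1.1, 2.2], [3.3]])
charge = ak.Array([[1, -1], [1]])
bundled = bundle_no_broadcast({"pt": pt, "charge": charge})
print(bundled.type)  # 2 * {pt: var * float64, charge: var * int64}
```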

If this is just about schema building, might it be better to keep this in coffea only (as coffea is the only client so far)?

nsmith- commented 2 months ago

I had hoped that the ak.bundle function signature would require the user to specify exactly what can be assumed about the broadcast compatibility of the inputs and the ultimate layout expected of the bundled result, and that this information would be baked into the output dask-awkward array. At runtime, I suppose the same level of checks that happen with ak.from_buffers could be performed, but nothing beyond that, since this is intended to replace what we now do manually with a dak.from_map over ak.from_buffers.
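
A sketch of what such a signature might look like, purely as one reading of the above; none of these names or parameters exist in dask-awkward today:

```python
import awkward as ak
import dask_awkward as dak

def bundle(
    arrays: dict[str, dak.Array],
    *,
    result_form: ak.forms.Form,         # layout the user promises the bundle will have
    assume_broadcastable: bool = True,  # caller-asserted compatibility; not hard-touched
) -> dak.Array:
    """Hypothetical dak.bundle: bake `result_form` into the output's typetracer
    metadata and defer validation to runtime, at the same level of checking as
    ak.from_buffers (buffer lengths/dtypes), but nothing beyond that."""
    ...
```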