Open NJManganelli opened 3 months ago
Somewhere we had a discussion about ak.zip
and the way in which it touches. The practical reason that zip
touches is that it broadcasts each column against the others, which may change the type of the array (new dims, reg to var, etc).
The statement that I increasingly find helpful to surface to frame these conversations is that typetracer doesn't provide guarantees that f(X)
succeeds with type type[Y]
. Instead, it says that f(X)
succeeds with type[Y]
or fails.
Due to these statements, if we want a non-touching zip
, that is to say that we actually need a non-broadcasting
zip. In turn, this requires that we kick the "are these columns compatible" question to runtime and fail there. We need to be able to describe the structure of the result, which means that if any shapes are unknown, the result must be (such that if any of the columns of the record are omitted, the shape is unchanged).
What's not so nice about this function is that it allows for code that would fail in an eager path to run in a delayed fashion; the error-causing branches may be pruned by optimisation.
If this is just about schema building, it might be better to do this for coffea-only (as coffea is the only client so far)?
I had hoped that the ak.bundle
function signature would require the user to specify exactly what can be assumed about the broadcast-compatibility and ultimate layout expected of the result of bundling inputs together, and that that would be baked into the output dask-awkward array. At runtime I suppose the same level of checks that we see happen with ak.from_buffers
could happen but nothing beyond that, since this is intended to replace what we now do manually with a dak.from_map
over ak.from_buffers
.
As discussed in a meeting a few weeks back between @ncsmith, @jpivarski @lgray, Peter F., etc., Nick S proposed
dask_awkward.bundle
as a variation ofawkward.zip
which does not hard-touch all columns. This would allow users to build traceable forms, and could help simplify/eliminate some machinery currently employed to build schemas in e.g.coffea.
This would be eminently useful in column-joining, where we want to build a typetracer which has multiple inputs per-row (with schema applied over the union of input fields), run it through an analysis, dispatch only the minimally-necessary columns for transformation and joining through externel services, then deliver those columns for the actual computation.To keep things uniform in
awkward
, it may also be nice to have anawkward.bundle
which dispatches todask_awkward.bundle
or callsawkward.zip
for eager arrays (acting as a synonym)See: https://github.com/CoffeaTeam/coffea/issues/1163