dask-contrib / dask-awkward

Native Dask collection for awkward arrays, and the library to use it.
https://dask-awkward.readthedocs.io
BSD 3-Clause "New" or "Revised" License
61 stars 19 forks source link

dak.mask optimization not working #283

Closed jrueb closed 1 year ago

jrueb commented 1 year ago

Continuing from #272. There is one remaining issue with the dak.mask optimization.

Optimization of dak.mask when the mask is made using from_awkward will fail. The error is ValueError: mask must have boolean type, not dtype('float64') or some similar ValueError (exact error depends on the type of the input array) and comes from awkward.mask.

To reproduce:

import awkward as ak
import dask_awkward as dak

x = ak.Array({"x": [1., 2., 3.]})
ak.to_parquet(x, "masktest.parquet")
x = dak.from_parquet("masktest.parquet")
y = dak.from_awkward(ak.Array([True, True, True]), 1)
z = dak.mask(x["x"], y)
z.compute()

The cause is the following code https://github.com/dask-contrib/dask-awkward/blob/b765938afc963a4e5b8a50f85e59a39c55dfcf21/src/dask_awkward/lib/core.py#L488-L493

This will set the type of the layer for from_awkward of the mask (which is a bool array, which a specific compatible structure) to the output of mask (which could be much more). When awkward sees this incorrect type of mask, it complains. Remove the above code makes it working.

douglasdavis commented 1 year ago

This code ensures that the blockwise layer has the same metadata as the collection it represents. Unfortunately the order of the layer names in the dask highlevel graph are not necessarily descending in order of compute! We were making an incorrect assumption so the last layer in the collection's task graph had a metadata that was different from the collection's metadata. We need the collection's metadata to be the same as the blockwise layer's metadata for the optimization. Should be fixed in #285