**lgray** opened this issue 9 months ago
git bisect says the offending PR in dask is https://github.com/dask/dask/pull/10898 (commit https://github.com/dask/dask/commit/8e10a14cb25c42024efc8f5ea949147d3831e8b1).
Super minimal repro:

```python
import awkward as ak
from dask.base import tokenize

a = ak.Array([1, 2, 3, 4, 5])
layout, report = ak.typetracer.typetracer_with_report(a.layout.form_with_key())
b = ak.Array(layout)

print(report.data_touched, report.shape_touched)
print(tokenize({"what": b}))
print(report.data_touched, report.shape_touched)
```
results in:

```
(coffea-dev-py311) lgray@visitor-122024646 coffea % python form_madness_repr_minimal.py
[] []
002bb30fd71a51fc6fed41d095f54e48
['node0'] ['node0', 'node0']
```
So this is an awkward-array issue, not dask-awkward.
And here is why: https://github.com/dask/dask/blob/main/dask/base.py#L1065. The tokenizer for dicts sorts the dictionary items using `str`.
Even more minimal now:

```python
import awkward as ak

a = ak.Array([1, 2, 3, 4, 5])
layout, report = ak.typetracer.typetracer_with_report(a.layout.form_with_key())
b = ak.Array(layout)

print(report.data_touched, report.shape_touched)

d = {"what": b}
normalized = sorted(d.items(), key=str)  # what dask's dict tokenizer does

print(report.data_touched, report.shape_touched)
```
So there we have it, I believe that is the end of the story.
Let's keep this issue open as a meta-issue to track this regression.
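The mechanism in the reproducer can be modelled with no awkward or dask at all. Below is a toy sketch (`LazyArray` is a hypothetical stand-in, not awkward's typetracer) of the failure mode: `str()` of a `(key, value)` dict item tuple calls `repr()` on its elements, and an impure `repr` records the touch as a side effect:

```python
class LazyArray:
    """Hypothetical stand-in for a typetracer-backed array whose
    stringification records which buffers were 'touched'."""

    def __init__(self, report, key):
        self.report, self.key = report, key

    def __repr__(self):
        # Impure: rendering the placeholder flags its buffer as needed.
        self.report.append(self.key)
        return f"<LazyArray {self.key}>"

touched = []
d = {"what": LazyArray(touched, "node0")}
assert touched == []

# What dask's dict tokenizer does: str() of the item tuple calls
# repr() on the value, mutating the report.
sorted(d.items(), key=str)
assert touched == ["node0"]
```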
It's easy to get sucked into the details here (certainly for me!), so let's just start with a bird's-eye view: the touching happens via `ak.Array.__str__` / `ak.Array.__repr__`, seemingly both through use of `str()` and `pickle.dumps` in `dask.base`. In the latter case, the pickle attempt fails and `str()` is used as the fallback.
Now, we can certainly look to make `ak.Array.__str__` et al. less touching (c.f. scikit-hep/awkward#3019): if we change the touching logic to short-circuit on `NumpyArray.ndim` as in https://github.com/scikit-hep/awkward/issues/3018#issuecomment-1938464388, and make `str()` non-touching, we can reduce the effect that `str()` has on the graph. But it won't remove all cases IIRC, e.g. if `NumpyArray.shape[1:]` is non-empty.
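The proposed short-circuit can be illustrated with a toy model (`NumpyPlaceholder` is illustrative only, not awkward's internals): a 1-D placeholder can be rendered from the form alone, but non-empty inner dimensions force a data touch to render rows:

```python
class NumpyPlaceholder:
    """Hypothetical placeholder for a data-less NumpyArray node."""

    def __init__(self, shape, report):
        self.shape, self.report = shape, report

    @property
    def ndim(self):
        return len(self.shape)

    def __str__(self):
        rendered = " * ".join(map(str, self.shape)) + " * float64"
        if not self.shape[1:]:
            # Short-circuit: 1-D, so stringify from the form alone (no touch).
            return f"<{rendered}>"
        # shape[1:] is non-empty: rendering still requires touching data.
        self.report.append("data")
        return f"<{rendered}>"

touched = []
str(NumpyPlaceholder((5,), touched))
assert touched == []          # 1-D: no touch
str(NumpyPlaceholder((5, 2), touched))
assert touched == ["data"]    # inner dimension: touched
```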
What's really happening here is that we have a singular (?) case of an `ak.Array` being held as state by something that ends up in the task graph, namely the output of `AwkwardInputLayer`'s `prepare_for_projection`. This is being tokenized and hitting the serialisation pathway. I can't tell whether this is the only place we need be worried about (@lgray seems to have pointed the finger at `str` in other contexts), but it is the one I see in running an existing reproducer.

As such, I think the simplest solution is to implement `__dask_tokenize__` for the partial returned by `AwkwardInputLayer.prepare_for_projection`. If we also need to implement a tokenize function for `ak.Array` (I doubt it), we can probably lean on `id`.
> Dask tries to pickle a partial-like class from dask-awkward in order to tokenize it

You can prevent pickle serialization by either adding a `__dask_tokenize__` method to your class or registering a stand-alone function with `@normalize_token.register(YourClass)`:

https://docs.dask.org/en/stable/custom-collections.html#implementing-deterministic-hashing
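The shape of that fix can be sketched without a dask dependency. Here `toy_tokenize` is a simplified model of dask's dispatch order (not dask's real code), and `ProjectionPartial` is a hypothetical stand-in for the partial returned by `AwkwardInputLayer.prepare_for_projection`; the point is that `__dask_tokenize__` short-circuits the pickle (and therefore `str()`) fallback:

```python
import pickle
from hashlib import md5

class ProjectionPartial:
    """Hypothetical stand-in for the dask-awkward projection partial."""

    def __init__(self, form_key):
        self.form_key = form_key

    def __dask_tokenize__(self):
        # Deterministic token from cheap metadata only; never pickles
        # (and therefore never stringifies) the typetracer it wraps.
        return ("ProjectionPartial", self.form_key)

def toy_tokenize(obj):
    # Simplified model of dask's dispatch: honour __dask_tokenize__
    # first, and only fall back to pickling otherwise.
    if hasattr(obj, "__dask_tokenize__"):
        payload = repr(obj.__dask_tokenize__()).encode()
    else:
        payload = pickle.dumps(obj)
    return md5(payload).hexdigest()

# Deterministic across instances, and sensitive to the metadata:
assert toy_tokenize(ProjectionPartial("node0")) == toy_tokenize(ProjectionPartial("node0"))
assert toy_tokenize(ProjectionPartial("node0")) != toy_tokenize(ProjectionPartial("node1"))
```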
```python
d = {"what": b}
normalized = sorted(d.items(), key=str)
```
We can fix this easily.
I'm seeing two different analyses above, and I suspect that only one is correct:

1. @lgray states that the issue is that dask is calling `str()` on the graph values, which is an expensive operation (I would point out it's a very bad idea to have an expensive `__str__`, but that's out of scope). This is fixed by https://github.com/dask/dask/pull/10919.
2. @agoose77 states that `tokenize()` is called on the `SubgraphCallable.dsk`, which in turn pickles the dsk values, which in turn fails and calls `__str__`. This is fixed by adding `__dask_tokenize__` methods or `@normalize_token.register` functions.
On the subject of:

> I would point out it's a very bad idea to have an expensive `__str__`

It's not that `str()` is expensive (at least, not directly); it's that `str()` is not pure, by virtue of our mechanism for buffer optimisation: we have state that is modified through `compute()` that enables us to keep track of which input buffers are actually necessary to perform a computation. We can change those semantics slightly (i.e. https://github.com/scikit-hep/awkward/pull/3019).
@crusaderky both are correct, but the instigating problem is actually the second one: the lack of a `__dask_tokenize__` was causing `str` to get called, because pickling a typetracer within the dask-awkward input layer hit an exception, and `__str__` was flagging all data as data to be read. It's not that `__str__` is expensive; it's that it has unintended side effects that were well reasoned at the time. In general, a package using dask may do anything in `__str__`, so sorting on `str` may lead to unintended side effects.
In our case it was the `AwkwardInputLayer` getting pickled that eventually caused the `__str__` call, but this will also happen if a user uses `dak.from_awkward` and places an awkward-array typetracer in the graph without an input layer, which is what the minimal reproducer checks for. It is also worth noting that (1) happens far less frequently in user code than (2), which is routine.

In the end, both are problems, but the analysis of the failure originally led to solving (1) but not (2). Now we have fixes for both (see https://github.com/dask-contrib/dask-awkward/pull/470).
With dask 2024.2.0, this results in (essentially reading the whole file...):

With dask < 2024.2.0, it results in:
For simple queries this results in multiple orders of magnitude slower performance. This is rather high severity.
@martindurant @agoose77