dask-contrib / dask-awkward

Native Dask collection for awkward arrays, and the library to use it.
https://dask-awkward.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Use `partial` rather than custom partial classes #469

Open agoose77 opened 7 months ago

agoose77 commented 7 months ago

Looking at the internals of dask, it seems that features such as `tokenize` have fast-path logic for `partial`. If dask instead encounters instances of custom classes, it has to invoke a serialiser. We should probably just move to `partial` (which will also reduce the SLOC).

martindurant commented 7 months ago

You mean specifically the class functools.partial? Which function classes are you thinking of in particular, e.g. FromParquetFn?

agoose77 commented 7 months ago

@martindurant yes, many of dask-awkward's curried calls to ak.XXX operations are class-based, e.g. in the structure module.
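
As a rough illustration of the kind of change being proposed, here is a class-based curried call next to its `functools.partial` equivalent. The names `take_first` and `TakeFirstFn` are hypothetical stand-ins and are not taken from dask-awkward; this is a sketch of the pattern, not the actual code in the structure module.

```python
from functools import partial


def take_first(values, n=1):
    """Hypothetical stand-in for an ak.XXX operation (illustration only)."""
    return values[:n]


class TakeFirstFn:
    """Class-based currying: store keyword arguments, apply them on call."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def __call__(self, values):
        return take_first(values, **self.kwargs)


# The same curried call, once as a class instance and once as a partial.
class_based = TakeFirstFn(n=2)
partial_based = partial(take_first, n=2)

data = [10, 20, 30]
assert class_based(data) == partial_based(data) == [10, 20]

# A partial exposes the wrapped function and its bound arguments as
# plain attributes, so generic tools can inspect it directly.
print(partial_based.func.__name__, partial_based.args, partial_based.keywords)
```

Both objects behave the same when called; the difference is only in how much custom code is needed and in how a library such as dask can introspect them.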

martindurant commented 7 months ago

Microbenchmark

from functools import partial

kwargs = {"arg": 1}

class Callme:
  def __init__(self, fn, kwargs):
    self.fn = fn
    self.kwargs = kwargs

  def __call__(self, *args, **kwargs):
    return self.fn(*args, **self.kwargs, **kwargs)

one = Callme(sum, kwargs)
two = partial(sum, **kwargs)

In [15]: %timeit dask.base.tokenize(callme.one)
1.66 µs ± 3.49 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [16]: %timeit dask.base.tokenize(callme.two)
1.71 µs ± 2.78 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

martindurant commented 7 months ago

If I make kwargs bigger, I find that the class tokenises faster than partial.

agoose77 commented 7 months ago

That's fascinating! I'm guessing you didn't run the code as above, because callme.two isn't defined?

martindurant commented 7 months ago

The first block was a module called "callme".