dask-contrib / dask-awkward

Native Dask collection for awkward arrays, and the library to use it.
https://dask-awkward.readthedocs.io
BSD 3-Clause "New" or "Revised" License

feat: cache map_partitions ops #475

Closed martindurant closed 8 months ago

martindurant commented 9 months ago

@lgray , this doesn't actually work, but I thought you would appreciate a glimpse of what I had in mind.

lgray commented 9 months ago

Yeah this is more or less what I did for the histograms in the end, so that makes sense. I guess I just don't see how to pull it to the end in the case of getitems.

martindurant commented 9 months ago

> don't see how to pull it to the end

What do you mean?

I am thinking that the with_field case is essentially identical, and instead of queueing a specific set of things (items to get) like this, we can have a small structure of stuff to do, with a couple of specific kinds of operation and, for each, a single method that says how to execute it. Execution happens as soon as we see an operation that doesn't map into the queue (or ._dask/._meta gets accessed).
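To make that concrete, here is a minimal, hypothetical sketch of the queue idea (not dask-awkward's actual implementation; all names are invented):

```python
# Hypothetical sketch: getitem and with_field map into a queue of pending
# operations; accessing the meta (or the graph) flushes the queue.
class LazyOps:
    def __init__(self, data):
        self.data = data      # stand-in for the collection's meta/graph
        self.pending = []     # queued (kind, payload) pairs

    def getitem(self, key):
        self.pending.append(("getitem", key))
        return self

    def with_field(self, name, value):
        # essentially identical to getitem: it just queues a different kind
        self.pending.append(("with_field", (name, value)))
        return self

    def _flush(self):
        # for each kind of queued operation, one method says how to execute it
        for kind, payload in self.pending:
            if kind == "getitem":
                self.data = self.data[payload]
            elif kind == "with_field":
                name, value = payload
                self.data = {**self.data, name: value}
        self.pending = []

    @property
    def meta(self):
        self._flush()         # execution happens as soon as meta is accessed
        return self.data

ops = LazyOps({"a": {"b": [1, 2, 3]}})
ops.getitem("a").getitem("b")   # nothing executed yet, just queued
print(ops.meta)                 # flush -> [1, 2, 3]
```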

lgray commented 9 months ago

Oh - as in - starting from that entry point I don't see how to get it to a functioning implementation because my brain is occupied with other tasks. :-)

lgray commented 9 months ago

I'm sure I could see the whole way through in a more quiet moment. The initial direction makes a lot of sense though.

martindurant commented 9 months ago

A couple of failures here to wrap my head around, perhaps because of mutation somewhere; but here are the timings:

Post (with this PR):

```python
In [1]: import dask_awkward as dak
In [2]: arr = dak.from_lists([{"a": {"b": [1, 2, 3]}}]*5)
In [3]: arr2 = arr.a.b
In [4]: %timeit arr2 = arr.a.b
85.9 µs ± 280 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

Pre (on main):

```python
In [1]: import dask_awkward as dak
In [2]: arr = dak.from_lists([{"a": {"b": [1, 2, 3]}}]*5)
In [3]: arr2 = arr.a.b
In [4]: %timeit arr2 = arr.a.b
215 µs ± 3.12 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

Notes:

lgray commented 9 months ago

Please ping me when this one is in a state to test out!

martindurant commented 9 months ago

You can test it right now; I haven't had a chance to trace the failures, but they're probably subtle edge cases, given how few there are.

lgray commented 9 months ago

Yeah I went ahead and tried it - definitely a noticeable improvement!

It leaves fancy indexing as the last thing that's taking significant time.

lgray commented 9 months ago

However, this PR has some annoying rebasing issues with #477, so I can't compose it all together pleasantly. I can't quite see the full picture of what is left yet.

martindurant commented 9 months ago

@lgray , merge done, back to the same three failures as before

lgray commented 9 months ago

Let's go ahead and wrap this one up as far as tests are concerned. It has a rather beneficial impact.

martindurant commented 8 months ago

OK, so the problem is that the cache also contains the output divisions, which depend on the input divisions at the time of the first call. If those divisions later become known, the result would be different. Interestingly, one of the few failing tests has this at the start:

```python
def test_single_int(daa: dak.Array, caa: ak.Array) -> None:
    daa = dak.copy(daa)
    daa.eager_compute_divisions()
```

because it wants known divisions, but doesn't want to mutate the object held by the fixture.
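A tiny, hypothetical illustration (not dask-awkward code) of the failure mode described above, where the cached value bakes in divisions from first-call time:

```python
# Hypothetical: the cache stores output divisions derived from the input
# divisions as they were at first call; later calls get the stale value.
cache = {}

def cached_output(token, divisions):
    if token not in cache:
        cache[token] = divisions   # baked in at first call
    return cache[token]

divs = (None, None, None)          # divisions unknown at first call
print(cached_output("key", divs))  # -> (None, None, None), now cached

divs = (0, 3, 6)                   # divisions later become known
print(cached_output("key", divs))  # stale: still (None, None, None)
```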

lgray commented 8 months ago

Shouldn't be too hard to keep track of the state of divisions somehow as well?
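One hypothetical way to do that would be to fold the divisions into the cache key, so a change from unknown to known divisions simply misses the cache (tokenize is dask's real helper; the wrapper name is invented):

```python
from dask.base import tokenize

def cache_key(fn, args, divisions):
    # hypothetical: include the current divisions so the key changes when
    # divisions go from unknown to known
    return tokenize(fn, *args, divisions)
```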

martindurant commented 8 months ago

I should have translated: I know what's wrong, I can fix it.

lgray commented 8 months ago

Ah, notes rather than discussion, gotcha. No problem, and cool!

codecov-commenter commented 8 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 93.16%. Comparing base (8cb8994) to head (9e39611). Report is 29 commits behind head on main.


Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main     #475      +/-   ##
==========================================
+ Coverage   93.06%   93.16%   +0.09%
==========================================
  Files          23       23
  Lines        3290     3322      +32
==========================================
+ Hits         3062     3095      +33
+ Misses        228      227       -1
```


martindurant commented 8 months ago

OK, that fixes it. @lgray, maybe a speed test would be nice. I got around the former problem by simply not caching the divisions, as I don't think that part was contributing much compared to making the meta and layers. I could instead make the cache more complex, if necessary.

Perhaps I ought to write a test looking at the contents of the cache? I'm not sure we need that if everything passes and the timings are clearly faster.
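For reference, a hypothetical sketch of the shape of the fix described above (invented names, not the actual code): cache the expensive meta/layer construction, and recompute divisions on every call so they always reflect the current input.

```python
cache = {}

def cached_map_partitions(token, fn, input_divisions):
    if token not in cache:
        cache[token] = make_meta_and_layer(fn)  # expensive; safe to cache
    meta, layer = cache[token]
    return meta, layer, derive_divisions(input_divisions)  # cheap; never cached

def make_meta_and_layer(fn):
    # stand-in for the costly meta and layer construction
    return f"meta({fn.__name__})", f"layer({fn.__name__})"

def derive_divisions(input_divisions):
    # stand-in: real logic depends on the operation performed
    return input_divisions
```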

lgray commented 8 months ago

Yes, with the awkward _new patch + multifill + this, we're getting all the speedup we've seen thus far.

lgray commented 8 months ago

One nitpick here: `tokenize(fn, *args, meta is not None and meta.typestr, **kwargs)`

appears to be noticeable only because of the meta.typestr call (OK, it's half a second, but that's not small when we are down to ~4 seconds). In particular, str(self.type) over in awkward, which this calls, is costly when called repeatedly.

It would be a good saving if we could get around that; it would bring our test case below 4 seconds of eval time.

lgray commented 8 months ago

The remaining place that may give us some time back after all these improvements appears to be:

[profiler screenshot: remaining time concentrated in typetracer `__getitem__` calls]

I'm really not sure why `__getitem__` should take so much time for a typetracer.

@agoose77 @jpivarski FYI - more things to think on.

martindurant commented 8 months ago

I thought meta.typestr was faster than str(meta), which is what the previous version did. It sounds like it doesn't matter. So, question: should map_partitions be expected to produce an identical result whether or not meta is provided? If yes, meta doesn't need to be in this tokenize call at all. If no, then it does, and I don't know a faster way to get a unique identifier for it.

Also, I reckon output_divisions probably should have been in the tokenize, since it does change the nature of the layer produced.
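Putting those two points together, a hedged sketch of the adjusted cache key (tokenize is dask's real helper; the wrapper is hypothetical):

```python
from dask.base import tokenize

def cache_key(fn, args, kwargs, output_divisions):
    # before (costly): tokenize(fn, *args, meta is not None and meta.typestr, **kwargs)
    # after: meta omitted (same inputs imply the same meta), output_divisions
    # included since it changes the layer that gets produced
    return tokenize(fn, *args, output_divisions, **kwargs)
```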

lgray commented 8 months ago

For building the layers themselves it doesn't matter. But I'd like to ruminate on it for a bit.

lgray commented 8 months ago

Yeah I think my position is as follows:

Therefore I agree with not tokenizing the meta.

martindurant commented 8 months ago

Done. This is final once green, unless we can think of some test that might help.

lgray commented 8 months ago

I'm trying to think of situations where we could confuse the user with this mechanism, but so long as the user doesn't try to manually overwrite keys, I think it's fine.

Furthermore if a user is trying to manually overwrite keys they'll probably have found the cache in the first place and can manipulate it as they need to.

martindurant commented 8 months ago

> Furthermore if a user is trying to manually overwrite keys they'll probably have found the cache in the first place and can manipulate it as they need to.

Agreed, I think the mapping of collection name to graph/meta is natural. I don't even think there's any particular documentation that should go with this, except that maybe the cache size should be configurable? That doesn't need to happen yet.
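If the size were to become configurable, one hypothetical shape (the config key is invented, not an actual dask-awkward option) would be a small LRU keyed off dask's config:

```python
from collections import OrderedDict

import dask.config

class LRUCache(OrderedDict):
    def __init__(self, maxsize):
        super().__init__()
        self.maxsize = maxsize

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.move_to_end(key)
        if len(self) > self.maxsize:
            self.popitem(last=False)   # evict the oldest entry

# hypothetical config key with a fallback default
cache = LRUCache(dask.config.get("awkward.ops-cache-size", default=128))
```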

lgray commented 8 months ago

I'd move for going ahead and merging this today and getting a release out; then a bunch of wheels can turn on the coffea side of things.

lgray commented 8 months ago

@martindurant can I go ahead and merge/release? You tend to do the honors on these PRs, but I'm happy to turn the cranks.

martindurant commented 8 months ago

Go ahead