dask / dask-blog

Dask development blog
https://blog.dask.org/

Blogpost on ragged output (with overlapping array chunks) #103

GenevieveBuckley closed 3 years ago

GenevieveBuckley commented 3 years ago

I thought I'd write up some of the key takeaways from discussions we've been having about overlapping array chunks producing ragged outputs.

There didn't appear to be one obviously preferred way to do this, so I'm hoping this post will make the preferred path a little bit clearer for people doing similar work. (Originally I had some other stuff mixed into this post, but I think that muddied the message too much.)

Related discussions:

GenevieveBuckley commented 3 years ago

Since this is a relatively uncontroversial summary of previous discussions, I'm going to go ahead and merge this.

jakirkham commented 3 years ago

cc @jpivarski (in case this is of interest to you and/or your community)

jpivarski commented 3 years ago

Thanks for pointing me to it! This seems to be a "technical raggedness," though—the functions that generate the data return different length outputs, but they are to be logically viewed as a concatenated array. (Looks like a good solution to that, too.)

The HEP community, and presumably others, often have to deal with data whose meaning is ragged: the data collection includes 3 of these, 5 of those, 4 of something else, etc., and they should not be viewed as a concatenated collection because that would lose information about what it is one wants to model. So that's a different topic, and hopefully we'll be talking more about Dask-based solutions for that in the nearish future.

Thanks again for the heads-up!