Speculative: should we return `chunk`s of the object instead of chunks of indices?

JuliaFolds2 / ChunkSplitters.jl

Splitting collections into chunks

https://juliafolds2.github.io/ChunkSplitters.jl/

MIT License

50 stars 7 forks source link

Speculative: should we return `chunk`s of the object instead of chunks of indices? #25

Closed MasonProtter closed 1 month ago

MasonProtter commented 9 months ago

I feel like chunks is a bit of a misnomer. If I do [x for x in chunks([:a, :b, :c, :d, :e]; n=2)], it seems much more natural and also more useful to me that the result would be [[:a, :b, :c], [:d, :e]] rather than [1:1:3, 4:1:5].

This way instead of writing

for idx in chunks(A; n)
    f(@view A[idx])
end

we could just simply write

for chunk in chunks(A; n)
    f(chunk)
end

The current behaviour could then be emulated by doing

for idx in chunks(eachindex(A); n)
    f(@view A[idx])
end

Any thoughts on this? @lmiq @carstenbauer ? I think doing it like this would potentially make it easier to do things like support Dict and Dictionary or other more exotic containers.

carstenbauer commented 9 months ago

I agree that it's a bit of a misnomer. I kind of like that it returns indices though. It's cheap and well defined. Maybe we should have both variants.

About returning chunks of data (i.e. elements), what would typeof(chunk) be for generic A? E.g. what if A is a CuArray? Also a CuArray?

MasonProtter commented 9 months ago

I think that should be up to the implementer, but the default should be a view (so that it's just as cheap as the current version)

We could also have chunk_indices or index_chunks or something like that for the current behaviour

lmiq commented 9 months ago

I agree that the name is not very good, and appears to signify the data of the chunk, not its indices. I have actually implemented chunk_indices at some point, but took it back. Independently of the name, what I think is that the option that returns the indices is the easiest for the user, because otherwise, in

for chunk in chunks(A; n)
    f(chunk)
end

f is a function that operates on a collection, not in one element of the collection. My impression (and my own use of it) generally is that one has a function that operates on an element, and we want to apply it in parallel in a collection of such elements. Returning the indices makes everything very explicit for the user (and less prone to introducing allocations, instabilities, method errors).

That's why, after experimenting with some names, I ended up not changing the simplest name chunks to return something other than the indices, but I agree that the name is not ideal.

At the same time, except for the breaking aspect of the changes, I wouldn't mind that we had both behaviors in different functions (or just using eachindex).

MasonProtter commented 9 months ago

I guess from my point of view, f is typically map or reduce or whatever, so I do think of it as operating on the (sub) collection.

lmiq commented 9 months ago

I do not disagree. I think the name is not good.

Also I think that most users of this package will end up using OhMyThreads instead, so in the long run having chunks(eachindex(),n) and chunk_indices here will be ok.

I'm just afraid of releasing another breaking change. From what I've checked none of the dependents (which are only ~10 by now) updated to the new interface.

longemen3000 commented 9 months ago

a non-breaking change could be done by adding chunk_indices, and then mass-PR the packages? (i can help with that)

lmiq commented 9 months ago

We have to mass-PR packages for deprecating the old syntax anyway, so that's one alternative before releasing 3.0.

There are ~10 packages that depend on it, that's simple. I'm worried about the now various discourse threads that propose using this package, and the possible breaking of other non-public scripts.

carstenbauer commented 8 months ago

Speaking of misnomer, Chunk should probably also be called ChunkIterator - similar to PartitionIterator that Iterators.partition returns - or, in light of this thread, maybe even ChunkIndicesIterator.

(Even if we don't like the "Iterator" part, it should at least be the plural Chunks).

carstenbauer commented 4 months ago

To summarize, the current proposal is

Rename chunks to chunks_indices,
Introduce a new chunks that returns (views into) data rather than indices,
Rename Chunk type to ChunkIndicesIterator

lmiq commented 4 months ago

Chunk was an internal name up to 2.4.3, and I made it public just because OhMyThreads is using it, but it is not even documented. So that change is essentially non-breaking.
I would maybe also make getchunk and internal function. It is documented and exported now, but I do not see any use of it really outside the package, and having it exported just limits out implementation flexibility.

carstenbauer commented 1 month ago

See 3.x release :)