Closed carstenbauer closed 6 months ago
Wanted this again yesterday. I'll try to add this but maybe someone wants to beat me to it :wink:
I played with it a little now. The question is what do we want when the length of the data is not a multiple of the chunksize. For example, with a data of length 7 and a chunksize of 4, we could have the following chunks:
1. `(1:7,)`, with `n = div(length(data), chunksize)` chunks
2. `(1:4, 5:7)`, adding one extra chunk if the remainder is not zero

These two options are the easiest to implement. The second one seems reasonable. However, if the same 7-element data is divided with a chunksize of 5, what do we want? Chunks of lengths 4 and 3, or 5 and 2?
Anyway, this is my first try, which follows option 2: https://github.com/JuliaFolds2/ChunkSplitters.jl/commit/f2f434c457a3b64f6117a921d324ba093f25e30b
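For concreteness, the 4-and-3 split mentioned above falls out if the chunksize is first translated into a number of chunks and the data is then divided as evenly as possible. A rough sketch of that idea (not the package's API):

```julia
# Translating chunksize into a number of chunks and splitting as evenly
# as possible: for 7 elements and a chunksize of 5 this yields chunks of
# lengths 4 and 3, rather than the "full chunks first" lengths 5 and 2.
len, chunksize = 7, 5
nchunks = cld(len, chunksize)        # 2 chunks
base, extra = divrem(len, nchunks)   # base = 3, extra = 1
# the first `extra` chunks get one extra element
lengths = [base + (i <= extra) for i in 1:nchunks]
# lengths == [4, 3]
```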
I think option 2 is fine. The perhaps strange cases are those in which the chunksize is greater than half of the length of the data, but parallelizing under that condition is senseless anyway.
Since the inspiration for this was OpenMP (i.e. `schedule(static, 8)`) it might make sense to check what OpenMP does in this case.
The OpenMP specification is:
> If the kind argument is static, iterations are divided into chunks of size chunk_size, and the chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number. Each chunk contains chunk_size iterations, except for the chunk that contains the sequentially last iteration, which may have fewer iterations. If chunk_size is not specified, the logical iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread.
Which means that 7-element data would be divided into chunks of lengths 5 and 2 if the `chunk_size` is set to 5. Unfortunately that is more laborious to implement, as it does not comply with an equivalent definition of the number of chunks in any way.
I like what OpenMP is doing. When I specify a chunk size and look at the resulting chunks, there shouldn't be more than one that doesn't have the requested chunk size. Hence, I would expect that for a chunk size of 5, 7 elements would be split up into two chunks of size 5 and 2, respectively. (And for a chunk size of 4, I would expect two chunks of size 4 and 3, respectively.) This would also be consistent with `Iterators.partition`:
```julia
julia> Iterators.partition(1:7, 5) |> collect
2-element Vector{UnitRange{Int64}}:
 1:5
 6:7

julia> Iterators.partition(1:7, 4) |> collect
2-element Vector{UnitRange{Int64}}:
 1:4
 5:7
```
The way I would implement this naively is `d, r = divrem(nelements, chunksize)`. `d` tells me how many full chunks I need, and `r` tells me whether (`r != 0`) I need an extra chunk and how many elements it will contain (`r`).
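A minimal sketch of that `divrem` approach (the function name is hypothetical, not the package's API):

```julia
# OpenMP-like chunking: every chunk has the requested size,
# except possibly the last one, which holds the remainder.
function chunk_ranges(nelements, chunksize)
    d, r = divrem(nelements, chunksize)
    # d full chunks of exactly `chunksize` elements
    ranges = [(i - 1) * chunksize + 1 : i * chunksize for i in 1:d]
    # one extra chunk of r elements if the remainder is nonzero
    r != 0 && push!(ranges, d * chunksize + 1 : nelements)
    return ranges
end

chunk_ranges(7, 5)  # [1:5, 6:7]
chunk_ranges(7, 4)  # [1:4, 5:7]
```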
> Unfortunately that is more laborious to implement, as it does not comply with an equivalent definition of the number of chunks in any way.
If you mean that you can't just translate `chunksize` into a value for the number of chunks and then use the existing algorithm, then I agree, yes. But I think we nonetheless want the OpenMP behavior.
Yes, for that we need to split the two implementations. I started it here, but it began to get ugly, so I stopped for the moment.
Made a PR (#26). I left the combination `size=...` + `split=:scatter` undefined for now (it throws an `ArgumentError`) because it's not entirely clear what we would want in this case. Maybe something like this, maybe not :)
size = 2
input [1, 2, 3, 4, 5, 6, 7]
output [[1, 7], [2, 6], [3, 5], [4]]
I feel that a reasonable property to ensure is that the number and lengths of the chunks are invariant under changing `split`.
I would say
[[1, 4], [2, 5], [3, 6], [7]]
It seems more consistent to leave the remainder at the end.
Fair enough, I guess. But why "split" in the middle? We could also aim for "round-robin":
[[1, 5], [2, 6], [3, 7], [4]]
There is also the extra difficulty that we have to express this in ranges, not arrays.
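The round-robin variant at least maps onto ranges nicely: with `c` chunks, chunk `i` of a scatter split is just the step range `i:c:n`. A sketch (hypothetical, not the package's API):

```julia
# Round-robin (:scatter) chunks expressed as StepRanges:
# with 7 elements and chunksize 2 we get 4 chunks, where
# chunk i holds the indices i, i + 4, i + 8, ...
n, chunksize = 7, 2
nchunks = cld(n, chunksize)              # 4 chunks
scatter = [i:nchunks:n for i in 1:nchunks]
# collect.(scatter) == [[1, 5], [2, 6], [3, 7], [4]]
```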
I don't know. I only see it as more consistent with the fact that, when using `:batch`, the odd chunk out is the last one. If someone intentionally relies on that pattern, it feels like this one should follow it too.
The "different chunk" is still the last one: it has fewer elements (one, in the example).
Yes, true, but it doesn't hold the last elements, and that is something someone could rely on as information. But, again, I don't think this is really important, if you find the other way around more elegant.
Added a new issue related to `size` and `:scatter` here: https://github.com/JuliaFolds2/ChunkSplitters.jl/issues/28
I've stumbled across this a couple of times already. Note that OpenMP has `schedule(static, 8)`, which means static scheduling in terms of chunks of size 8 (AFAIR). Would be great if we could write something like the following in Julia.
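The snippet this comment refers to isn't preserved in the thread. Something in this spirit is presumably meant; the helper below is a self-contained stand-in for a chunks-of-fixed-size iterator (all names here are illustrative, not the package's API):

```julia
# Stand-in for "schedule(static, 8)": split the index space into
# fixed-size ranges and process them with Julia's threads.
using Base.Threads: @threads

# illustrative helper: index ranges of (at most) `sz` elements each
fixed_size_chunks(x, sz) =
    (i:min(i + sz - 1, lastindex(x)) for i in firstindex(x):sz:lastindex(x))

data = collect(1:100)
ranges = collect(fixed_size_chunks(data, 8))   # 13 ranges: 1:8, ..., 97:100

# each task handles one chunk; here: a threaded partial-sum reduction
partial = zeros(Int, length(ranges))
@threads for c in eachindex(ranges)
    partial[c] = sum(view(data, ranges[c]))
end
sum(partial)  # == sum(data) == 5050
```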