Closed. carstenbauer closed this pull request 1 month ago.
```julia
julia> using ChunkSplitters

julia> function f(x)
           s = zero(eltype(x))
           for xc in chunk(x; n=4)
               s += sum(xc)
           end
           return s
       end
f (generic function with 1 method)

julia> x = rand(10^3);

julia> f(x) ≈ sum(x)
true

julia> function f_indices(x)
           s = zero(eltype(x))
           for idcs in chunk_indices(x; n=4)
               s += sum(@view(x[idcs]))
           end
           return s
       end
f_indices (generic function with 1 method)

julia> f_indices(x) ≈ sum(x)
true
```
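For completeness, here is one way to inspect what the two iterators actually yield, a sketch against the `chunk`/`chunk_indices` API shown above (exact element types may differ):

```julia
using ChunkSplitters: chunk, chunk_indices

x = collect(1:8)

# chunk delivers the data itself, i.e. views into x
for c in chunk(x; n=2)
    println(c)     # two views, each holding 4 consecutive elements
end

# chunk_indices delivers the corresponding index ranges instead
for idcs in chunk_indices(x; n=2)
    println(idcs)  # e.g. 1:4, then 5:8
end
```

The `@view(x[idcs])` pattern in `f_indices` above is how one recovers the data from the index ranges manually.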
```julia
julia> @benchmark f($x)
BenchmarkTools.Trial: 10000 samples with 788 evaluations.
 Range (min … max):  160.057 ns … 195.801 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     160.426 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   160.774 ns ±   1.690 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄██▄                                       ▂
  ████▇▆▅▄▅▅▅▆▇██▆▅▅▄▄▅▅▄▃▅▅▆▄▄▅▄▅▄▄▅▆▆▇▇▇███▇▇▅▆▅▅▅▅▃▅▁▄▄▄▄▁▄▆ █
  160 ns        Histogram: log(frequency) by time        170 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f_indices($x)
BenchmarkTools.Trial: 10000 samples with 788 evaluations.
 Range (min … max):  159.688 ns … 181.313 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     159.898 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   160.350 ns ±   1.865 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▄▂ ▁                                  ▁
  █████▆▆▆▆▇█▇▇▅▆▅▄▄▄▁▁▆▅▄▅▅▄▃▃▅▅▇▆▆▇██▇▆▅▆▆▅▄▃▅▄▄▅▅▅▄▅▆▆▃▄▆▃▄▇ █
  160 ns        Histogram: log(frequency) by time        171 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
Preview of the revamped documentation: https://juliafolds2.github.io/ChunkSplitters.jl/previews/PR48/
On 1.10, `chunk(x; size=10)` seems to allocate whereas `chunk_indices(x; size=10)` does not. Why?
```julia
function f_chunk(x; n=nothing, size=nothing)
    s = zero(eltype(x))
    for xdata in chunk(x; n=n, size=size)
        s += sum(xdata)
    end
    return s
end

x = rand(10^3)

b = @benchmark $f_chunk($x; n=4) samples = 1 evals = 1
@test b.allocs == 0

b = @benchmark $f_chunk($x; size=10) samples = 1 evals = 1
if VERSION > v"1.10"
    @test b.allocs == 0
else
    @test_broken b.allocs == 0  # TODO: why does this allocate on 1.10?
end
```
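A quicker way to see the same difference without BenchmarkTools is to compare `@allocated` after a warm-up call. This is just a sketch; the helper name `sum_chunks` is made up here:

```julia
using ChunkSplitters: chunk

# sum over the chunk views; equivalent to f_chunk but as a one-liner
sum_chunks(x; kw...) = sum(sum, chunk(x; kw...))

x = rand(10^3)
sum_chunks(x; n=4); sum_chunks(x; size=10)  # compile both paths first

@allocated(sum_chunks(x; n=4))     # expected: 0 on all supported versions
@allocated(sum_chunks(x; size=10)) # nonzero on 1.10, per the report above
```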
```julia
julia> b = @benchmark $f_chunk($x; size=10) samples = 1 evals = 1
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 1.542 μs (0.00% GC) to evaluate,
 with a memory estimate of 6.19 KiB, over 99 allocations.
```
I noticed that the SVG in the old/current docs is wrong.

Indicating the scattering by coloring chunks: (figure omitted)

Indicating the scattering by arrows: (figure omitted)

I prefer the latter (and will use it).
Alright guys, as it turned out, I had the motivation to implement this right away. Please check out the new (preview) docs and run some tests to see if things are stable.
What I did beyond the points mentioned in the OP is drop 1.6 support (1.10 is the new lower bound) and remove the `examples/` folder.

Also, if you have an idea about https://github.com/JuliaFolds2/ChunkSplitters.jl/pull/48#issuecomment-2363287934, that would be great as well.

I'll try to look at this tonight. I suspect there's an inference problem here on 1.10, but I'll dig in and see if we can figure out what went wrong.
Should we rename `BatchSplit` to `ContiguousSplit`? The latter seems to communicate the critical point more clearly: the resulting chunks hold contiguous data.

I don't like that we use "chunk" on the one hand and "split" on the other hand. Why two terms for the same thing? Perhaps something like `chunk(x; n=4, strategy=Contiguous())` or something less verbose makes more sense, as it gets rid of "split" terminology entirely?
I also like these changes, in any case.
Concerning https://github.com/JuliaFolds2/ChunkSplitters.jl/pull/48#issuecomment-2363287934, I've tried many things by now, including manually defining functions for each case instead of relying on dispatch for `_getchunkindices`, but nothing worked. I couldn't detect any type inference problem either.
I still think that it might make sense to consider renaming `Batch` → `Consecutive` and `Scatter` → `RoundRobin`. (I've just done this in a new commit.)

As for renaming `split` to `strategy` (or something similar), there are two reasons I didn't do it right away (but I'm still open to being convinced):

1. `strategy` feels long compared to `split`.
2. The package name ChunkSplitters.jl already establishes "chunk" and "split" as terminology.
Attention: Patch coverage is 92.03540% with 9 lines in your changes missing coverage. Please review.

Project coverage is 92.03%. Comparing base (ceadbb6) to head (1d4cdf4). Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/internals.jl | 91.89% | 9 Missing :warning: |
I've implemented the suggestions by @fredrikekre (thanks for them!).
One remaining minor question is whether we should export the subtypes `<:Split`, i.e. `Consecutive` and `RoundRobin` (in any case, they are public API). On the one hand, it is nice that one can just do `using ChunkSplitters` and write `chunks(x; n=10, split=RoundRobin())`. OTOH, the terms are quite generic and might cause name clashes (e.g. https://juliahub.com/ui/Search?type=symbols&q=Consecutive&u=define).
If we don't export them, one has to write

```julia
chunks(x; n=10, split=ChunkSplitters.RoundRobin())
# or, alternatively
using ChunkSplitters: RoundRobin
chunks(x; n=10, split=RoundRobin())
```
Again, I don't have a strong opinion but tend to slightly prefer not exporting them.
We could also name them `RoundRobinSplit` and `ConsecutiveSplit`, of course, at the cost of redundancy (`split=RoundRobinSplit()`).
Opinions?
I prefer not exporting, in particular since the practice of always explicitly `using` symbols anyway is getting more common (ericphanson/ExplicitImports.jl).

> On the one hand, it is nice that one can just do `using ChunkSplitters` and write `chunks(x; n=10, split=RoundRobin())`.
I like the trend of explicitly `using` the bits that one actually needs. However, for this case, it doesn't matter whether we export the `<:Split` types or not. The question is about what should be "imported" if one simply does `using ChunkSplitters`. Not exporting anything (are you suggesting that?) seems radical to me. The naive `using ChunkSplitters` is useful in interactive workflows where I don't want to explicitly list everything I need.
This export is actually slightly annoying because the perfect variable name for the return value is `chunks`, and obviously `chunks = chunks(...)` doesn't work.

With the current export situation I would simply use `using ChunkSplitters: ChunkSplitters` and always use the `ChunkSplitters.` prefix for clarity.
Once again, not exporting `chunks`, the main API of this package, seems bad/too radical to me. And while I see your point about the `chunks` name clash, the same could be said about `size = size(...)`, `length = length(...)`, and probably many more functions. Moreover, the typical use case of `chunks(...)` is in combination with `for` or `map`, in which case you don't store it in a variable anyway. As you pointed out, you can always opt out of the "import everything".
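To illustrate, with the current (2.x) API, where `chunks` yields index ranges, the typical pattern never binds the iterator to a variable, so the `chunks = chunks(...)` clash rarely comes up in practice (a sketch):

```julia
using ChunkSplitters

x = rand(100)

# map over the index ranges directly; no variable named `chunks` needed
partial = map(idcs -> sum(@view x[idcs]), chunks(x; n=4))
total = sum(partial)
total ≈ sum(x)  # the per-chunk partial sums add up to the full sum
```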
Yeah, I just meant that if anything, `chunks` would be the "problematic" export, not the split types.
Just to say that I think this PR is fine, but we need to find a solution to https://github.com/JuliaFolds2/ChunkSplitters.jl/pull/48#issuecomment-2363287934 .
(even if it means reporting an inference bug against 1.10.5 and seeing if the fix can be backported from 1.11)
I think this is good to go (unless we want to make other breaking changes).
Thanks a lot @carstenbauer, and others of course, great work!
Released 3.0 now.
Closes #47
Copied from there: a summary of the new API / changes that need to be made (likely incomplete, and maybe some points are still up for debate):

- Introduce `chunk(x; ...)`, which returns an iterator that delivers the chunks, i.e. views into `x`. (We choose `chunk`, singular, here to indicate the verb form "to chunk". This is to be in line with `enumerate`, `zip`, and co.)
- Rename `chunks(x; ...)` to `chunk_indices(x; ...)` and make the latter return an iterator that delivers the indices (of `x`) for each chunk.
- Iterator names: I implemented `Chunk` (returned by `chunk`) and `ChunkIndices` (returned by `chunk_indices`). This is in line with `Base.Iterators.Enumerate` (returned by `enumerate`), `Base.Pairs` (returned by `pairs`), and `Base.Iterators.Zip` (returned by `zip`). An alternative would be a single iterator `ChunksIterator` that has a type parameter `ReturnIndices` or `ReturnViews` to indicate what will eventually be returned.
- Drop the `Symbol` option for `split`. While it is slightly more convenient, it can be less efficient, and it doesn't make much sense to have two ways to do the same thing.
- Make `getchunk` internal (not public API anymore) and rename it to `getchunkindices`.

What I think might make sense but I'm unsure about / not sure it's worth it:

- Rename `BatchSplit` to `ContiguousSplit`? The latter seems to communicate the critical point more clearly: the resulting chunks hold contiguous data.
- Perhaps `chunk(x; n=4, strategy=Contiguous())` or something less verbose makes more sense, as it gets rid of "split" terminology entirely?
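Putting the points above together, the proposed surface would look roughly like this. This is a sketch using the names discussed in this thread (`chunk`, `chunk_indices`, `split=RoundRobin()`); the exact output shapes may differ:

```julia
using ChunkSplitters: chunk, chunk_indices, RoundRobin

x = collect(1:10)

# consecutive index ranges for each chunk, e.g. 1:5 and 6:10
collect(chunk_indices(x; n=2))

# the corresponding views into x
collect(chunk(x; n=2))

# round-robin splitting: elements are scattered across chunks instead
collect(chunk(x; n=2, split=RoundRobin()))
```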