JuliaFolds2 / ChunkSplitters.jl

Splitting collections into chunks
https://juliafolds2.github.io/ChunkSplitters.jl/
MIT License

Changes for 3.0 #48

Closed carstenbauer closed 1 month ago

carstenbauer commented 1 month ago

Closes #47

Copied from there: A summary of the new API / changes that need to be made (likely incomplete and maybe some points are still up for debate):

What I think might make sense but I'm unsure about / not sure it's worth it:

carstenbauer commented 1 month ago
julia> using ChunkSplitters

julia> function f(x)
           s = zero(eltype(x))
           for xc in chunk(x; n=4)
               s += sum(xc)
           end
           return s
       end
f (generic function with 1 method)

julia> x=rand(10^3);

julia> f(x) ≈ sum(x)
true

julia> function f_indices(x)
           s = zero(eltype(x))
           for idcs in chunk_indices(x; n=4)
               s += sum(@view(x[idcs]))
           end
           return s
       end
f_indices (generic function with 1 method)

julia> f_indices(x) ≈ sum(x)
true

julia> using BenchmarkTools

julia> @benchmark f($x)
BenchmarkTools.Trial: 10000 samples with 788 evaluations.
 Range (min … max):  160.057 ns … 195.801 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     160.426 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   160.774 ns ±   1.690 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄██▄                                                          ▂
  ████▇▆▅▄▅▅▅▆▇██▆▅▅▄▄▅▅▄▃▅▅▆▄▄▅▄▅▄▄▅▆▆▇▇▇███▇▇▅▆▅▅▅▅▃▅▁▄▄▄▄▁▄▆ █
  160 ns        Histogram: log(frequency) by time        170 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f_indices($x)
BenchmarkTools.Trial: 10000 samples with 788 evaluations.
 Range (min … max):  159.688 ns … 181.313 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     159.898 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   160.350 ns ±   1.865 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▄▂      ▁                                                   ▁
  █████▆▆▆▆▇█▇▇▅▆▅▄▄▄▁▁▆▅▄▅▅▄▃▃▅▅▇▆▆▇██▇▆▅▆▆▅▄▃▅▄▄▅▅▅▄▅▆▆▃▄▆▃▄▇ █
  160 ns        Histogram: log(frequency) by time        171 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
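
For reference, the consecutive partitioning behind this can be sketched in a few lines. This is a toy illustration with a hypothetical helper name (`simple_index_chunks`), not the package's actual implementation: with `n` chunks, the remainder elements are spread over the first chunks so sizes differ by at most one.

```julia
# Toy sketch of consecutive index chunking into `n` chunks.
# `simple_index_chunks` is a hypothetical name, not part of the ChunkSplitters API.
function simple_index_chunks(x; n::Int)
    len = length(x)
    base, extra = divrem(len, n)  # the first `extra` chunks get one more element
    sizes = [i <= extra ? base + 1 : base for i in 1:n]
    stops = cumsum(sizes)
    starts = [1; stops[1:end-1] .+ 1]
    return [starts[i]:stops[i] for i in 1:n]
end

simple_index_chunks(1:10; n=4)  # [1:3, 4:6, 7:8, 9:10]
```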
carstenbauer commented 1 month ago

Preview of the revamped documentation: https://juliafolds2.github.io/ChunkSplitters.jl/previews/PR48/

carstenbauer commented 1 month ago

On 1.10, chunk(x; size=10) seems to allocate whereas chunk_indices(x; size=10) does not. Why?

using BenchmarkTools, Test  # for @benchmark and @test

function f_chunk(x; n=nothing, size=nothing)
    s = zero(eltype(x))
    for xdata in chunk(x; n=n, size=size)
        s += sum(xdata)
    end
    return s
end
x = rand(10^3)
b = @benchmark $f_chunk($x; n=4) samples = 1 evals = 1
@test b.allocs == 0
b = @benchmark $f_chunk($x; size=10) samples = 1 evals = 1
if VERSION >= v"1.11"  # note: `VERSION > v"1.10"` would wrongly include 1.10.x patch releases
    @test b.allocs == 0
else
    @test_broken b.allocs == 0 # TODO: why does this allocate on 1.10?
end
julia> b = @benchmark $f_chunk($x; size=10) samples = 1 evals = 1
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 1.542 μs (0.00% GC) to evaluate,
 with a memory estimate of 6.19 KiB, over 99 allocations.
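
For anyone bisecting this: a self-contained toy iterator of the same shape (a hypothetical `SizeChunks` type, not the package code) can serve as a baseline, since iterating fixed-size views is allocation-free when inference succeeds.

```julia
# Toy iterator over fixed-size view chunks; hypothetical, not the package's code.
struct SizeChunks{T}
    data::T
    size::Int
end

Base.length(c::SizeChunks) = cld(length(c.data), c.size)

function Base.iterate(c::SizeChunks, start=1)
    start > length(c.data) && return nothing
    stop = min(start + c.size - 1, length(c.data))
    return view(c.data, start:stop), stop + 1
end

function sum_chunks(x; size)
    s = zero(eltype(x))
    for v in SizeChunks(x, size)
        s += sum(v)
    end
    return s
end

x = rand(10^3)
sum_chunks(x; size=10)              # warm-up (compiling) call
@allocated(sum_chunks(x; size=10))  # typically 0 after the warm-up
```

If a baseline like this stays at 0 allocations on 1.10 while `chunk(x; size=10)` does not, that points toward an inference issue in the real iterator rather than the pattern itself.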
carstenbauer commented 1 month ago

I noticed that the svg in the old/current docs is wrong.

Indicating the scattering by coloring chunks:

[image: splitters_colors]

Indicating the scattering by arrows:

[image: splitters_arrows]

I prefer the latter (and will use it).

carstenbauer commented 1 month ago

Alright guys, as it turned out, I had the motivation to implement this right away. Please check out the new (preview) docs and run some tests to see if things are stable.

What I did beyond the points mentioned in the OP was to drop 1.6 support (1.10 is the new lower bound).

Also, if you have an idea about https://github.com/JuliaFolds2/ChunkSplitters.jl/pull/48#issuecomment-2363287934 that would be great as well.

MasonProtter commented 1 month ago

I'll try to look at this tonight. I suspect there's an inference problem here on 1.10, but I'll dig in and see if we can figure out what went wrong.

lmiq commented 1 month ago
  • Should we rename BatchSplit to ContiguousSplit? The latter seems to communicate the critical point more clearly: the resulting chunks hold contiguous data.

  • I don't like that we use "chunk" on the one hand and "split" on the other hand. Why two terms for the same thing? Perhaps something like chunk(x; n=4, strategy=Contiguous()) or something less verbose makes more sense as it gets rid of "split" terminology entirely?

I also like these changes, in any case.

Concerning https://github.com/JuliaFolds2/ChunkSplitters.jl/pull/48#issuecomment-2363287934, I've now tried many things, including manually defining functions for each case instead of relying on dispatch for _getchunkindices, but nothing worked. I couldn't detect any type inference problem either.

carstenbauer commented 1 month ago

> I also like these changes, in any case.

I still think that it might make sense to consider renaming Batch → Consecutive and Scatter → RoundRobin. (I've just done this in a new commit).

As for renaming split to strategy (or something similar), there are two reasons I didn't do it right away (but I'm still open to being convinced): 1) strategy feels long compared to split; 2) the package name ChunkSplitters.jl already establishes "chunk" and "split" as terminology.
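
To make the renamed strategies concrete, here is a toy illustration (hypothetical helper functions, not the package's implementation) of the semantic difference: Consecutive hands out contiguous index blocks, while RoundRobin deals indices out cyclically.

```julia
# Hypothetical one-liners illustrating the two split strategies.
# Consecutive (formerly Batch): contiguous blocks of roughly equal size.
consecutive_split(r, n) = [r[round(Int, (i - 1) * length(r) / n)+1:round(Int, i * length(r) / n)] for i in 1:n]
# RoundRobin (formerly Scatter): every n-th index, offset by the chunk number.
roundrobin_split(r, n) = [r[i:n:end] for i in 1:n]

collect.(consecutive_split(1:8, 3))  # [[1, 2, 3], [4, 5], [6, 7, 8]]
collect.(roundrobin_split(1:8, 3))   # [[1, 4, 7], [2, 5, 8], [3, 6]]
```

Both strategies cover every index exactly once; they only differ in which indices end up in the same chunk, which matters e.g. for load balancing when cost varies along the collection.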

codecov-commenter commented 1 month ago

Codecov Report

Attention: Patch coverage is 92.03540% with 9 lines in your changes missing coverage. Please review.

Project coverage is 92.03%. Comparing base (ceadbb6) to head (1d4cdf4). Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/internals.jl | 91.89% | 9 Missing |

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main      #48      +/-   ##
==========================================
- Coverage   96.33%   92.03%   -4.30%
==========================================
  Files           2        2
  Lines         300      113     -187
==========================================
- Hits          289      104     -185
+ Misses         11        9       -2
```


carstenbauer commented 1 month ago

I've implemented the suggestions by @fredrikekre (thanks for them!).

One remaining minor question is whether we should export the subtypes <:Split, i.e. Consecutive and RoundRobin (in any case, they are public API). On the one hand, it is nice that one can just do using ChunkSplitters and write chunks(x; n=10, split=RoundRobin()). OTOH, the terms are quite generic and might cause name clashes (e.g. https://juliahub.com/ui/Search?type=symbols&q=Consecutive&u=define).

If we don't export them one has to write

chunks(x; n=10, split=ChunkSplitters.RoundRobin())

# or, alternatively

using ChunkSplitters: RoundRobin
chunks(x; n=10, split=RoundRobin())

Again, I don't have a strong opinion but tend to slightly prefer not exporting them.

We could also name them RoundRobinSplit and ConsecutiveSplit, of course, at the cost of redundancy (split=RoundRobinSplit()).

Opinions?

carstenbauer commented 1 month ago

> I prefer not exporting, in particular since the practice of always explicitly using symbols anyway is getting more common (ericphanson/ExplicitImports.jl).

> On the one hand, it is nice that one can just do using ChunkSplitters and write chunks(x; n=10, split=RoundRobin()).

I like the trend of explicitly using the bits that one actually needs. However, for this case, it doesn't matter whether we export the <:Split types or not. The question is about what should be "imported" if one simply does using ChunkSplitters. Not exporting anything (are you suggesting that?) seems radical to me. The naive using ChunkSplitters is useful in interactive workflows where I don't want to explicitly list everything I need.

> This export is actually slightly annoying because the perfect variable name for the return value is chunks, and obviously chunks = chunks(...) doesn't work.

> With the current export situation I would simply use
>
> using ChunkSplitters: ChunkSplitters
>
> and always use the ChunkSplitters. prefix for clarity.

Once again, not exporting chunks, the main API of this package, seems bad/too radical to me. And while I see your point about the chunks name clash, the same could be said about size = size(...), length = length(...), and probably many more functions. Moreover, the typical use case of chunks(...) is in combination with for or map, in which case you don't store it in a variable anyway. As you pointed out, you can always opt out of the "import everything".

fredrikekre commented 1 month ago

Yea I just meant that if anything chunks would be the "problematic" export, not the split types.

lmiq commented 1 month ago

Just to say that I think this PR is fine, but we need to find a solution to https://github.com/JuliaFolds2/ChunkSplitters.jl/pull/48#issuecomment-2363287934 .

(even if that means reporting an inference bug against 1.10.5 and seeing whether a fix can be backported from 1.11)

fredrikekre commented 1 month ago

https://github.com/JuliaFolds2/ChunkSplitters.jl/pull/50 fixes https://github.com/JuliaFolds2/ChunkSplitters.jl/pull/48#issuecomment-2363287934

carstenbauer commented 1 month ago

I think this is good to go (unless we want to make other breaking changes).

lmiq commented 1 month ago

Thanks a lot @carstenbauer, and others of course, great work!

lmiq commented 1 month ago

Released 3.0 now.