Add fast pathway for `copy`, `collect`, `tcollect`, and `tcopy` for size-stable operations

Current State

Transducers is quite good at reductions, but collecting results into an output array is a major weakness. Currently, collection is essentially implemented as

foldxl(append!!, Map(f), coll)

(or foldxt for the parallel version). If f is expensive to evaluate, this extra overhead isn't so bad, but for functions that take only a CPU cycle or two, it's catastrophic.
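To make the cost concrete, here is a minimal Base-only sketch of the same idea: a left fold whose accumulator is a vector that grows by one element per step. The name collect_by_fold is hypothetical, and the element type is passed in by hand to keep the sketch short, whereas the real code works it out as it goes.

```julia
# Hypothetical helper (not part of Transducers): collect by folding with push!.
# The output starts empty and grows as elements arrive, so it has to be
# resized repeatedly instead of being allocated once at its final size.
collect_by_fold(f, ::Type{T}, coll) where {T} =
    foldl((acc, x) -> push!(acc, f(x)), coll; init = T[])

A = [-1.5, 2.0, -3.25]
B = collect_by_fold(abs, Float64, A)
```

The result matches map(abs, A); the difference is only in how the output buffer is built.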

Here's how it currently looks with a very cheap function (abs):

julia> let A = rand(100_000)
           @btime map(abs, $A)
           @btime collect(Map(abs), $A)
           @btime tcollect(Map(abs), $A)
       end;
  31.440 μs (2 allocations: 781.30 KiB)
  70.460 μs (12 allocations: 1.83 MiB)
  212.270 μs (123 allocations: 4.54 MiB)

And here's a more expensive function (sin):

julia> let A = rand(100_000)
           @btime map(sin, $A)
           @btime collect(Map(sin), $A)
           @btime tcollect(Map(sin), $A)
       end;
  447.810 μs (2 allocations: 781.30 KiB)
  486.680 μs (12 allocations: 1.83 MiB)
  302.360 μs (123 allocations: 4.54 MiB)

This PR

In this PR I added a method of collect(xf::Transducer, coll) (and similarly for copy) that checks whether xf preserves the size of coll (i.e. Map is okay, but Filter is not) and whether coll has a known (runtime) size. If both conditions hold, we take a more optimized path that allocates the output up front and fills it with setindex!!.
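The fast path can be sketched roughly as follows, using Base only. Both function names are hypothetical, and setindex_widen is a simplified stand-in for setindex!! (its widening rule, a plain Union, is an assumption and may differ from what BangBang actually does). The size-preservation check on xf is left out; the sketch assumes a one-to-one function f.

```julia
# Stand-in for setindex!!: write in place when the value fits the element
# type, otherwise copy into a wider array and return that array instead.
function setindex_widen(out::AbstractVector, y, i::Integer)
    if y isa eltype(out)
        out[i] = y
        return out
    end
    wider = similar(out, Union{eltype(out), typeof(y)})
    copyto!(wider, 1, out, 1, i - 1)  # only slots 1:i-1 have been filled so far
    wider[i] = y
    return wider
end

# Hypothetical fast path for a size-preserving, one-to-one function f.
function collect_sized(f, coll::AbstractVector)
    n = length(coll)
    n == 0 && return Vector{Union{}}(undef, 0)
    y1 = f(first(coll))
    out = Vector{typeof(y1)}(undef, n)  # allocate once, at the final size
    out[1] = y1
    for (i, x) in Iterators.drop(enumerate(coll), 1)
        out = setindex_widen(out, f(x), i)  # rebinding matters if out was widened
    end
    return out
end

same  = collect_sized(abs, [-1.0, 2.0, -3.0])
mixed = collect_sized(x -> x > 0 ? x : missing, [1, -2, 3])
```

In the common case where every result has the same type, the loop never leaves the in-place branch, which is what brings the cost in line with map.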

We can't use setindex!! directly for tcollect, because setindex!! may return a new output object, and that would cause race conditions between tasks that are still writing to the old one. Instead, for tcollect I split the collection into chunks whose size is determined by basesize, collect each chunk separately, and combine the results. (I currently use Iterators.partition for the splitting and want to switch to SplittablesBase.jl before merging.)
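A rough Base-only sketch of the chunked strategy is shown here. The name tcollect_chunked is hypothetical, map stands in for the per-chunk fast path, and vcat stands in for however the per-chunk outputs are actually combined (note that vcat promotes element types rather than forming a Union).

```julia
using Base.Threads: @spawn

# Hypothetical sketch: one task per chunk, each task owns its own output
# array, so no task ever writes to an array another task might replace.
function tcollect_chunked(f, coll::AbstractVector; basesize::Integer = 1024)
    tasks = Task[]
    for chunk in Iterators.partition(coll, basesize)
        push!(tasks, @spawn map(f, chunk))
    end
    isempty(tasks) && return Vector{Union{}}(undef, 0)
    # Combining the per-chunk arrays costs an extra allocation and copy.
    return reduce(vcat, fetch.(tasks))
end

A = collect(-10.0:10.0)
B = tcollect_chunked(abs, A; basesize = 4)
```

Chunks are fetched in the order they were spawned, so the output order matches the input order regardless of which task finishes first.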

Here's what those benchmarks look like with the new changes. First abs:

julia> let A = rand(100_000)
           @btime map(abs, $A)
           @btime collect(Map(abs), $A)
           @btime tcollect(Map(abs), $A)
       end;
  28.860 μs (2 allocations: 781.30 KiB)
  28.870 μs (2 allocations: 781.30 KiB)
  162.670 μs (244 allocations: 3.15 MiB)

and sin:

julia> let A = rand(100_000)
           @btime map(sin, $A)
           @btime collect(Map(sin), $A)
           @btime tcollect(Map(sin), $A)
       end;
  481.480 μs (2 allocations: 781.30 KiB)
  482.801 μs (2 allocations: 781.30 KiB)
  217.760 μs (244 allocations: 3.15 MiB)

So that's a nice speedup. tcollect is still leaving some performance on the table, but it's an improvement. This should help alleviate https://github.com/tkf/ThreadsX.jl/issues/196, though it still won't be as fast as ThreadsX.map!, since the way we combine the results from the different arrays is not as efficient as preallocating the output and then just assigning into it.
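For comparison, the "preallocate and then just assign" approach looks roughly like this when the output element type is known up front. This is a Base-only sketch with a hypothetical name, not the ThreadsX.map! implementation.

```julia
using Base.Threads: @spawn

# Hypothetical sketch: the caller supplies the output array, so every task
# writes into its own disjoint index range and nothing is combined afterwards.
function tmap_prealloc!(f, out::AbstractVector, coll::AbstractVector;
                        basesize::Integer = 1024)
    length(out) == length(coll) ||
        throw(DimensionMismatch("out and coll must have the same length"))
    @sync for r in Iterators.partition(eachindex(out, coll), basesize)
        @spawn for i in r
            out[i] = f(coll[i])
        end
    end
    return out
end

A = collect(-10.0:10.0)
out = similar(A)
tmap_prealloc!(abs, out, A; basesize = 4)
```

This avoids the final concatenation step entirely, but it only works because the element type of out is fixed in advance, which collect and tcollect cannot assume.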

Codecov Report

Merging #553 (c616391) into master (f8d0dfe) will increase coverage by 0.11%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #553      +/-   ##
==========================================
+ Coverage   95.43%   95.54%   +0.11%     
==========================================
  Files          32       32              
  Lines        2233     2268      +35     
==========================================
+ Hits         2131     2167      +36     
+ Misses        102      101       -1

| Flag | Coverage Δ | |
|---|---|---|
| Pkg.test | `94.54% <100.00%> (-0.02%)` | :arrow_down: |
| Run.test | `95.41% <100.00%> (+0.20%)` | :arrow_up: |

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/Transducers.jl | `73.33% <ø> (ø)` | |
| src/core.jl | `93.15% <100.00%> (+0.09%)` | :arrow_up: |
| src/dreduce.jl | `100.00% <100.00%> (ø)` | |
| src/processes.jl | `94.71% <100.00%> (+0.50%)` | :arrow_up: |
| src/reduce.jl | `96.61% <100.00%> (+0.18%)` | :arrow_up: |

... and 2 files with indirect coverage changes

JuliaFolds / Transducers.jl