With the fix in #97, your example becomes:
julia> using MLUtils
julia> x = rand(100);
julia> x_train, x_val = splitobs(x; at=0.7);
julia> bv = BatchView(x_train; batchsize=10)
BatchView{Vector{Float64}, SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, Val{nothing}}([0.9166614880635612, 0.5513733116026945, 0.2664766210226831, 0.9215978618009951, 0.15930095801259392, 0.28310390900379867, 0.9813957774282672, 0.056657640264914266, 0.14508482981273974, 0.14487454412566503 … 0.7074638083968285, 0.9367841831175056, 0.160254219395352, 0.384295437306849, 0.16793652795004066, 0.3759666249745168, 0.02655634672084961, 0.3216955860573113, 0.6771047948440166, 0.19755907852314547], 10, 7, true, 70)
julia> s = shuffleobs(bv)
ObsView(BatchView(view(::Vector{Float64}, 1:70), batchsize=10, partial=true), ::Vector{Int64})
7 observations
julia> getobs(s, 1)
10-element Vector{Float64}:
0.034930450515507694
0.5960151448216738
0.9959885409830067
0.246327306131219
0.8792138974218081
0.67531260645465
0.8935358034806211
0.5178088319067405
0.9759862713159224
0.9439145657584737
julia> getobs(s, 1:2)
20-element Vector{Float64}:
0.034930450515507694
0.5960151448216738
0.9959885409830067
0.246327306131219
0.8792138974218081
0.67531260645465
0.8935358034806211
0.5178088319067405
0.9759862713159224
0.9439145657584737
0.6611650944902528
0.0678999516165758
0.7265303330511783
0.4733387334578564
0.5900766453884261
0.09660572584674165
0.8162450409901737
0.0512758131627854
0.3055481424109179
0.8606634983122741
From a practical perspective, I'd want shuffleobs(::BatchView) to shuffle the underlying data, not just the order of the fixed batches (which is what #97 does), since ultimately the loop I'm going for is:
dl = BatchView(x_train; batchsize=10)
dl = shuffleobs(dl) # An in-place shuffleobs! would be nice
buffer = getobs(dl, 1)
for bdx in 1:length(dl)
    getobs!(buffer, dl, bdx)
    ... # Fit some model
end
I can open a PR to add shuffleobs(::AbstractRNG, ::BatchView) if that's of interest.
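Roughly, a sketch of what I have in mind (assuming BatchView's batchsize and partial fields shown in the printouts above; a sketch, not a final implementation):

using Random: AbstractRNG
# Shuffle the underlying data, then re-batch it with the same settings,
# so the contents of each batch change, not just the batch order.
function MLUtils.shuffleobs(rng::AbstractRNG, A::BatchView)
    return BatchView(shuffleobs(rng, A.data); batchsize=A.batchsize, partial=A.partial)
end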
Best,
Alex
For example, in the single-batch case I'd want the batches to be different (results below are from main):
julia> using MLUtils
julia> x = rand(10);
julia> x_train, x_val = splitobs(x; at=0.5);
julia> dl = BatchView(x_train; batchsize=5)
BatchView{Vector{Float64}, SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, Val{nothing}}([0.5067688795812658, 0.28329526479397416, 0.5981029438210473, 0.8871276970763463, 0.7243634878808937], 5, 1, true, 5)
julia> dls = shuffleobs(dl)
ObsView(BatchView(view(::Vector{Float64}, 1:5), batchsize=5, partial=true), ::Vector{Int64})
1 observations
julia> getobs(dls) != getobs(dl)
false
Whereas I'd want getobs(dls) != getobs(dl) to be true.
@awadell1 For that use case we have eachobs(data, batchsize=5, shuffle=true) or DataLoader(data, batchsize=5, shuffle=true), both of which shuffle the data before batching. Does that satisfy your needs?
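For example (a minimal sketch; with shuffle=true the observations are reshuffled before batching each time iteration starts):

using MLUtils
x = rand(100)
for batch in DataLoader(x; batchsize=5, shuffle=true)
    # batches are formed from a fresh shuffle of x on each epoch
end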
They're a lot slower for my data type, I think partially due to the generator in DataLoader / eachobs:
# dl_train is a BatchView of a custom type, with `getobs`, `getobs!` and `length` implemented
julia> dl = DataLoader(dl_train.data; batchsize=32, shuffle=true)
julia> _, state = iterate(dl);
julia> @benchmark iterate($dl, $state)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 177.434 μs … 10.176 ms ┊ GC (min … max): 0.00% … 97.63%
Time (median): 180.730 μs ┊ GC (median): 0.00%
Time (mean ± σ): 218.006 μs ± 548.341 μs ┊ GC (mean ± σ): 16.58% ± 6.45%
▁▃▆██▇▆▅▃▁ ▁▂▃▂▁
▂▂▂▃▄▅▆███████████▇▇█████████▆▆▅▄▃▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂ ▄
177 μs Histogram: frequency by time 190 μs <
Memory estimate: 196.91 KiB, allocs estimate: 267.
julia> _, state = iterate(dl_train)
(ObsView(::DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, ::Vector{Int64})
32 observations, 2)
julia> @benchmark iterate($dl_train, $state)
BenchmarkTools.Trial: 10000 samples with 952 evaluations.
Range (min … max): 93.274 ns … 12.577 μs ┊ GC (min … max): 0.00% … 98.26%
Time (median): 100.542 ns ┊ GC (median): 0.00%
Time (mean ± σ): 204.966 ns ± 1.058 μs ┊ GC (mean ± σ): 47.86% ± 9.11%
▃█
██▇▅▅▄▅▅▅▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂ ▃
93.3 ns Histogram: frequency by time 223 ns <
Memory estimate: 416 bytes, allocs estimate: 2.
The issue looks to be the generator in eachobs / DataLoader getting boxed.
DataLoader on v0.2.5:
julia> _, state = iterate(dl);
julia> @code_warntype iterate(dl, state)
MethodInstance for iterate(::DataLoader{ObsView{DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, Vector{Int64}}, Random._GLOBAL_RNG}, ::Tuple{Base.Generator{UnitRange{Int64}, MLUtils.var"#34#36"}, Int64})
from iterate(d::DataLoader, state) in MLUtils at /home/awadell/.julia/packages/MLUtils/W3W0A/src/dataloader.jl:98
Arguments
#self#::Core.Const(iterate)
d::DataLoader{ObsView{DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, Vector{Int64}}, Random._GLOBAL_RNG}
state::Tuple{Base.Generator{UnitRange{Int64}, MLUtils.var"#34#36"}, Int64}
Locals
@_4::Int64
res::Union{Nothing, Tuple{Any, Int64}}
i::Int64
gen::Base.Generator{UnitRange{Int64}, MLUtils.var"#34#36"}
Body::Union{Nothing, Tuple{Any, Tuple{Base.Generator{UnitRange{Int64}, MLUtils.var"#34#36"}, Int64}}}
1 ─ nothing
│ %2 = Base.indexed_iterate(state, 1)::Core.PartialStruct(Tuple{Base.Generator{UnitRange{Int64}, MLUtils.var"#34#36"}, Int64}, Any[Base.Generator{UnitRange{Int64}, MLUtils.var"#34#36"}, Core.Const(2)])
│ (gen = Core.getfield(%2, 1))
│ (@_4 = Core.getfield(%2, 2))
│ %5 = Base.indexed_iterate(state, 2, @_4::Core.Const(2))::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(3)])
│ (i = Core.getfield(%5, 1))
│ (res = MLUtils.iterate(gen, i))
│ %8 = (res === MLUtils.nothing)::Bool
└── goto #3 if not %8
2 ─ return nothing
3 ─ %11 = Base.getindex(res::Tuple{Any, Int64}, 1)::Any
│ %12 = gen::Base.Generator{UnitRange{Int64}, MLUtils.var"#34#36"}
│ %13 = Base.getindex(res::Tuple{Any, Int64}, 2)::Int64
│ %14 = Core.tuple(%12, %13)::Tuple{Base.Generator{UnitRange{Int64}, MLUtils.var"#34#36"}, Int64}
│ %15 = Core.tuple(%11, %14)::Tuple{Any, Tuple{Base.Generator{UnitRange{Int64}, MLUtils.var"#34#36"}, Int64}}
└── return %15
BatchView + getobs! on v0.2.5:
julia> _, state = iterate(dl_train);
julia> @code_warntype iterate(dl_train, state)
MethodInstance for iterate(::BatchView{ObsView{DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, Vector{Int64}}, ObsView{DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, Vector{Int64}}}, ::Int64)
from iterate(A::BatchView, state) in MLUtils at /home/awadell/.julia/packages/MLUtils/W3W0A/src/batchview.jl:123
Arguments
#self#::Core.Const(iterate)
A::BatchView{ObsView{DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, Vector{Int64}}, ObsView{DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, Vector{Int64}}}
state::Int64
Body::Union{Nothing, Tuple{ObsView{DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, Vector{Int64}}, Int64}}
1 ─ %1 = MLUtils.numobs(A)::Int64
│ %2 = (state > %1)::Bool
└── goto #3 if not %2
2 ─ return MLUtils.nothing
3 ─ %5 = Base.getindex(A, state)::ObsView{DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, Vector{Int64}}
│ %6 = (state + 1)::Int64
│ %7 = Core.tuple(%5, %6)::Tuple{ObsView{DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, Vector{Int64}}, Int64}
└── return %7
Can you give a fully reproducible example? I don't see such a large discrepancy:
using MLUtils, BenchmarkTools
x = rand(10, 1000)
dl = DataLoader(x, batchsize=32, shuffle=true)
_, statedl = iterate(dl)
bv = BatchView(shuffleobs(x), batchsize=32)
_, statebv = iterate(bv)
@btime iterate($dl, $statedl); # 640.423 ns (5 allocations: 3.08 KiB)
@btime iterate($bv, $statebv); # 544.247 ns (2 allocations: 2.66 KiB)
I copied and pasted your example into a fresh Julia session and got the following results:
System | MLUtils | DataLoader iterate | BatchView iterate |
---|---|---|---|
AMD EPYC 7713, openSUSE Leap v15.3 | 0.2.5 | 531.260 ns | 43.526 ns |
2020 M1 Mac | 0.2.5 | 492.485 ns | 88.209 ns |
AMD EPYC 7713, openSUSE Leap v15.3 | 46e9f2cb | 587.217 ns | 511.384 ns |
2020 M1 Mac | 46e9f2cb | 583.333 ns | 418.553 ns |
I'm guessing the performance regression is due to the increase in allocations:
julia> @btime iterate($bv, $statebv); # M1 Mac @ 46e9f2cb
418.553 ns (2 allocations: 2.66 KiB)
julia> @btime iterate($bv, $statebv); # M1 Mac @ v0.2.5
88.209 ns (2 allocations: 400 bytes)
using Pkg
Pkg.activate(; temp=true)
Pkg.add(name="MLUtils", rev="main")
Pkg.add("BenchmarkTools")
Pkg.add("CUDA")
Pkg.add("Flux")
Pkg.add("Adapt")
using MLUtils, BenchmarkTools, Flux, Adapt, CUDA
struct Wrapper{T}
    data::T
end
Adapt.@adapt_structure Wrapper
Wrapper(x::T) where {T} = Wrapper{T}(deepcopy(x))

struct CustomType{T,N}
    data::Array{T, N}
    function CustomType{T}(n...) where T
        N = length(n)
        new{T,N}(rand(T, n...))
    end
end
CustomType(n...) = CustomType{Float32}(n...)

Base.length(x::CustomType{T, N}) where {T, N} = size(x.data, N)
MLUtils.getobs(x::CustomType{T, 4}, i) where {T} = Wrapper(x.data[:, :, :, i])
function MLUtils.getobs!(buffer::Wrapper, x::CustomType{T, N}, i) where {T, N}
    buffer.data .= selectdim(x.data, N, i)
    return buffer
end

# Route BatchView's getobs! through the custom type's in-place getobs!
function MLUtils.getobs!(buffer, A::MLUtils.BatchView, i)
    obsindices = MLUtils._batchrange(A, i)
    return getobs!(buffer, A.data, obsindices)
end
x = CustomType(64, 64, 128, 512);
# Benchmark getobs
buffer = getobs(x, 1);
@benchmark getobs($x, idx) setup=(idx=rand(1:length(x)))
@benchmark getobs!($buffer, $x, idx) setup=(idx=rand(1:length(x)))
# Benchmark Iterate
dl = DataLoader(x, batchsize=32, shuffle=false);
_, statedl = iterate(dl);
bv = BatchView(shuffleobs(x), batchsize=32);
_, statebv = iterate(bv);
@benchmark iterate($dl, $statedl)
@benchmark iterate($bv, $statebv)
@benchmark getobs!(bv_buffer, $bv, idx) setup=(idx=rand(1:length(bv)); bv_buffer = getobs(bv, 1))
# Allocating path: each iteration materializes a fresh batch
function foo(dl)
    y = 0.0
    for x in dl
        x_gpu = gpu(x)
        y += sum(x_gpu.data)
    end
    return y
end

# In-place path: reuse `buffer` across iterations via getobs!
function foo(buffer, dl)
    y = 0.0
    for bdx in 1:length(dl)
        getobs!(buffer, dl, bdx)
        x_gpu = gpu(buffer)
        y += sum(x_gpu.data)
    end
    return y
end
@benchmark CUDA.@sync(foo(dl))
@benchmark CUDA.@sync(foo(bv_buffer, bv)) setup=(bv_buffer = getobs(bv, 1))
Just setting a baseline for data access. Unsurprisingly, getobs! is faster.
julia> @benchmark getobs($x, idx) setup=(idx=rand(1:length(x)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 51.647 μs … 3.101 ms ┊ GC (min … max): 0.00% … 98.13%
Time (median): 53.451 μs ┊ GC (median): 0.00%
Time (mean ± σ): 65.234 μs ± 82.251 μs ┊ GC (mean ± σ): 4.40% ± 4.06%
▆█▆▁ ▂▃▃▂▁ ▁▂▂▂▂▁ ▂
████▇██▆▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▄▄▃▁▃▄▃▁▁▃▁▁▄▁▁▅███████▇███████ █
51.6 μs Histogram: log(frequency) by time 123 μs <
Memory estimate: 256.48 KiB, allocs estimate: 8.
julia> @benchmark getobs!($buffer, $x, idx) setup=(idx=rand(1:length(x)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 10.710 μs … 43.372 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 11.261 μs ┊ GC (median): 0.00%
Time (mean ± σ): 11.308 μs ± 1.010 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁ ▁▆█▇███▁▃▂ ▃▄▃▃▄ ▁
▁▄████▅▄▄▃▂▂▃▂▃▃▄▄▅▆▇▆██████████▇▇▇▆▇██████▇█▇▇▄▃▂▂▂▂▂▂▁▂▁▁ ▄
10.7 μs Histogram: frequency by time 11.9 μs <
Memory estimate: 48 bytes, allocs estimate: 3.
I do get similar performance for BatchView and DataLoader here, I think in part because both dispatch to getobs instead of getobs!. I'm surprised that explicitly using getobs! (and defining getobs!(buffer, x::BatchView, idx)) gave worse performance than the allocating versions. Any insight into what's going on here?
julia> @benchmark iterate($dl, $statedl)
BenchmarkTools.Trial: 2626 samples with 1 evaluation.
Range (min … max): 1.755 ms … 7.785 ms ┊ GC (min … max): 0.00% … 32.04%
Time (median): 1.765 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.901 ms ± 462.657 μs ┊ GC (mean ± σ): 3.19% ± 7.53%
█ ▁ ▁
█▃▁▁▄▅▁▁▁▁▁▃▁▁▇█▇▄▁▁▁▁▁▃▃▁▁▁▁▃██▆▄▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃▁▃▃█▇ █
1.75 ms Histogram: log(frequency) by time 4.01 ms <
Memory estimate: 8.00 MiB, allocs estimate: 11.
julia> @benchmark iterate($bv, $statebv)
BenchmarkTools.Trial: 2630 samples with 1 evaluation.
Range (min … max): 1.749 ms … 5.916 ms ┊ GC (min … max): 0.00% … 47.60%
Time (median): 1.762 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.898 ms ± 431.849 μs ┊ GC (mean ± σ): 3.51% ± 8.32%
█ ▂
█▅▃▁▁▅▁▁▁▃▁▁▃▁▁▁▁▁▃▄▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▃█▇▅▇█▇▄▁▁▃▇▅█▇▃▁▁▁▁▁▁▆ █
1.75 ms Histogram: log(frequency) by time 3.64 ms <
Memory estimate: 8.00 MiB, allocs estimate: 9.
julia> @benchmark getobs!(bv_buffer, $bv, idx) setup=(idx=rand(1:length(bv)); bv_buffer = getobs(bv, 1))
BenchmarkTools.Trial: 1152 samples with 1 evaluation.
Range (min … max): 2.394 ms … 3.458 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.402 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.409 ms ± 48.564 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂██▇▁
█████▅▁▅▁▄▁▁▄▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▄▁▄▁▁▁▅▇▆▆ █
2.39 ms Histogram: log(frequency) by time 2.6 ms <
Memory estimate: 416 bytes, allocs estimate: 3.
My first guess was that Julia was being clever and avoiding materializing the allocating versions, so I tried forcing the issue by moving the data to the GPU. But again, the non-allocating version is slower:
julia> @benchmark CUDA.@sync(foo(dl))
BenchmarkTools.Trial: 68 samples with 1 evaluation.
Range (min … max): 70.561 ms … 83.608 ms ┊ GC (min … max): 2.46% … 8.67%
Time (median): 74.843 ms ┊ GC (median): 5.28%
Time (mean ± σ): 74.183 ms ± 1.779 ms ┊ GC (mean ± σ): 4.99% ± 0.91%
▃█ ▃▆ █▂
▇▁▁▁▁▁▁▄▁▁▄▄▁▁▁▁▁▁▁▁▁▁▁▅██▅▁▁▁▁▁▁▁▄▅▁▁▁▁▁▁▁▄███▇██▄▄▁▁▁▁▁▁▄ ▁
70.6 ms Histogram: frequency by time 76.1 ms <
Memory estimate: 256.22 MiB, allocs estimate: 4110.
julia> @benchmark CUDA.@sync(foo(bv_buffer, bv)) setup=(bv_buffer = getobs(bv, 1))
BenchmarkTools.Trial: 55 samples with 1 evaluation.
Range (min … max): 88.198 ms … 90.607 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 88.628 ms ┊ GC (median): 0.00%
Time (mean ± σ): 88.598 ms ± 395.388 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▆▆▃ ▃▃██ ▁
████▁▄▁▁▄▁▁▄▁▁▁▁▁▄████▇▄▇▁▁█▁▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▄▁▁▁▁▁▁▁▄ ▁
88.2 ms Histogram: frequency by time 89.5 ms <
Memory estimate: 205.31 KiB, allocs estimate: 3522.
@awadell1 Could you open some actionable issues based on these benchmarks?
Also, there is buffer=true in DataLoader. Is there a performance issue compared to manually calling getobs!?
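For reference, a sketch of that buffered usage (assuming buffer=true preallocates a buffer from the first batch and then reuses it through getobs!):

dl = DataLoader(x; batchsize=32, shuffle=true, buffer=true)
for batch in dl
    # batch reuses the same preallocated buffer on every iteration
end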
The following gives a shuffled ObsView over the batches, instead of shuffling the underlying data while maintaining the batch view. I'd want something more akin to:
BatchView(shuffleobs(dl.data); batchsize=dl.batchsize, partial=dl.partial)
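As a quick check (hypothetical, reusing dl and the fields above), re-batching the shuffled data should make the batch contents differ, not just the batch order:

dls = BatchView(shuffleobs(dl.data); batchsize=dl.batchsize, partial=dl.partial)
getobs(dls, 1) != getobs(dl, 1) # true with high probability, unlike shuffleobs(dl)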