With the fix in #97, your example becomes:
julia> using MLUtils
julia> x = rand(100);
julia> x_train, x_val = splitobs(x; at=0.7);
julia> bv = BatchView(x_train; batchsize=10)
BatchView{Vector{Float64}, SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, Val{nothing}}([0.9166614880635612, 0.5513733116026945, 0.2664766210226831, 0.9215978618009951, 0.15930095801259392, 0.28310390900379867, 0.9813957774282672, 0.056657640264914266, 0.14508482981273974, 0.14487454412566503 … 0.7074638083968285, 0.9367841831175056, 0.160254219395352, 0.384295437306849, 0.16793652795004066, 0.3759666249745168, 0.02655634672084961, 0.3216955860573113, 0.6771047948440166, 0.19755907852314547], 10, 7, true, 70)
julia> s = shuffleobs(bv)
ObsView(BatchView(view(::Vector{Float64}, 1:70), batchsize=10, partial=true), ::Vector{Int64})
7 observations
julia> getobs(s, 1)
10-element Vector{Float64}:
0.034930450515507694
0.5960151448216738
0.9959885409830067
0.246327306131219
0.8792138974218081
0.67531260645465
0.8935358034806211
0.5178088319067405
0.9759862713159224
0.9439145657584737
julia> getobs(s, 1:2)
20-element Vector{Float64}:
0.034930450515507694
0.5960151448216738
0.9959885409830067
0.246327306131219
0.8792138974218081
0.67531260645465
0.8935358034806211
0.5178088319067405
0.9759862713159224
0.9439145657584737
0.6611650944902528
0.0678999516165758
0.7265303330511783
0.4733387334578564
0.5900766453884261
0.09660572584674165
0.8162450409901737
0.0512758131627854
0.3055481424109179
0.8606634983122741
From a practical perspective, I'd want shuffleobs(::BatchView) to shuffle the underlying data, not just the order of the fixed batches (which is what #97 does), since ultimately the loop I'm going for is:
dl = BatchView(x_train; batchsize=10)
dl = shuffleobs(dl) # An in-place shuffleobs! would be nice
buffer = getobs(dl, 1)
for bdx in 1:length(dl)
    getobs!(buffer, dl, bdx)
    ... # Fit some model
end
I can open a PR to add shuffleobs(::AbstractRNG, ::BatchView) if that's of interest.
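Roughly, a sketch of what I have in mind (assuming BatchView's batchsize and partial fields shown in the printouts above; a sketch, not a final implementation):

using Random: AbstractRNG
# Shuffle the underlying data, then re-batch it with the same settings,
# so the contents of each batch change, not just the batch order.
function MLUtils.shuffleobs(rng::AbstractRNG, A::BatchView)
    return BatchView(shuffleobs(rng, A.data); batchsize=A.batchsize, partial=A.partial)
end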
Best,
Alex
For example, in the single-batch case I'd want the batches to be different (results below are from main):
julia> using MLUtils
julia> x = rand(10);
julia> x_train, x_val = splitobs(x; at=0.5);
julia> dl = BatchView(x_train; batchsize=5)
BatchView{Vector{Float64}, SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, Val{nothing}}([0.5067688795812658, 0.28329526479397416, 0.5981029438210473, 0.8871276970763463, 0.7243634878808937], 5, 1, true, 5)
julia> dls = shuffleobs(dl)
ObsView(BatchView(view(::Vector{Float64}, 1:5), batchsize=5, partial=true), ::Vector{Int64})
1 observations
julia> getobs(dls) != getobs(dl)
false
Whereas I'd want getobs(dls) != getobs(dl) to be true.
@awadell1 For that use case we have eachobs(data, batchsize=5, shuffle=true) or DataLoader(data, batchsize=5, shuffle=true), both of which shuffle the data before batching. Does that satisfy your needs?
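For example (a minimal sketch; with shuffle=true the observations are reshuffled before batching each time iteration starts):

using MLUtils
x = rand(100)
for batch in DataLoader(x; batchsize=5, shuffle=true)
    # batches are formed from a fresh shuffle of x on each epoch
end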
They're a lot slower for my data type, I think partially due to the generator in DataLoader / eachobs:
# dl_train is a BatchView of a custom type, with `getobs`, `getobs!` and `length` implemented
julia> dl = DataLoader(dl_train.data; batchsize=32, shuffle=true)
julia> _, state = iterate(dl);
julia> @benchmark iterate($dl, $state)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 177.434 μs … 10.176 ms ┊ GC (min … max): 0.00% … 97.63%
Time (median): 180.730 μs ┊ GC (median): 0.00%
Time (mean ± σ): 218.006 μs ± 548.341 μs ┊ GC (mean ± σ): 16.58% ± 6.45%
▁▃▆██▇▆▅▃▁ ▁▂▃▂▁
▂▂▂▃▄▅▆███████████▇▇█████████▆▆▅▄▃▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂ ▄
177 μs Histogram: frequency by time 190 μs <
Memory estimate: 196.91 KiB, allocs estimate: 267.
julia> _, state = iterate(dl_train)
(ObsView(::DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, ::Vector{Int64})
32 observations, 2)
julia> @benchmark iterate($dl_train, $state)
BenchmarkTools.Trial: 10000 samples with 952 evaluations.
Range (min … max): 93.274 ns … 12.577 μs ┊ GC (min … max): 0.00% … 98.26%
Time (median): 100.542 ns ┊ GC (median): 0.00%
Time (mean ± σ): 204.966 ns ± 1.058 μs ┊ GC (mean ± σ): 47.86% ± 9.11%
▃█
██▇▅▅▄▅▅▅▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂ ▃
93.3 ns Histogram: frequency by time 223 ns <
Memory estimate: 416 bytes, allocs estimate: 2.
The issue looks to be the generator in eachobs / DataLoader getting boxed.
DataLoader on v0.2.5:
julia> _, state = iterate(dl);
julia> @code_warntype iterate(dl, state)
MethodInstance for iterate(::DataLoader{ObsView{DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, Vector{Int64}}, Random._GLOBAL_RNG}, ::Tuple{Base.Generator{UnitRange{Int64}, MLUtils.var"#34#36"}, Int64})
from iterate(d::DataLoader, state) in MLUtils at /home/awadell/.julia/packages/MLUtils/W3W0A/src/dataloader.jl:98
Arguments
#self#::Core.Const(iterate)
d::DataLoader{ObsView{DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, Vector{Int64}}, Random._GLOBAL_RNG}
state::Tuple{Base.Generator{UnitRange{Int64}, MLUtils.var"#34#36"}, Int64}
Locals
@_4::Int64
res::Union{Nothing, Tuple{Any, Int64}}
i::Int64
gen::Base.Generator{UnitRange{Int64}, MLUtils.var"#34#36"}
Body::Union{Nothing, Tuple{Any, Tuple{Base.Generator{UnitRange{Int64}, MLUtils.var"#34#36"}, Int64}}}
1 ─ nothing
│ %2 = Base.indexed_iterate(state, 1)::Core.PartialStruct(Tuple{Base.Generator{UnitRange{Int64}, MLUtils.var"#34#36"}, Int64}, Any[Base.Generator{UnitRange{Int64}, MLUtils.var"#34#36"}, Core.Const(2)])
│ (gen = Core.getfield(%2, 1))
│ (@_4 = Core.getfield(%2, 2))
│ %5 = Base.indexed_iterate(state, 2, @_4::Core.Const(2))::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(3)])
│ (i = Core.getfield(%5, 1))
│ (res = MLUtils.iterate(gen, i))
│ %8 = (res === MLUtils.nothing)::Bool
└── goto #3 if not %8
2 ─ return nothing
3 ─ %11 = Base.getindex(res::Tuple{Any, Int64}, 1)::Any
│ %12 = gen::Base.Generator{UnitRange{Int64}, MLUtils.var"#34#36"}
│ %13 = Base.getindex(res::Tuple{Any, Int64}, 2)::Int64
│ %14 = Core.tuple(%12, %13)::Tuple{Base.Generator{UnitRange{Int64}, MLUtils.var"#34#36"}, Int64}
│ %15 = Core.tuple(%11, %14)::Tuple{Any, Tuple{Base.Generator{UnitRange{Int64}, MLUtils.var"#34#36"}, Int64}}
└── return %15
BatchView + getobs! on v0.2.5:
julia> _, state = iterate(dl_train);
julia> @code_warntype iterate(dl_train, state)
MethodInstance for iterate(::BatchView{ObsView{DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, Vector{Int64}}, ObsView{DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, Vector{Int64}}}, ::Int64)
from iterate(A::BatchView, state) in MLUtils at /home/awadell/.julia/packages/MLUtils/W3W0A/src/batchview.jl:123
Arguments
#self#::Core.Const(iterate)
A::BatchView{ObsView{DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, Vector{Int64}}, ObsView{DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, Vector{Int64}}}
state::Int64
Body::Union{Nothing, Tuple{ObsView{DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, Vector{Int64}}, Int64}}
1 ─ %1 = MLUtils.numobs(A)::Int64
│ %2 = (state > %1)::Bool
└── goto #3 if not %2
2 ─ return MLUtils.nothing
3 ─ %5 = Base.getindex(A, state)::ObsView{DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, Vector{Int64}}
│ %6 = (state + 1)::Int64
│ %7 = Core.tuple(%5, %6)::Tuple{ObsView{DeepDFN.Dataloader.FixedDuration{DFNTrace{Float32}, DeepDFN.Dataloader.InMemoryDFNDataset{DFNTrace{Float32}}}, Vector{Int64}}, Int64}
└── return %7
Can you give a fully reproducible example? I don't see such a large discrepancy:
using MLUtils, BenchmarkTools
x = rand(10, 1000)
dl = DataLoader(x, batchsize=32, shuffle=true)
_, statedl = iterate(dl)
bv = BatchView(shuffleobs(x), batchsize=32)
_, statebv = iterate(bv)
@btime iterate($dl, $statedl); # 640.423 ns (5 allocations: 3.08 KiB)
@btime iterate($bv, $statebv); # 544.247 ns (2 allocations: 2.66 KiB)
I copied and pasted your example into a fresh Julia session and got the following results:
System | MLUtils | DataLoader iterate | BatchView iterate |
---|---|---|---|
AMD EPYC 7713, openSUSE Leap v15.3 | 0.2.5 | 531.260 ns | 43.526 ns |
2020 M1 Mac | 0.2.5 | 492.485 ns | 88.209 ns |
AMD EPYC 7713, openSUSE Leap v15.3 | 46e9f2cb | 587.217 ns | 511.384 ns |
2020 M1 Mac | 46e9f2cb | 583.333 ns | 418.553 ns |
I'm guessing the performance regression is due to the increase in allocations:
julia> @btime iterate($bv, $statebv); # M1 Mac @ 46e9f2cb
418.553 ns (2 allocations: 2.66 KiB)
julia> @btime iterate($bv, $statebv); # M1 Mac @ v0.2.5
88.209 ns (2 allocations: 400 bytes)
using Pkg
Pkg.activate(; temp=true)
Pkg.add(name="MLUtils", rev="main")
Pkg.add("BenchmarkTools")
Pkg.add("CUDA")
Pkg.add("Flux")
Pkg.add("Adapt")
using MLUtils, BenchmarkTools, Flux, Adapt, CUDA
struct Wrapper{T}
    data::T
end
Adapt.@adapt_structure Wrapper
Wrapper(x::T) where {T} = Wrapper{T}(deepcopy(x))

struct CustomType{T,N}
    data::Array{T, N}
    function CustomType{T}(n...) where T
        N = length(n)
        new{T,N}(rand(T, n...))
    end
end
CustomType(n...) = CustomType{Float32}(n...)

Base.length(x::CustomType{T, N}) where {T, N} = size(x.data, N)
MLUtils.getobs(x::CustomType{T, 4}, i) where {T} = Wrapper(x.data[:, :, :, i])
function MLUtils.getobs!(buffer::Wrapper, x::CustomType{T, N}, i) where {T, N}
    buffer.data .= selectdim(x.data, N, i)
    return buffer
end

# Route BatchView's getobs! through the custom type's in-place getobs!
function MLUtils.getobs!(buffer, A::MLUtils.BatchView, i)
    obsindices = MLUtils._batchrange(A, i)
    return getobs!(buffer, A.data, obsindices)
end
x = CustomType(64, 64, 128, 512);
# Benchmark getobs
buffer = getobs(x, 1);
@benchmark getobs($x, idx) setup=(idx=rand(1:length(x)))
@benchmark getobs!($buffer, $x, idx) setup=(idx=rand(1:length(x)))
# Benchmark Iterate
dl = DataLoader(x, batchsize=32, shuffle=false);
_, statedl = iterate(dl);
bv = BatchView(shuffleobs(x), batchsize=32);
_, statebv = iterate(bv);
@benchmark iterate($dl, $statedl)
@benchmark iterate($bv, $statebv)
@benchmark getobs!(bv_buffer, $bv, idx) setup=(idx=rand(1:length(bv)); bv_buffer = getobs(bv, 1))
# Allocating path: each iteration materializes a fresh batch
function foo(dl)
    y = 0.0
    for x in dl
        x_gpu = gpu(x)
        y += sum(x_gpu.data)
    end
    return y
end

# In-place path: reuse `buffer` across iterations via getobs!
function foo(buffer, dl)
    y = 0.0
    for bdx in 1:length(dl)
        getobs!(buffer, dl, bdx)
        x_gpu = gpu(buffer)
        y += sum(x_gpu.data)
    end
    return y
end
@benchmark CUDA.@sync(foo(dl))
@benchmark CUDA.@sync(foo(bv_buffer, bv)) setup=(bv_buffer = getobs(bv, 1))
Just setting a baseline for data access. Unsurprisingly, getobs! is faster.
julia> @benchmark getobs($x, idx) setup=(idx=rand(1:length(x)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 51.647 μs … 3.101 ms ┊ GC (min … max): 0.00% … 98.13%
Time (median): 53.451 μs ┊ GC (median): 0.00%
Time (mean ± σ): 65.234 μs ± 82.251 μs ┊ GC (mean ± σ): 4.40% ± 4.06%
▆█▆▁ ▂▃▃▂▁ ▁▂▂▂▂▁ ▂
████▇██▆▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▄▄▃▁▃▄▃▁▁▃▁▁▄▁▁▅███████▇███████ █
51.6 μs Histogram: log(frequency) by time 123 μs <
Memory estimate: 256.48 KiB, allocs estimate: 8.
julia> @benchmark getobs!($buffer, $x, idx) setup=(idx=rand(1:length(x)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 10.710 μs … 43.372 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 11.261 μs ┊ GC (median): 0.00%
Time (mean ± σ): 11.308 μs ± 1.010 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁ ▁▆█▇███▁▃▂ ▃▄▃▃▄ ▁
▁▄████▅▄▄▃▂▂▃▂▃▃▄▄▅▆▇▆██████████▇▇▇▆▇██████▇█▇▇▄▃▂▂▂▂▂▂▁▂▁▁ ▄
10.7 μs Histogram: frequency by time 11.9 μs <
Memory estimate: 48 bytes, allocs estimate: 3.
I do get similar performance for BatchView and DataLoader here, I think in part because both dispatch to getobs instead of getobs!. I'm surprised that explicitly using getobs! (and defining getobs!(buffer, x::BatchView, idx)) gave worse performance than the allocating versions. Any insight into what's going on here?
julia> @benchmark iterate($dl, $statedl)
BenchmarkTools.Trial: 2626 samples with 1 evaluation.
Range (min … max): 1.755 ms … 7.785 ms ┊ GC (min … max): 0.00% … 32.04%
Time (median): 1.765 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.901 ms ± 462.657 μs ┊ GC (mean ± σ): 3.19% ± 7.53%
█ ▁ ▁
█▃▁▁▄▅▁▁▁▁▁▃▁▁▇█▇▄▁▁▁▁▁▃▃▁▁▁▁▃██▆▄▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃▁▃▃█▇ █
1.75 ms Histogram: log(frequency) by time 4.01 ms <
Memory estimate: 8.00 MiB, allocs estimate: 11.
julia> @benchmark iterate($bv, $statebv)
BenchmarkTools.Trial: 2630 samples with 1 evaluation.
Range (min … max): 1.749 ms … 5.916 ms ┊ GC (min … max): 0.00% … 47.60%
Time (median): 1.762 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.898 ms ± 431.849 μs ┊ GC (mean ± σ): 3.51% ± 8.32%
█ ▂
█▅▃▁▁▅▁▁▁▃▁▁▃▁▁▁▁▁▃▄▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▃█▇▅▇█▇▄▁▁▃▇▅█▇▃▁▁▁▁▁▁▆ █
1.75 ms Histogram: log(frequency) by time 3.64 ms <
Memory estimate: 8.00 MiB, allocs estimate: 9.
julia> @benchmark getobs!(bv_buffer, $bv, idx) setup=(idx=rand(1:length(bv)); bv_buffer = getobs(bv, 1))
BenchmarkTools.Trial: 1152 samples with 1 evaluation.
Range (min … max): 2.394 ms … 3.458 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.402 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.409 ms ± 48.564 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂██▇▁
█████▅▁▅▁▄▁▁▄▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▄▁▄▁▁▁▅▇▆▆ █
2.39 ms Histogram: log(frequency) by time 2.6 ms <
Memory estimate: 416 bytes, allocs estimate: 3.
My first guess was that Julia was being clever and avoiding materializing the allocating versions, so I tried forcing the issue by moving the data to the GPU. But again, the non-allocating version is slower:
julia> @benchmark CUDA.@sync(foo(dl))
BenchmarkTools.Trial: 68 samples with 1 evaluation.
Range (min … max): 70.561 ms … 83.608 ms ┊ GC (min … max): 2.46% … 8.67%
Time (median): 74.843 ms ┊ GC (median): 5.28%
Time (mean ± σ): 74.183 ms ± 1.779 ms ┊ GC (mean ± σ): 4.99% ± 0.91%
▃█ ▃▆ █▂
▇▁▁▁▁▁▁▄▁▁▄▄▁▁▁▁▁▁▁▁▁▁▁▅██▅▁▁▁▁▁▁▁▄▅▁▁▁▁▁▁▁▄███▇██▄▄▁▁▁▁▁▁▄ ▁
70.6 ms Histogram: frequency by time 76.1 ms <
Memory estimate: 256.22 MiB, allocs estimate: 4110.
julia> @benchmark CUDA.@sync(foo(bv_buffer, bv)) setup=(bv_buffer = getobs(bv, 1))
BenchmarkTools.Trial: 55 samples with 1 evaluation.
Range (min … max): 88.198 ms … 90.607 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 88.628 ms ┊ GC (median): 0.00%
Time (mean ± σ): 88.598 ms ± 395.388 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▆▆▃ ▃▃██ ▁
████▁▄▁▁▄▁▁▄▁▁▁▁▁▄████▇▄▇▁▁█▁▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▄▁▁▁▁▁▁▁▄ ▁
88.2 ms Histogram: frequency by time 89.5 ms <
Memory estimate: 205.31 KiB, allocs estimate: 3522.
@awadell1 Could you open some actionable issues based on these benchmarks?
Also, there is buffer=true in DataLoader. Is there a performance issue compared to manually calling getobs!?
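For reference, a sketch of that buffered usage (assuming buffer=true preallocates a buffer from the first batch and then reuses it through getobs!):

dl = DataLoader(x; batchsize=32, shuffle=true, buffer=true)
for batch in dl
    # batch reuses the same preallocated buffer on every iteration
end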
The following gives a shuffled ObsView over the batches, instead of shuffling the underlying data while maintaining the batch view. I'd want something more akin to:
BatchView(shuffleobs(dl.data); batchsize=dl.batchsize, partial=dl.partial)
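As a quick check (hypothetical, reusing dl and the fields above), re-batching the shuffled data should make the batch contents differ, not just the batch order:

dls = BatchView(shuffleobs(dl.data); batchsize=dl.batchsize, partial=dl.partial)
getobs(dls, 1) != getobs(dl, 1) # true with high probability, unlike shuffleobs(dl)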