I am going to be straight with you. I am really unsure about all of this. It starts with the completely different type tree and seeing `KFolds` as a `DataIterator`. I could get behind solving `RandomSamples` vs `LabeledRandomSamples` differently, or maybe even adding a third inner layer of having a "data wrapper" in the inner loop. But at first glance this seems like a completely different vision of how to perform data handling, one that I am not yet sure I share.
Let's maybe start with something that is easier to communicate. Looking at the code examples in the README at the moment: how would they look with all your envisioned changes implemented?
One point to make... there have been a lot of big improvements to the array view architecture, and I think it's a worthwhile goal to avoid copying and allocating new arrays whenever possible. One can always throw in a `copy(x)` if they want a new array, but they can't go the other way. Much of the change in logic is in using views instead of copying.
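For instance, a minimal illustration of that point (not code from the package):

```julia
X = rand(100)
v = view(X, 1:10)   # lightweight view: shares memory with X, no copy made
c = copy(v)         # opt in to a fresh Array only when one is really needed
c[1] = 0.0          # mutates only the copy; X and v are untouched
```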
Aside from that, a core difference in the minibatches is that you have:
```julia
for (batch_X, batch_y) in MiniBatches(train_X, train_y, size=10)
    # ... train supervised model here
end
```
and I have:
```julia
for batch in batches(x, y, batch_size = 10)
    ...
end
```
There are a couple of ways to resolve this... one is to recover your version with `get(batch)`:
```julia
for batch in batches(x, y, batch_size = 10)
    batch_x, batch_y = get(batch)
    ...
end
```
or we could support both versions with different methods (the only difference is that one would call `get(batch)` inside `Base.next`).
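A rough sketch of how both could coexist (hypothetical types, written against the Julia 0.5-era iteration protocol to match the `Base.next` mention; none of these names are final):

```julia
import Base: start, done, next, get

# Hypothetical batch object: just a pair of views into the underlying data.
immutable Batch{X,Y}
    x::X
    y::Y
end
get(b::Batch) = (b.x, b.y)   # materialize the tuple of views

# Hypothetical iterator; `eager` decides whether `next` yields the Batch
# itself or calls `get(batch)` first -- the only difference between the
# two APIs discussed above.
immutable BatchIterator{T}
    batches::Vector{T}
    eager::Bool
end

start(it::BatchIterator) = 1
done(it::BatchIterator, i) = i > length(it.batches)
next(it::BatchIterator, i) = (it.eager ? get(it.batches[i]) : it.batches[i], i + 1)
```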
Also, to ease some of your general concerns about views, hopefully this helps show what you can do:
```julia
julia> x = rand(10)
10-element Array{Float64,1}:
 0.666315
 0.416924
 0.38141
 0.561459
 0.213298
 0.665091
 0.428458
 0.368679
 0.132488
 0.591735

julia> x[[1,3,5]]
3-element Array{Float64,1}:
 0.666315
 0.38141
 0.213298

julia> view(x, [1,3,5])
3-element SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}:
 0.666315
 0.38141
 0.213298

julia> copy(ans)
3-element Array{Float64,1}:
 0.666315
 0.38141
 0.213298

julia> y = view(x, [1,3,5])
3-element SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}:
 0.666315
 0.38141
 0.213298

julia> z = view(y, 2:3)
2-element SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}:
 0.38141
 0.213298

julia> z[1] = 1000
1000

julia> x
10-element Array{Float64,1}:
 0.666315
 0.416924
 1000.0
 0.561459
 0.213298
 0.665091
 0.428458
 0.368679
 0.132488
 0.591735

julia> y
3-element SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}:
 0.666315
 1000.0
 0.213298

julia> z
2-element SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}:
 1000.0
 0.213298
```
tl;dr: I think we can still support the examples from the README, but under the hood we would replace the "normal/labeled" versions with a single version that supports tuples of data, and we would avoid copying by using views.
And a superficial point... I think the lowercase API looks nicer and matches Julia style better. For example, we do:
```julia
for i in eachindex(x)
    ...
end
```
not
```julia
for i in IndexIterator(x)
    ...
end
```
Again, this is superficial and shouldn't distract from the core proposal, which is to use more views and to simplify/generalize the code.
Ok. Well, I agree on the view vs copy point.
You know, this example
```julia
for (batch_X, batch_y) in MiniBatches(train_X, train_y, size=10)
    # ... train supervised model here
end
```
can also be written as
```julia
for batch in MiniBatches(train_X, train_y, size=10)
    # ... train supervised model here
end
```
since `batch` is a tuple. I think this is a) more general and b) forces fewer things on the user, as `batch_X` and `batch_y` are of the type the user expects instead of some custom data structure.
I do not see the benefit of making sure that `batch` is of the same type for supervised and unsupervised examples. For the unsupervised case it is the feature batch, and for the supervised case it is a tuple of feature batch and target batch, as one would expect.
In other words, the main thing I see in your `batches` example is that it takes away the nice destructuring option from the user.
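Concretely, the difference looks like this (a sketch using the README-style `MiniBatches` API; the unsupervised form is an assumption based on the normal/labeled split discussed above):

```julia
# Unsupervised: each element is simply the feature batch
# (this one-argument form is assumed, not quoted from the README)
for batch_X in MiniBatches(train_X, size=10)
    # ...
end

# Supervised: each element is a (features, targets) tuple that can be
# destructured right in the loop header
for (batch_X, batch_y) in MiniBatches(train_X, train_y, size=10)
    # ...
end
```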
> You know, this example ... can also be written as ... since `batch` is a tuple.
Yes, but then to be able to iterate through `batch` one observation at a time you'd need to re-wrap the tuple in an "observation iterator" of some sort. You can't just do `for (x,y) in batch ...` like I wanted.
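That is, the intended usage is roughly (a hypothetical sketch of the intent; `batches` is the proposed name from above, not an existing function):

```julia
for batch in batches(x, y, batch_size = 10)
    # `batch` itself is iterable, yielding one observation
    # from each underlying dataset per step
    for (xi, yi) in batch
        # ... update the model with a single observation
    end
end
```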
I think making a `Batch` type was a mistake... all it is is a `DataSubset`. Likewise, test/train split and minibatches are just examples of "data partitions". In my mind:

- `DataSubset` is a lightweight (non-copy) wrapper around one or more datasets
- `DataSubsets` is an iterator/view which returns some number of `DataSubset` or `Tuple{DataSubset}` at each step (passing through params to test/train split)
- you call `get` on a `DataSubset` to get the block views: `x, y = get(subset)`, and there could be a way to choose how to iterate (do we call `get` automatically?)
- iterating through a `DataSubset` will give you a tuple of one observation for each underlying dataset

The net result is that the type tree is highly compressed and more general (allowing iteration over more than 2 datasets), and the functionality is just different ways to build these simple/common types.
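A minimal sketch of that design (a rough illustration only, assuming each dataset is a vector indexed by observation; none of this is final API):

```julia
import Base: start, done, next, get

# Lightweight, non-copying wrapper around one or more datasets
immutable DataSubset{T<:Tuple}
    data::T                 # e.g. (x, y)
    indices::Vector{Int}    # which observations this subset covers
end

# `get` returns the block views, one per underlying dataset
get(s::DataSubset) = map(d -> view(d, s.indices), s.data)

# Iterating a DataSubset yields a tuple with one observation per dataset
start(s::DataSubset) = 1
done(s::DataSubset, i) = i > length(s.indices)
next(s::DataSubset, i) = (map(d -> d[s.indices[i]], s.data), i + 1)
```

With that, `x, y = get(subset)` gives the two block views, and `for (xi, yi) in subset` walks observations one at a time; minibatches, test/train splits, and k-folds reduce to different ways of constructing the index sets.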
If you're really uncomfortable with all this, I'll add it to StochasticOptimization instead (since no one else is using it yet) and leave MLDataUtils alone, and then we can compare side-by-side and have a discussion again in the future.
I'm adding a very simple version of this concept to StochasticOptimization so that you don't have to feel pressured into making drastic changes to MLDataUtils yet. We can discuss again sometime in the future after I've experimented a little more.
Ref: https://github.com/JuliaML/MLDataUtils.jl/issues/3#issuecomment-245670963
@Evizero I want you to be on board with this redesign. I've already implemented random minibatches with a pattern along these lines:
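(The exact snippet isn't reproduced here; the following is a hypothetical sketch in the spirit of the `batches` proposal above, not the actual code from the linked tests.)

```julia
# Hypothetical sketch only -- names like `batches` and `get` follow the
# earlier discussion; the real pattern is in the linked runtests.jl.
for batch in batches(x, y, batch_size = 10)
    batch_x, batch_y = get(batch)   # materialize the two block views
    # ... one update step per (random) minibatch
end
```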
and I like the resulting usage. I use it in StochasticOptimization.jl now: https://github.com/JuliaML/StochasticOptimization.jl/blob/master/test/runtests.jl#L37
What do you think? Should I roll out similar updates for `KFolds`, etc.?