JuliaML / MLDataUtils.jl

Utility package for generating, loading, splitting, and processing Machine Learning datasets
http://mldatautilsjl.readthedocs.io/

RFC: 0.5 updates and redesign #11

Closed. tbreloff closed this issue 8 years ago

tbreloff commented 8 years ago

Ref: https://github.com/JuliaML/MLDataUtils.jl/issues/3#issuecomment-245670963

@Evizero I want you to be on board with this redesign. I've already implemented random minibatches with this pattern:

for batch in batches(inputs, targets, batch_size=100)
    ...
    for (input, target) in batch
        ...
    end
    ...
end

and I like the resulting usage. I use it in StochasticOptimization.jl now: https://github.com/JuliaML/StochasticOptimization.jl/blob/master/test/runtests.jl#L37

What do you think? Should I roll out similar updates for KFolds, etc?
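
For reference, here's a rough sketch of what analogous lowercase k-folds usage could look like. This is purely illustrative: kfold_indices is just a hypothetical helper written out for the sketch, not an existing MLDataUtils function.

# hypothetical helper: assign each of n observations to one of k folds
function kfold_indices(n, k)
    folds = [Int[] for _ in 1:k]
    for i in 1:n
        push!(folds[mod(i - 1, k) + 1], i)
    end
    folds
end

inputs, targets = rand(4, 100), rand(100)
for test_idx in kfold_indices(100, 5)
    train_idx = setdiff(1:100, test_idx)
    train = (view(inputs, :, train_idx), view(targets, train_idx))
    test  = (view(inputs, :, test_idx),  view(targets, test_idx))
    # ... fit on train, evaluate on test ...
end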

Evizero commented 8 years ago

I am going to be straight with you. I am really unsure about all of this. It starts with the completely different type tree and seeing KFolds as a DataIterator. I could get behind solving RandomSamples vs LabeledRandomSamples differently, or even maybe adding a third inner layer of having a "Data wrapper" in the inner loop. But at first glance this seems like a completely different vision of how to perform data handling, one that I am not yet sure I share.

Let's maybe start with something that is more easily communicated. Take the code examples currently in the README: how would they look with all your envisioned changes implemented?

tbreloff commented 8 years ago

One point to make... there have been a lot of big improvements to the array view architecture, and I think it's a worthwhile goal to try to avoid copying and allocating new arrays whenever possible. You can always throw in a copy(x) if you want a new array, but you can't go the other way. Much of the change in logic is in using views instead of copying.

Aside from that, a core difference in the minibatches is that you have:

for (batch_X, batch_y) in MiniBatches(train_X, train_y, size=10)
    # ... train supervised model here
end

and I have:

for batch in batches(x, y, batch_size = 10)
    ...
end

There are a couple of ways to resolve this... one is to recover your version with get(batch):

for batch in batches(x, y, batch_size = 10)
    batch_x, batch_y = get(batch)
    ...
end

or we could support both versions with different methods (the only difference is that one would call get(batch) inside Base.next).
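
To make this concrete, here's a rough, self-contained sketch of the idea using the 0.5 start/done/next iteration protocol. The Batch/Batches/batches names are placeholders for illustration (assuming observations are stored as columns of x), not the actual MLDataUtils API:

immutable Batch{X,Y}
    x::X
    y::Y
    idx::Vector{Int}
end

# get(batch) materializes the batch as views into the parent arrays (no copies)
Base.get(b::Batch) = (view(b.x, :, b.idx), view(b.y, b.idx))

immutable Batches{X,Y}
    x::X
    y::Y
    batch_size::Int
end

batches(x, y; batch_size = 10) = Batches(x, y, batch_size)

Base.start(itr::Batches) = 1
Base.done(itr::Batches, i) = i > size(itr.x, 2)
function Base.next(itr::Batches, i)
    idx = collect(i:min(i + itr.batch_size - 1, size(itr.x, 2)))
    Batch(itr.x, itr.y, idx), i + itr.batch_size
end

# a tuple-yielding variant would be identical except that Base.next would
# return get(batch) instead of the Batch itself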

Also, to ease some of your general concerns about views, hopefully this helps show what you can do:

julia> x = rand(10)
10-element Array{Float64,1}:
 0.666315
 0.416924
 0.38141
 0.561459
 0.213298
 0.665091
 0.428458
 0.368679
 0.132488
 0.591735

julia> x[[1,3,5]]
3-element Array{Float64,1}:
 0.666315
 0.38141
 0.213298

julia> view(x, [1,3,5])
3-element SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}:
 0.666315
 0.38141
 0.213298

julia> copy(ans)
3-element Array{Float64,1}:
 0.666315
 0.38141
 0.213298

julia> y = view(x, [1,3,5])
3-element SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}:
 0.666315
 0.38141
 0.213298

julia> z = view(y, 2:3)
2-element SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}:
 0.38141
 0.213298

julia> z[1] = 1000
1000

julia> x
10-element Array{Float64,1}:
    0.666315
    0.416924
 1000.0
    0.561459
    0.213298
    0.665091
    0.428458
    0.368679
    0.132488
    0.591735

julia> y
3-element SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}:
    0.666315
 1000.0
    0.213298

julia> z
2-element SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}:
 1000.0
    0.213298

tl;dr I think we can still support the examples from the README, but under the hood we would replace "normal/labeled" versions with a single version that supports Tuples of data, and we would avoid copying and use views.
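
Relatedly, here's a minimal sketch of how a single Tuple-aware helper could replace the separate "normal" and "labeled" code paths. viewobs is a made-up name used only for illustration, and it assumes observations sit along the last dimension:

viewobs(A::AbstractMatrix, idx) = view(A, :, idx)   # observations = columns
viewobs(A::AbstractVector, idx) = view(A, idx)
viewobs(data::Tuple, idx) = map(d -> viewobs(d, idx), data)

x, y = rand(4, 100), rand(100)
xb, yb = viewobs((x, y), 1:10)   # views into x and y, no copies, any number of data sources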

tbreloff commented 8 years ago

And a superficial point... I think the lowercase API looks nicer and matches Julia style better. For example, we do:

for i in eachindex(x)
    ...
end

not

for i in IndexIterator(x)
    ...
end

Again, this is superficial and shouldn't distract from the core proposal, which is to use more views and to simplify/generalize the code.

Evizero commented 8 years ago

Ok. Well, I agree on the view vs copy point.

You know, this example

for (batch_X, batch_y) in MiniBatches(train_X, train_y, size=10)
    # ... train supervised model here
end

can also be written as

for batch in MiniBatches(train_X, train_y, size=10)
    # ... train supervised model here
end

since batch is a tuple. I think this is a) more general and b) forces fewer things on the user, as batch_X and batch_y are of the type the user expects instead of some custom data structure.

I do not see the benefit of making sure that batch is of the same type for supervised and unsupervised examples. For the unsupervised case it is the feature batch, and for the supervised case it is a tuple of the feature batch and the target batch, as one would expect.

Evizero commented 8 years ago

In other words, the main thing I see in your batches example is that it takes away the nice tuple-destructuring option from the user.

tbreloff commented 8 years ago

You know, this example ... can also be written as ... since batch is a tuple.

Yes, but then to be able to iterate through batch one observation at a time you'd need to re-wrap the tuple with an "observation iterator" of some sort. You can't just do for (x, y) in batch ... like I wanted.
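
To illustrate the point (made-up variable names, just a sketch): with a plain tuple batch you'd have to re-wrap it yourself to iterate observation-wise, for example by zipping column views,

batch_x, batch_y = rand(4, 10), rand(10)
for (x, y) in zip((view(batch_x, :, j) for j in 1:size(batch_x, 2)), batch_y)
    # x is a single input column (a view), y is the matching target
end

whereas a batch object can define that iteration once, so for (x, y) in batch just works.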

I think making a Batch type was a mistake... all it is is a DataSubset. Likewise, test/train split and minibatches are just examples of "data partitions". In my mind,

The net result is that the type tree is highly compressed and more general (allowing iteration over more than 2 datasets), and the functionality is just different ways to build these simple/common types.

If you're really uncomfortable with all this, I'll add it to StochasticOptimization instead (since no one else is using it yet) and leave MLDataUtils alone, and then we can compare side-by-side and have a discussion again in the future.

tbreloff commented 8 years ago

I'm adding a very simple version of this concept to StochasticOptimization so that you don't have to feel pressured into making drastic changes to MLDataUtils yet. We can discuss again sometime in the future after I've experimented a little more.