Dataset Abstractions and Access Patterns

Evizero commented 8 years ago

I am copying @tbreloff's post from https://github.com/Evizero/LearnBase.jl/issues/14#issuecomment-193793222, since I think this is the right place for it:

Ok I'll try to add to MLDataUtils... again it will probably follow closely to what I built for OnlineAI, but diverge when needed.

Here's a rough list of "data access patterns"... let me know if you think of any others:

[x] Sequential, ordered (time series, etc)

[x] Shuffled (once per data point, but random order)

[x] Partitioned (create multiple iterators, one per data partition)

[x] Random, Infinite (just keep sampling)

[ ] Stratified sampling (sample equally from each target class)

[x] Cross validation (create k partitioned iterators for cross validation)

Evizero commented 8 years ago

I do like your code snippet a lot:

# given input/target data in some format, get n random samples from the dataset
for (input, target) in RandomSampler(data, n)
    output = fit!(model, input, target)
    # do something with output?
end

# k-fold cross validation.. partition the dataset (without data copies!)
for (train, validate) in CrossValidation(data, k)

    for (input, target) in RandomSampler(train, n)
        # fit, etc
    end

    # access each pair once, sequentially
    for (input, target) in DataIterator(validate)
        # do something with the validation set
    end
end

I also looked at the code you referenced: https://github.com/tbreloff/OnlineAI.jl/blob/dev/src/nnet/data.jl#L136.

I would like to keep the sampler agnostic to what the data actually is, so I do not like the idea of a DataPoint structure. It makes more sense to me for the samplers to be generic enough that one can extend them for special occasions. For example, if I want to create a minibatch stream from a directory source, I would like to be able to define a custom next for the MiniBatches iterator. I also don't want to intrinsically force the data to be labelled.

Basically I want to use basic Julia data structures where possible. I'll give this a shot today and see if I can code something up. It's been far too long since I coded some Julia stuff anyway :-)
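To make the "partition without data copies" idea from the snippet concrete, here is a minimal sketch in current Julia (1.x). It is not the CrossValidation type discussed here: `kfold_indices` is a hypothetical helper, and the data is assumed to be a plain matrix with one observation per column.

```julia
# Hypothetical helper: split 1:n into k contiguous validation ranges and
# return a vector of (train_indices, validation_indices) pairs.
function kfold_indices(n::Integer, k::Integer)
    1 < k <= n || throw(ArgumentError("need 1 < k <= n"))
    bounds = [div(i * n, k) for i in 0:k]          # fold boundaries 0, ..., n
    return [(vcat(1:bounds[i], (bounds[i + 1] + 1):n),   # everything outside the fold
             (bounds[i] + 1):bounds[i + 1])              # the fold itself
            for i in 1:k]
end

X = reshape(collect(1.0:20.0), 2, 10)   # 2 features, 10 observations (columns)

for (train_idx, val_idx) in kfold_indices(size(X, 2), 5)
    train = view(X, :, train_idx)       # views share memory with X; nothing is copied
    val = view(X, :, val_idx)
    @assert size(train, 2) + size(val, 2) == size(X, 2)
end
```

Only index vectors are materialized per fold; the observations themselves stay in `X`.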

tbreloff commented 8 years ago

Awesome. I agree that I don't want to copy/paste from OnlineAI... It would be better to lightly wrap existing julia data structures when possible. I look forward to reviewing whatever you come up with!


tbreloff commented 8 years ago

Ok I'm coming back to this after a while, and I'm thinking about how we might change DataIterators to be more generic and nicer to use. I'm going to think out loud about how I'd like to use it, then we can discuss if/how the code might change. Here's a possible usage: https://github.com/JuliaML/StochasticOptimization.jl/blob/dd846f57cad35e530199c4e2ec00b58907a3de31/src/scratch/learning.jl

There are a few things to note.

- I'd like to avoid the separation of "normal iterator" and "iterator with targets". To avoid it we can accept a tuple of arrays and automatically return a tuple of samples when it's appropriate.
- There's a distinction between a BatchIterator and a Batch... namely that with a BatchIterator we are looping over chunks of data, and it's assumed we would then subsequently loop over the batch to get individual data points.

Here's a possible type tree:

abstract DataIterator

    # each iteration returns either a Batch or a BatchIterator
    abstract SubsetIterator
        type KFolds

    # each iteration returns a Batch
    abstract BatchIterator
        type RandomBatches

    # each iteration returns a datapoint (might be a single view or a tuple of views)
    abstract Batch
        type SequentialBatch
        type RandomBatch

If this sounds reasonable, I'll try to get a working prototype with the whole JuliaML ecosystem to learn some simple models, then submit a PR.

Evizero commented 8 years ago

I am unsure of this. This seems like more complexity for little gain.

- what separation of "normal iterator" and "iterator with targets"? I feel like this is really transparently solved. feed two things in, get two things out. Can you think of some sample code to make me understand what you are thinking of?
- In your code, why use minibatches but then iterate over each datapoint? seems like a waste of a linAlg opportunity. We want to make use of linalg where possible, because it is free parallelisation. If you don't nest iterators this problem goes away, as one can already iterate one element at a time.

That said, I could see an argument that one would like to nest iterators. Hmm, I don't know. I don't particularly like the idea of having a custom type to represent "data". I want to shift that to the user, in the sense that no matter what data the user works with, our data access patterns work if that data has these two functions defined, and give the user in the inner loop what (s)he expects.

tbreloff commented 8 years ago

> what separation of "normal iterator" and "iterator with targets"? I feel like this is really transparently solved. feed two things in, get two things out.

Yeah that wouldn't change from the user's perspective. This would be a simplification/generalization of the internals. For example, right now there is RandomSamples and LabeledRandomSamples which is already 1 too many. But what if you actually want to randomly iterate through X, Y, and Z? We'd need a new type XYZRandomSamples?

> In your code, why use minibatches but then iterate over each datapoint? seems like a waste of a linAlg opportunity.

This isn't always possible/easy, and of course you can always specialize the learn!(..., batch) method to do linalg operations when that's best. The reason for the distinction in my example is to collect an average gradient over the minibatch and update the parameters once with that average, instead of updating on every data point. Again, there are cases where you'd skip the minibatch step and just loop through datapoints, updating the parameters online.
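Both points can be illustrated side by side. The sketch below assumes a linear model with squared-error loss, which is not something the thread specifies. `grad_obs`, `minibatch_step!`, `avg_grad_linalg` and the `lr` keyword are hypothetical names. The loop version accumulates per-observation gradients and updates once per batch; the last function computes the same average as a single linear-algebra expression.

```julia
# Gradient of 0.5 * (w'x - y)^2 with respect to w for one observation (x, y).
grad_obs(w, x, y) = (w' * x - y) .* x

# Average the per-observation gradients over one minibatch, then take a
# single step. `lr` is a hypothetical learning-rate parameter.
function minibatch_step!(w, Xb, yb; lr = 0.1)
    g = zeros(length(w))
    for i in 1:size(Xb, 2)
        g .+= grad_obs(w, view(Xb, :, i), yb[i])
    end
    g ./= size(Xb, 2)
    w .-= lr .* g
    return w
end

# The same averaged gradient written as one matrix expression.
avg_grad_linalg(w, Xb, yb) = Xb * (Xb' * w .- yb) ./ size(Xb, 2)

Xb = [1.0 2.0 3.0; 0.5 -1.0 2.0]   # 2 features, 3 observations
yb = [1.0, 0.0, 2.0]
w = [0.2, -0.3]
g = avg_grad_linalg(w, Xb, yb)
w_new = minibatch_step!(copy(w), Xb, yb; lr = 0.1)
@assert isapprox(w_new, w .- 0.1 .* g)
```

For this particular model the two agree, so a specialized batch method can use the matrix form while the generic per-observation loop remains available for models where no such form exists.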

> I don't particularly like the idea of having a custom type to represent "data".

Maybe you misunderstood? This isn't what I was suggesting. Presumably a user could pass in anything that implements nobs and getobs. I'm suggesting that we'd also have something like:

LearnBase.getobs{T<:Tuple}(tup::T, idx) = map(a -> getobs(a, idx), tup)

so that "tuples of things that can be iterated" are automatically supported. Note that this could be written as a generated function to make it super lightweight. The point is that you could do this with no code changes:

for (a,b,c,d) in RandomSampler(A, B, C, D)
    ...
end

We distinguish batch iteration so that you do:

for batch in batches(A, B, C, D)
    for (a, b, c, d) in batch
        ...
    end
end

where batch is just a lightweight container describing the subset of source data to iterate over. Obviously if you don't want to iterate in batches then you wouldn't use this.
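For readers on current Julia, the tuple forwarding described above can be sketched as follows. The `nobs` and `getobs` functions here are local stand-ins defined in the snippet, not the LearnBase functions, and the check that all tuple elements agree on the number of observations is an added assumption.

```julia
# Local stand-ins for LearnBase.nobs / LearnBase.getobs so the sketch runs
# without any package. Observations are columns for matrices.
nobs(v::AbstractVector) = length(v)
nobs(A::AbstractMatrix) = size(A, 2)
getobs(v::AbstractVector, idx) = v[idx]
getobs(A::AbstractMatrix, idx) = view(A, :, idx)

# Tuples delegate to their elements; all elements must agree on nobs.
function nobs(tup::Tuple)
    n = nobs(first(tup))
    all(nobs(a) == n for a in tup) || throw(DimensionMismatch("unequal nobs"))
    return n
end
getobs(tup::Tuple, idx) = map(a -> getobs(a, idx), tup)

A = rand(3, 8)
B = rand(8)
C = collect(1:8)
a, b, c = getobs((A, B, C), 4)    # one observation from each container
```

Because `map` over a tuple returns a tuple, any number of containers works without a dedicated type per arity.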

Evizero commented 8 years ago

> For example, right now there is RandomSamples and LabeledRandomSamples which is already 1 too many

Yea, but one never uses them explicitly. This is just something that is used in the background to make things fast when dispatched on.

Evizero commented 8 years ago

Maybe you are right. I do not know at this point. I do know that there are some new goodies I want to look at in 0.5 that will inform my understanding. Once I get around to it

tbreloff commented 8 years ago

> Maybe you are right. I do not know at this point.

I'll get a prototype together so we can compare.

> Once I get around to it

0.5 is great! And not too hard to switch. The highlights in my mind are the speed of anonymous functions (and therefore map, etc), and generator expressions like sum(x^2 for x=1:100)

tbreloff commented 8 years ago

I added these definitions:

LearnBase.nobs{T<:Tuple}(tup::T) = nobs(tup[1])
LearnBase.getobs{T<:Tuple}(tup::T, idx) = map(a -> getobs(a, idx), tup)

immutable SequentialBatch{S,I<:AbstractVector} <: Batch
    source::S
    indices::I
end

Base.start(b::SequentialBatch) = 1
Base.done(b::SequentialBatch, i) = i > length(b.indices)
Base.next(b::SequentialBatch, i) = (getobs(b.source, b.indices[i]), i+1)
Base.length(b::SequentialBatch) = length(b.indices)

immutable RandomBatches{S,B<:Batch}
    source::S
    batches::Vector{B}
end

function RandomBatches(source; num_batches::Int = -1, batch_size::Int = -1)
    n = nobs(source)
    batch_size, num_batches = _compute_partitionsettings(n, batch_size, num_batches)
    @assert batch_size > 0 && num_batches > 0
    indices = shuffle(1:n)
    batches = [SequentialBatch(source, indices[i*batch_size+1:(i+1)*batch_size]) for i=0:num_batches-1]
    RandomBatches(source, batches)
end

Base.start(itr::RandomBatches) = start(itr.batches)
Base.done(itr::RandomBatches, i) = done(itr.batches, i)
Base.next(itr::RandomBatches, i) = next(itr.batches, i)
Base.length(itr::RandomBatches) = length(itr.batches)

function batches(arg, args...; kw...)
    if isempty(args)
        RandomBatches(arg; kw...)
    else
        RandomBatches((arg, args...); kw...)
    end
end

Then the pseudocode I wrote above works as expected. This is IMO a much simpler implementation. What do you think?
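The definitions above use the Julia 0.5 iteration protocol (start/done/next). As a hedged sketch of how the same batch type might look on Julia 1.x, where a single `iterate` method replaces those three functions: `SeqBatch` is a renamed, hypothetical stand-in for SequentialBatch, and `getobs` is defined locally rather than taken from LearnBase.

```julia
# Local stand-in for getobs on a matrix: observation i is column i.
getobs(A::AbstractMatrix, i) = view(A, :, i)

struct SeqBatch{S,I<:AbstractVector{<:Integer}}
    source::S
    indices::I
end

# Julia >= 1.0: return `nothing` when exhausted, else (item, next_state).
function Base.iterate(b::SeqBatch, i = 1)
    i > length(b.indices) && return nothing
    return (getobs(b.source, b.indices[i]), i + 1)
end
Base.length(b::SeqBatch) = length(b.indices)

X = reshape(collect(1.0:12.0), 2, 6)
batch = SeqBatch(X, [5, 2, 6])
cols = [copy(x) for x in batch]    # visits columns 5, 2, 6 in that order
```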

tbreloff commented 8 years ago

Oh and I also simplified the defs for arrays:

# add support for arrays up to 4 dimensions
for N in 1:4
    @eval begin
        # size of last dimension
        LearnBase.nobs{T}(A::AbstractArray{T,$N}) = size(A, $N)

        # apply a view to the last dimension
        LearnBase.getobs{T}(A::AbstractArray{T,$N}, idx) = view(A,  $(fill(:(:),N-1)...), idx)
    end
end
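On Julia 1.x the same "last dimension is the observation dimension" convention can be written once for any number of dimensions, without `@eval`. This is only a sketch: `nobs_lastdim` and `getobs_lastdim` are hypothetical names chosen to avoid implying they are the LearnBase definitions.

```julia
# Number of observations = size of the last dimension.
nobs_lastdim(A::AbstractArray) = size(A, ndims(A))

# Index every leading dimension with `:` and the last one with `idx`.
function getobs_lastdim(A::AbstractArray{T,N}, idx) where {T,N}
    return view(A, ntuple(_ -> Colon(), N - 1)..., idx)
end

v = collect(1:5)
M = reshape(collect(1:12), 3, 4)
T3 = reshape(collect(1:24), 2, 3, 4)

col = getobs_lastdim(M, 2)          # a vector view of column 2
slab = getobs_lastdim(T3, 3)        # a 2x3 matrix view
chunk = getobs_lastdim(T3, 2:3)     # a 2x3x2 view of two observations
```

Base also provides `selectdim(A, ndims(A), idx)`, which returns an equivalent view.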

tbreloff commented 8 years ago

And batches can be simplified:

batches(arg, args...; kw...) = RandomBatches(isempty(args) ? arg : (arg, args...); kw...)

Evizero commented 8 years ago

I have been thinking a lot about your recent proposal and the problem it tries to solve. Here are some general remarks

- I agree on the idea of using a lower-case API.
- I see the importance of a general way to loop through all observations of a minibatch.
- I can't think of a scenario where one does not know if labels are present or not, but I think I have a general solution to it nonetheless
- It is of high importance to me that if one is working with plain arrays (or any custom data for that matter) that that person does not have to jump through hoops to do so. This is why I don't like the idea of having some kind of Batch object in the inner loop by default.

My idea to solve this would be something like this:

# X is a matrix of floats
# y is a vector of strings
X, y = load_iris()

# leave out 25 % of data for testing
(cv_X, cv_y), (test_X, test_y) = splitdata(X, y; at = 0.75)

# Partition the data using a 10-fold scheme
for ((train_X, train_y), (val_X, val_y)) in kfolds(cv_X, cv_y, k = 10)

    # Iterate over the data using mini-batches of 5 observations each
    # (Note that batch is not spliced into batch_X and batch_y)
    for batch in minibatches(train_X, train_y, size = 5)
        # batch is Tuple{SubArray{Float64,2}, SubArray{String,1}}  just like it is currently

        for (x, y) in eachobs(batch) # alternatively eachobs(batch_X, batch_y) if preferred
            # do things
        end
    end
end

Would this satisfy your needs? Here the information about labels existing or not would also be hidden and eachobs would work either way. One question that needs to be addressed is what type x should be, a column vector or a row vector.
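One way the inner-loop behaviour might be sketched in current Julia is shown below. `eachobs_sketch` is a hypothetical name, and the sketch assumes one particular answer to the open question: each `x` is a one-dimensional view of a column. That is an assumption for illustration, not a decision made in this thread.

```julia
# Hypothetical `eachobs_sketch`: lazily yields one observation at a time.
eachobs_sketch(X::AbstractMatrix) = (view(X, :, i) for i in axes(X, 2))
eachobs_sketch(y::AbstractVector) = (y[i] for i in eachindex(y))
# Two or more containers are walked in lockstep; zip yields tuples.
eachobs_sketch(a, b, rest...) = zip(map(eachobs_sketch, (a, b, rest...))...)
# A tuple (such as a minibatch) is treated like separate arguments.
eachobs_sketch(batch::Tuple) = eachobs_sketch(batch...)

X = [1.0 2.0 3.0; 4.0 5.0 6.0]
labels = ["a", "b", "c"]

for (x, label) in eachobs_sketch((X, labels))
    @assert x isa AbstractVector    # a column view under the assumption above
    @assert label isa String
end
```

The unlabelled case `eachobs_sketch(X)` yields bare observations rather than one-element tuples, so the loop body does not need to know whether labels exist.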

tbreloff commented 8 years ago

I think we're getting closer. I like eachobs (and in fact added that yesterday). I do think we need to have a distinct api for creating copies vs views. The standard approach should be lazy. One should explicitly ask to make copies (which might be useful if you're doing one giant matrix multiply for example). But I'm ok with returning tuples of views I think.


tbreloff commented 8 years ago

And I think the current getobs is best... Returning a view with the last dimension dropped.


Evizero commented 8 years ago

Right now the idea I am following is to use views internally wherever possible. In 0.4 that was restricted to cases where ranges were used as indices.

Up to the minibatches line, everything was lazy, in the sense that no views were generated, only DataSubset objects. The reason is that this is the most common way to use data, at least in my experience. But yes, it would basically result in a tuple of views (for arrays at least). I think we should not use a tuple if we only have one element, though (i.e. only X for unsupervised learning).

As a last remark: I hope you don't read too much into my recent criticisms, which were very short and maybe a bit poorly explained. I have been very busy with completely different things, but still could not resist participating in the discussions on some level. I took all your posts and suggestions very seriously and continue to appreciate your efforts.

I am sure we will find a solution that will serve both our use cases. We always do.

Evizero commented 8 years ago

Concerning copy vs view. I can think of a lot of cases where one wants to create copies instead of views in various places. The two main reasons I can think of are cache locality and transferring data to the GPU. But as you hint it is quite easy to just call deepcopy, or copy in those cases. Plus we give the user control over deciding when the copying should take place in the data access pipeline
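The difference is easy to see with Base alone. In this small sketch the view keeps tracking the source array, while `copy` produces an independent, contiguous `Matrix`, which is the form one would typically want for cache locality or a device transfer.

```julia
X = reshape(collect(1.0:8.0), 2, 4)

batch_view = view(X, :, 2:3)      # lazy: shares memory with X
batch_copy = copy(batch_view)     # eager: a fresh Matrix{Float64}

X[1, 2] = -1.0                    # mutate the source after taking both

@assert batch_view[1, 1] == -1.0  # the view sees the change
@assert batch_copy[1, 1] == 3.0   # the copy does not
```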

tbreloff commented 8 years ago

I'll try to summarize some of the ways we'd like to access data, and we can bikeshed on the API. Let's assume data with 2 features and 100 observations:

X = rand(2,100) # input matrix
y = rand(100)  # target vector

for x in eachobs(X)
    # x = view(X,:,i)
end

for yi in eachobs(y)
    # yi = y[i]
end

### iterate one observation at a time: obs = (x, yi)

for obs in eachobs(X,y)
    # obs = (view(X,:,i), y[i])
end

for obs in eachobs(shuffled(X,y))
    # obs = (view(X,:,i), y[i])  where i comes from shuffled indices
end

for obs in infinite_obs(X,y)
    # infinitely sample one observation
end

### iterate one batch/partition at a time: batch = (batch_x, batch_y)

# one pass through the data in chunks of 10
for batch in batches(X, y, size=10)
    # batch = (view(X, :, i:i+9), view(y, i:i+9))
    # we could also do "for (x,yi) in ..."
end

for batch in batches(shuffled(X,y), size=10)
    # same, but the indices are shuffled first
end

# a float for size would imply a fractional split.
train, test = batches(X, y, size = 0.7)

# this would work too, since train and test are tuples of views
(train_x, train_y), (test_x, test_y) = batches(X, y, size = 0.7)

# a tuple or vector of floats could give more than 2 partitions:
train, validate, test = batches(X, y, size = (0.6, 0.2))

for batch in infinite_batches(X,y, size=10)
    # infinitely sample a random batch
end

### iterators over partition-iterators

for (train, test) in kfolds(X, y, k=10)
    # train = (train_x, train_y)
end

for (train, test) in leave_one_out(X, y)
    ...
end

### utils

# lazy filter on index
newX, newy = filterobs(i -> y[i] < 2, X, y)

So I hope this is structured enough that the patterns are clear. Note that many of these could support getindex as well so you wouldn't have to iterate. (edit: this is implemented)

We should come up with a naming scheme which has a copy-version of each of these patterns (copy_partitions, etc?) which could possibly keep a temporary array to hold the copied values. The non-qualified names would return views. (edit: use collect to do this... should probably implement a collect! as well)

Am I missing any common ways to iterate? Any problems with these examples?
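As a rough illustration of the fractional split only, here is a sketch under a separate, hypothetical function name (`split_at`) rather than an overloaded `size` keyword. It assumes observations are columns and returns tuples of views.

```julia
# Hypothetical `split_at`: `fractions` are the relative sizes of all but the
# last partition; the last partition receives the remainder.
function split_at(X::AbstractMatrix, y::AbstractVector, fractions::Real...)
    n = size(X, 2)
    n == length(y) || throw(DimensionMismatch("X and y disagree on nobs"))
    fs = Float64[fractions...]
    sum(fs) < 1 || throw(ArgumentError("fractions must sum to less than 1"))
    # Cut points 0 = c_1 <= c_2 <= ... <= c_end = n along the observation axis.
    cuts = [0; [round(Int, n * f) for f in cumsum(fs)]; n]
    return [(view(X, :, (cuts[i] + 1):cuts[i + 1]), view(y, (cuts[i] + 1):cuts[i + 1]))
            for i in 1:length(cuts) - 1]
end

X = rand(2, 100)
y = rand(100)

(train_x, train_y), (test_x, test_y) = split_at(X, y, 0.7)
train, validate, test = split_at(X, y, 0.6, 0.2)
```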

Evizero commented 8 years ago

I like most of it except the fusion of minibatches and data partitions. I can see that they could be treated as the same thing on some level, but I do not think they should be. Keep in mind that we also want to allow for resampling etc. down the line when partitioning data, and even this simple example is awkward, with the interpretation of size changing depending on its value. Furthermore, partitioning should not return views but a DataSubset. The reason is that my data type may be some custom thing that is not an array. For example, in Augmentor I have a DirImageSource that represents all images in some folder structure as a dataset. I only want to load images in the innermost loop, where I use a minibatch for training.

I don't think there is a way around using a DataSubset as a wildcard if we want to allow for custom data sources that may be on disk or internet.

So in your example I think eachobs should yield what you write, but partition should yield DataSubset. On the other hand minibatches should also yield the actual data, in this case views.

This is why I separate Data Partitioning and Data iteration. The partitioning should be lazy for any kind of data(source) the user would want to use
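The argument for a lazy wrapper can be sketched with a toy source that counts how often an observation is actually loaded. Everything here is hypothetical: `CountingSource` stands in for something like a directory of images, `LazySubset` is not the real DataSubset type, and `nobs`/`getobs`/`subset` are defined locally.

```julia
# Hypothetical custom source: pretend each observation is expensive to load.
struct CountingSource
    nitems::Int
    loads::Base.RefValue{Int}      # how many observations were loaded so far
end
CountingSource(n) = CountingSource(n, Ref(0))

nobs(s::CountingSource) = s.nitems
function getobs(s::CountingSource, i::Integer)
    s.loads[] += 1
    return "item-$i"               # stand-in for reading a file from disk
end

# A lazy subset stores only the source and the selected indices.
struct LazySubset{S,I<:AbstractVector{Int}}
    source::S
    indices::I
end
nobs(d::LazySubset) = length(d.indices)
getobs(d::LazySubset, i::Integer) = getobs(d.source, d.indices[i])

# Subsetting a subset composes indices; still nothing is loaded.
subset(d, idx) = LazySubset(d, collect(idx))
subset(d::LazySubset, idx) = LazySubset(d.source, d.indices[idx])

src = CountingSource(100)
train = subset(src, 1:75)
batch = subset(train, 11:15)
@assert src.loads[] == 0           # partitioning touched no data
item = getobs(batch, 1)            # first actual load happens here
```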

Evizero commented 8 years ago

Thinking about this a little more. Maybe we can get around DataSubset if the data are arrays by using multiple dispatch. That wasn't possible before, but now views don't have to be connected anymore. I think that should work, and would be nicer, yes.

Regardless I would like to still split partition and minibatches, even if it is just for the name's sake.

ahwillia commented 8 years ago

Some real quick thoughts:

- Overall 👍
- I'd like to also have eachobs defined for matrix-valued observations, so if X = rand(10,10,10) then eachobs(X) would give view(X,:,:,i). This will work fine if eachobs is a @generated function.
- I'd also like for (train, test) in leaveout(X,N) for leave-N-out cross-validation.
- sample_forever strikes me as a bit verbose, couldn't we do repeatedly(X) similar to how Iterators.jl does it?

tbreloff commented 8 years ago

> I'd like to also have eachobs defined for matrix-valued observations

Yes that was implied (I think)... anything that implements getobs can be used, and we already have generated implementations for arrays.

> I'd also like for (train, test) in leaveout(X,N) for leave-N-out cross-validation

This is already in MLDataUtils, so we would include it.

> sample_forever strikes me as a bit verbose, couldn't we do repeatedly(X) similar to how Iterators.jl does it?

I think this is similar:

for obs in repeatedly(() -> getobs((X, y), rand(1:nobs((X, y)))))
    ...
end

The point is that this would infinitely sample an observation index and call getobs on the source data to retrieve it. So it's not a direct replacement, though I'm happy to hear name suggestions.
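The same idea can be sketched with Base alone on Julia 1.x, without `repeatedly` from Iterators.jl. `sample_obs_forever` is a hypothetical name, and observations are assumed to be matrix columns paired with vector entries.

```julia
using Random

# Hypothetical `sample_obs_forever`: an endless lazy stream of random (x, y)
# pairs. `Iterators.repeated(nothing)` never terminates on its own.
function sample_obs_forever(rng::AbstractRNG, X::AbstractMatrix, y::AbstractVector)
    n = size(X, 2)
    draw() = (i = rand(rng, 1:n); (view(X, :, i), y[i]))
    return (draw() for _ in Iterators.repeated(nothing))
end

X = rand(2, 100)
y = rand(100)
stream = sample_obs_forever(MersenneTwister(42), X, y)

# The caller has to bound the loop, for example with Iterators.take or `break`.
first_five = collect(Iterators.take(stream, 5))
```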

Evizero commented 8 years ago

I think we should stick with the _forever verbosity to make sure everyone sees what is going on. i.e. this loop won't stop and has to be stopped manually

tbreloff commented 8 years ago

> Regardless I would like to still split partition and minibatches, even if it is just for the name's sake.

I just want to make the quick note that it seems silly to me to limit the functionality of an operator like this. I am, however, ok with alias names, where minibatches would actually call through to partitions but with a slightly different signature.

On that note, can you remind me exactly what you'd expect minibatches to be? Is it shuffled? Does it make one pass through the data without replacement? I think it's very clear to see partitions(shuffled(X,y)) but I'd have to constantly reference the docs of minibatches(X,y) as I wouldn't know/remember the author's conventions.

tbreloff commented 8 years ago

I just had the thought that it would be really nice to add a filterobs method which calls getobs on a filtered list of indices. An example, trying to limit MNIST to 0's and 1's:

X # image inputs
y # class numbers
X01, y01 = filterobs(i -> y[i] < 2, X, y)  # lazy views, like any other partition

model = ...
for (train, test) in kfolds(X01,y01)
    # fit the model and eval
end
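A minimal sketch of how such a filter might work on plain arrays follows. `filterobs_sketch` is a hypothetical name; note that the list of kept indices is computed eagerly here, and only the data access is lazy (views, no copies).

```julia
# Hypothetical `filterobs_sketch`: keep the observations whose index satisfies
# `pred`, returning views into X and y rather than copies.
function filterobs_sketch(pred, X::AbstractMatrix, y::AbstractVector)
    keep = filter(pred, 1:length(y))
    return view(X, :, keep), view(y, keep)
end

X = reshape(collect(1.0:12.0), 2, 6)   # 6 observations as columns
y = [0, 3, 1, 7, 0, 1]                 # class numbers

X01, y01 = filterobs_sketch(i -> y[i] < 2, X, y)
```

Since the results are views, writing into `y01` writes through to `y`; wrap the call in `copy` if that is not wanted.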

Evizero commented 8 years ago

I like the direction this idea is heading in.

One thing we should abstain from though (for now at least) are all the things that are more data-frame specific; such as creating dummy variables for categorical predictors.

tbreloff commented 8 years ago

I'm implementing this right now in StochasticOptimization (with the intention of moving to MLDataUtils later). I've added definitions for Base.collect as a means to extract copies. So the copy-version of the API would mean just wrapping the existing API call with collect:

for batch in minibatches(x,y)
    xcopy, ycopy = collect(batch)
end

Does that work for your needs @Evizero?

tbreloff commented 8 years ago

Note that for batch in collect(minibatches(x,y)) ... would also work, but it would build a vector of copied data all at once at the beginning and then iterate through it.

tbreloff commented 8 years ago

Crap:

julia> partitions(X,y)
ERROR: partitions(
[0.316747 0.407742 0.336434 0.539063; 0.827402 0.820675 0.223918 0.69351],

[0.955119,0.759813,0.0938363,0.31395]) has been moved to the package Combinatorics.jl.
Run Pkg.add("Combinatorics") to install Combinatorics on Julia v0.5-
 in partitions(::Array{Float64,2}, ::Vararg{Any,N}) at ./deprecated.jl:200

Name change needed...

tbreloff commented 8 years ago

Question... what do you think about using splitdata for any partitioning that is breaking apart the source sequentially (like grouping in buckets of 10, or test/train split, etc), and then minibatches would imply iterating some (possibly infinite) times, returning a random subset at each iteration?

So in the api above:

partition_forever --> minibatches
partitions --> splitdata

I'm open to alternative names... but does the functionality make sense?

Evizero commented 8 years ago

Why collect and not copy? Would maybe deepcopy work out of the box for a Tuple of SubArrays?

I don't think minibatches should be associated with "forever". Typically a minibatchstream just splits the data sequentially and iterates through it once (in a shuffled order or not), denoting an epoch.
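That reading of a minibatch stream, one pass over the data per epoch in fixed-size chunks, might be sketched like this. `epoch_batches` and its `batchsize` keyword are hypothetical; the visiting order is passed in explicitly so that the ordered and shuffled cases share one code path.

```julia
using Random

# Hypothetical `epoch_batches`: one pass over `indices` in chunks of at most
# `batchsize` observations, returning tuples of views.
function epoch_batches(X::AbstractMatrix, y::AbstractVector, indices; batchsize::Int = 10)
    return [(view(X, :, idx), view(y, idx))
            for idx in Iterators.partition(indices, batchsize)]
end

X = rand(2, 25)
y = rand(25)

ordered = epoch_batches(X, y, 1:25; batchsize = 10)
shuffled_epoch = epoch_batches(X, y, randperm(MersenneTwister(0), 25); batchsize = 10)
```

Each observation appears exactly once per call, and the last batch is simply shorter when the batch size does not divide the number of observations.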

tbreloff commented 8 years ago

> Why collect and not copy?

It seems more clear to me (I would expect copy to make a copy of the structure, not to extract arrays), but I think either will work.

> Typically a minibatchstream just splits the data sequentially and iterates through it once

So what would you call infinite sampling of minibatches?

Do you have an opinion on splitdata (and any other name ideas)?

tbreloff commented 8 years ago

Right now I have:

eachobs
shuffled
infinite_obs (was sample_forever, but even i was confused what i meant by sample)
splitdata (combines @Evizero's splitdata and minibatches... i can be convinced to change this)
infinite_batches (was partition_forever)

Haven't gotten to kfolds or filterobs yet, but I'd say the core api is ready to test. Please try it out and comment!

tbreloff commented 8 years ago

This is ready for review. The entire api is supported, with tests that hopefully cover lots of cases. I renamed splitdata to batches... hopefully that's the right amount of generic. I also implemented filterobs and kfolds/leave_one_out. Please give it a spin!

Evizero commented 8 years ago

I like the direction this is going. I am a bit reserved I will admit. I will need to play around with your code a bit to get a good feeling of the interface. I am leaving for a few days tomorrow, so it will be a week at least before I get to it.

One reservation I have for example is that I am unsure the shuffled approach will work out intuitively. While it does look nice for shuffled itself - which I like - I am not sure if it will for a stratified version. Any idea how that would look in your design?

What I really like so far is eachobs, the option to sample observations forever (although not quite the names yet), and some form of the filterobs idea

Evizero commented 8 years ago

What I mean with my "shuffled" comment is the following.

The current API has a partitiondata intended for randomized datasplitting for training set and test set. This does lend itself for a stratified version of it (i.e. a different function with a similar name), or a parameter specifying to do the sampling in a stratified way.

As far as I can tell in your approach you want to drop the distinction between splitting data and data iteration. So you create a splitdata that does a "static" split, and a shuffled that randomizes the order lazily, which would result in splitdata(shuffled(...)) becoming the substitution for the current partitiondata, which does a randomized split. So far a nice idea.

What I don't quite see is a good way to allow for a stratified data split in your approach. Any ideas?

tbreloff commented 8 years ago

> What I don't quite see is a good way to allow for a stratified data split in your approach. Any ideas?

Need to think on it more. But it might be useful to think of it in terms of warped randomness. If there was a class that could probabilistically choose certain indices more often than others, then we could just call rand on that object to get one or more sampled indices. Then getobs(data, indices) would be our sample.

Assuming we only really care about stratified sampling in an infinite_obs or infinite_batches type of call, then we could do something like:

function infinite_stratified(data)
    idx_list = split_class_indices(data)
    nclasses = length(idx_list)
    sampler = IndexSampler(idx_list, fill(1/nclasses, nclasses))
    repeatedly(() -> getobs(data, rand(sampler)))
end
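The pseudocode above relies on helpers that do not exist yet (split_class_indices, IndexSampler). A runnable approximation of the same two-stage idea, assuming the class of each observation is given as a label vector, could look like the following; `class_indices` and `rand_stratified_index` are hypothetical names.

```julia
using Random

# Hypothetical helper: group observation indices by class label.
function class_indices(labels::AbstractVector)
    groups = Dict{eltype(labels),Vector{Int}}()
    for (i, label) in enumerate(labels)
        push!(get!(() -> Int[], groups, label), i)
    end
    return groups
end

# Draw one index: pick a class uniformly, then an observation within it.
function rand_stratified_index(rng::AbstractRNG, groups::Dict)
    classes = collect(keys(groups))
    return rand(rng, groups[rand(rng, classes)])
end

labels = [fill("a", 95); fill("b", 5)]    # heavily imbalanced classes
groups = class_indices(labels)
rng = MersenneTwister(7)
draws = [labels[rand_stratified_index(rng, groups)] for _ in 1:2000]
```

With uniform class probabilities, the rare class "b" should make up roughly half of `draws` even though it is only 5% of the data.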

Evizero commented 8 years ago

One thing we should aim for is that creating a random split and creating a stratified split has the same kind of interface.

i.e. we should avoid ending up with a situation where a random split looks like splitdata(shuffled(...)), while a stratified split looks like splitdata_stratified(...).

tbreloff commented 8 years ago

But those are very different cases. Unless what you mean is that you want to split the data so that each subset contains all the same class? Stratification assumes several things about your data, remember... specifically that there's a way to ascertain the class of an observation.

One solution:

subsets = stratify(data)

# then we implement rand(subsets::DataSubsets) to first randomly sample a subset,
# and then randomly sample an observation from that subset.
rand(subsets, 10)

tbreloff commented 8 years ago

And the infinite version:

function infinitely_stratified(data)
    subsets = stratify(data)
    repeatedly(() -> rand(subsets))
end

(I know... we need a better naming scheme. It's always the hardest part ;)

Evizero commented 8 years ago

> But those are very different cases.

Mhm, maybe. Personally I think of it as different scenarios of preparing my dataset (let's assume splitting training and testset)

1. Splitting the data statically at some cutoff index. This could be a benchmark set where it is common to use the last 25% for testing, for example.
2. Splitting the data in such a way that the observations are assigned randomly to one partition (but every observation exactly once).
3. Splitting the data in such a way that the classes are similarly distributed on both partitions by using some resampling scheme.
4. Splitting the data in such a way that the classes are equally distributed everywhere by subsampling.

To me it made sense to call the first splitdata and the second partitiondata. I would then probably have called the third something similar, or made it a parameter of partitiondata.

I guess what you mean is that the first two just assign datapoints without changing any properties about the data, while the last two do just that by using some sampling scheme? I guess I can see that. I am certainly warming up to the distinction a bit now
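To make the difference between the second and third scenario tangible, here is a sketch of one possible reading of the third: a split that keeps class proportions similar in both partitions. It assigns every observation exactly once by shuffling within each class, so it is a per-class partition rather than a true resampling scheme. `stratified_split_indices` and its `at` keyword are hypothetical.

```julia
using Random

# Hypothetical `stratified_split_indices`: shuffle within each class, then send
# the first `at` fraction of every class to the first partition.
function stratified_split_indices(rng::AbstractRNG, labels::AbstractVector; at::Real = 0.75)
    0 < at < 1 || throw(ArgumentError("at must lie strictly between 0 and 1"))
    first_part, second_part = Int[], Int[]
    for label in unique(labels)
        idx = shuffle(rng, findall(l -> l == label, labels))
        cut = round(Int, at * length(idx))
        append!(first_part, idx[1:cut])
        append!(second_part, idx[(cut + 1):end])
    end
    return first_part, second_part
end

labels = [fill("a", 80); fill("b", 20)]
train_idx, test_idx = stratified_split_indices(MersenneTwister(3), labels; at = 0.75)
```

Both partitions end up with the same 4:1 ratio of "a" to "b" as the full dataset, which a plain random split only achieves on average.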

JuliaML / MLDataUtils.jl

Dataset Abstractions and Access Patterns #3
