JuliaML / MLDataUtils.jl

Utility package for generating, loading, splitting, and processing Machine Learning datasets
http://mldatautilsjl.readthedocs.io/
Other
102 stars 20 forks source link

WIP: Data Access Pattern in 0.5 #14

Closed Evizero closed 7 years ago

Evizero commented 7 years ago

WIP implementation for #13


I have updated DataSubset and its tests to 0.5 so far.

Next I'll port Tom's new verbs (in this PR)

Also I shall try to update the package documentation a bit and try avoid breaking changes if I can

Evizero commented 7 years ago

The problem is in the detail though. I could make a flag for the BatchIteratorto be infinite, but for consistency it should then loop over the batches in their true order. The inifite one you have in mind does random subsamples every time. It is conceptually different.

tbreloff commented 7 years ago

Might need to add an extra abstraction for this to work without complicating the other types:

abstract DataIterator
  abstract AbstractObsIterator <: DataIterator
    type ObsIterator <: AbstractObsIterator
    type InfObsIterator <: AbstractObsIterator
  # same for batches
Evizero commented 7 years ago

On that note, can you think of a better name than ObsIterator ? I like the type, but the name could be better

Evizero commented 7 years ago

What if the types reflect the function?

abstract DataIterator

abstract BatchIterator <: DataIterator
immutable Eachbatch <: BatchIterator end
immutable Infbatches <: BatchIterator end

abstract ObsIterator <: DataIterator
immutable Eachobs <: ObsIterator end
immutable Infobs <: Obsiterator end
tbreloff commented 7 years ago

What if the types reflect the function?

Yes I like that. Nitpick... I'd prefer camelcase: EachObs.

I'm wondering if DataSubset can be:

type SubsetObs <: ObsIterator

I'm still not convinced that shuffled and similar should return viewobs and not datasubset.

tbreloff commented 7 years ago

It just really bothers me that DataSubset is a total separate type...

Evizero commented 7 years ago

I'm wondering if DataSubset can be:

I completely removed the ability to iterate over a DataSubset. you don't iterate over a SubArray either, so it doesn't quite add up that it might work when using custom types but not for plain arrays. DataSubset is now just a lazy view and serves as a fall back for types that can't offer such.

tbreloff commented 7 years ago

Sure you can iterate over a subarray just like any other array:

julia> x = rand(10)
10-element Array{Float64,1}:
 0.879267 
 0.7712   
 0.60451  
 0.998378 
 0.674976 
 0.0372644
 0.150353 
 0.460558 
 0.0960781
 0.490523 

julia> y = view(x,2:4)
3-element SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}:
 0.7712  
 0.60451 
 0.998378

julia> for yi in y
       @show yi
       end
yi = 0.7711995499255015
yi = 0.604509881730505
yi = 0.998377795508409
Evizero commented 7 years ago

my point is this:

julia> for x in rand(2,5)
       @show x
       end
x = 0.48582956183939885
x = 0.8691764010130829
x = 0.15545875870878745
x = 0.09107139756381688
x = 0.09622235915747268
x = 0.7638308964988003
x = 0.5252940680537728
x = 0.4871337374237992
x = 0.7661749720385271
x = 0.18993652421902274
Evizero commented 7 years ago

code that expects that the given data iterates by default over its "observations", will not work for plain arrays. We should enforce eachobs(data) for such cases

tbreloff commented 7 years ago

I don't understand your point...

To be clear, I view data iterators/subsets as roughly AbstractVector{Observation}... regardless of what the Observation type is. So if you pass a 3D array to a ObsIterator, each Observation will be a 2D array.

Evizero commented 7 years ago

To be clear, I view data iterators/subsets as roughly AbstractVector{Observation}

I disagree for subsets. For some type I may not know what it represents, just that it is a subset of it. A SubArray isn't a vector of observations either. If I subset some data I expect something that conceptually represents the subset, and not an iterator into the observations of it. If I subset some data I am not asking about an iterator into each observation of that, I am asking about a subset placeholder that at some point down the line some code will know how to deal with what it represents.

So if you pass a 3D array to a ObsIterator, each Observation will be a 2D array.

Yes but then you are asking for an iterator into each observation

tbreloff commented 7 years ago

What purpose do you see for subsets if you won't iterate through them (or otherwise access them like a vector of observations) in some way? Why are they there? I think I'm missing your perspective on when you would ever use a subset.

Evizero commented 7 years ago

One use-case I have in mind is for Augmentor. I will have a type DirImageSource which represents a number of images in some directory and it's subdirectories. As an example: I would like to split it into a training and testset, and then maybe further subset 10 random images from the trainingset and then load that subset with getobs to plot them as examples images in the training test.

Evizero commented 7 years ago

or maybe, to move a bit further away from "you could use eachobs for the last step", I want to compute the mean-image from that last subset

Evizero commented 7 years ago

That doesn't mean that any other code will know how to deal with what my special type returns when getobs is called, by the way, there is no such contract. The patterns should just be an easy tool for people to solve their own problems more conveniently, and all they have to do to use these goodies is implement getobs and nobs

tbreloff commented 7 years ago

One use-case I have in mind is for Augmentor.

I'm clearly missing something, because that example seems to perfectly fit the model:

train, test = splitobs(data) # splitobs returns a length-2 batch iterator
subbatch = datasubset(train, rand(1:nobs(train), 10)  # could add a randbatch method to do this?
mean(subbatch)

which by default would call https://github.com/JuliaLang/julia/blob/master/base/statistics.jl#L10 which depends on an iteration inteface and +// being defined for your image types. If you don't have the iteration, then you don't get all this machinery for free.

Evizero commented 7 years ago

No, I don't want to work with the subset abstraction, I want to work with the data that is underneath, whatever it is. Otherwise I can always wrap into eachobs to get these goodies

tbreloff commented 7 years ago

No, I don't want to work with the subset abstraction, I want to work with the data that is underneath,

Can you go into more detail on your Augmentor example, and why you think the subset abstraction will make things more complicated? What (precisely) are the types at each step of your example, and why?

Evizero commented 7 years ago

It's a conceptual thing. map doesn't return an iterator either. view doesn't return an iterator. They could, but they don't

Evizero commented 7 years ago

Not the perfect examples, I grant you though.

tbreloff commented 7 years ago

map doesn't return an iterator either. view doesn't return an iterator.

Ummm... yes they do:

julia> x = rand(5)
5-element Array{Float64,1}:
 0.38471 
 0.159309
 0.999277
 0.996982
 0.109125

julia> for xi in map(sin,x)
           @show xi
       end
xi = 0.3752900649518715
xi = 0.15863594848736212
xi = 0.8410802667167983
xi = 0.8398366067265394
xi = 0.10890871498222936

julia> for xi in view(x,2:3)
           @show xi
       end
xi = 0.15930895331619754
xi = 0.9992772595118062
Evizero commented 7 years ago

with that logic so does my implementation of batches, and splitobs, which return a Vector

Evizero commented 7 years ago

By the way I am not arguing that DataSubset must not implement the iterator interface. I removed it to force the use of eachobs because otherwise code might easily be brittle when called with SubArrays that are suddenly expected to also iterate of observations

tbreloff commented 7 years ago

More and more I'm wanting everything to return a DataIterator, for consistency. Now that we (might) have distinct ObsIterator and BatchIterator abstractions, I think the api will be much less confusing if a "Batch" or "DataSubset" is really just an ObsIterator:

shuffled(data) --> ObsIterator
eachobs(data) --> ObsIterator
randbatch(data, size=10) --> ObsIterator
bs = batches(data) --> BatchIterator
for batch in bs --> batch is ObsIterator
train,test = splitobs(data) --> train/test are ObsIterators
infbat = infinite_batches(train)  --> BatchIterator
for minibatch in infbat --> minibatch is ObsIterator

Now, you don't want this... I get it. But it greatly simplifies the confusion that a user might have. "I call this function and I get one of 2 types of lazy iterators... ok got it!"

The part you really dislike (I think) is that you have to "extract" your data into a non-MLDataUtils container to do something useful with it. Well, I don't think that's necessarily true. And we could make this one step easier if DataIterator <: AbstractVector. If DataIterator really was a vector, and you really do have a AbstractVector{AugmentedImage}, then maybe you wouldn't be so desperate to avoid the DataIterator type?

Evizero commented 7 years ago

shuffled(data) --> ObsIterator

No reason whatsover to do that for an array which has a perfectly good SubArray

Evizero commented 7 years ago

One thing I dislike is imposing unnecessary Types on potential users that might not need it. Especially not those working with arrays.

Evizero commented 7 years ago

I'd like to keep the user experience as native as possible. Not the developer experience. One might never encounter DataSubset if one is working with arrays, or one may never encounter the iterators when working with splitobs and batches which return plain vectors which is a typical thing julia functions do

tbreloff commented 7 years ago

One thing I dislike is imposing unnecessary Types on potential users that might not need it. Especially not those working with arrays.

But conceptually, anyone using this API is stating implicitly "Look, this is a 3D Array, but really it's not, it's a vector of 2D arrays, and I want ways to treat this 3D array like a vector of 2D arrays without copying the data around."

So giving them back an AbstractVector{Matrix} is exactly what they'd be hoping for in most cases.

Evizero commented 7 years ago

Who knows, maybe there will be TableViews as well, which we could then support and then those working with DataFrames would also never see unfamiliar types

tbreloff commented 7 years ago

I'd like to keep the user experience as native as possible. Not the developer experience.

Me too... which is why I'm pushing for this. I think doing developer magic so that you can return SubArrays will make everything more confusing for newcomers.

Evizero commented 7 years ago

Look, this is a 3D Array, but really it's not, it's a vector of 2D arrays

I disagree there. If I have a typical image array of let's say size (64,64,3,100) and i want a subset of 20 out of that, then I don't think of that as a vector of 20 distinct (64,64,3) arrays, I think of it as one (64,64,3,20) array

Evizero commented 7 years ago

I think doing developer magic so that you can return SubArrays will make everything more confusing for newcomers

I see it the other way around, it returns familiar types by default. Probably most users working with images or numeric data work with plain arrays. It can support special user types if a user wants to opt into that ecosystem. For such it has a fallback-subset

tbreloff commented 7 years ago

You really think returning different type structures depending on the specifics of the input data as being more intuitive? I think a user will be forever confused as to what they'll get back from a method.

Evizero commented 7 years ago

returning different type structures depending on the specifics of the input data

sounds like Julia to me

Evizero commented 7 years ago

Also, it is not that convoluted. The subset returns a view, either native for the type or a fallback. Pretty consistent I think

tbreloff commented 7 years ago

The subset returns a view

Except that it doesn't... for the case of Array, it returns a view(data, :, :, 3:7), not a view(data, 3:7) which is what a "data subset" is. This distinction seems natural to you, but I'm pretty confident it's going to be really confusing for your average data scientist that's not also a serious Julia hacker.

Evizero commented 7 years ago

It does enforce the convention that the last dimension denotes the observation in arrays, that is true. But your proposed solution does the same thing

tbreloff commented 7 years ago

I think we should try to get some outside opinions on this... could start a discussion on julia-users?

Evizero commented 7 years ago

Sure. It won't change much for this PR though, as I am running out of time I can spend on it and I would like to not leave it half done

tbreloff commented 7 years ago

Ok, well I think you should do what you want to do, and I'm going to go back to using what I have in StochasticOptimization for now because I can't use what you're proposing without refactoring other code. Maybe we should just have 2 alternate APIs and move on with our lives...

Evizero commented 7 years ago

I suppose it was just a matter of time before we encounter some decision that is not so easily settled. I am sure we will solve this down the line, given some time. Note, that I don't see my way as final, or this discussion as resolved.

Also, even if we undo the DataSubset in the near or far future, I still think we made good progress in the right direction for a lot of things. And thanks for having this discussion in the first place.

tbreloff commented 7 years ago

Yes agreed on all points... I think we made progress in design, even if we ended up on different sides. :)

I'm going to re-do the iteration inside StochasticOptimization as a sub-module that you have to explicitly import, so that a user could choose which approach to use. And maybe someday we can unify the API more. I'm going to use the ideas we discussed here to do:

export
    DataIterator,
        ObsIterator,        # each iteration is an observation T
            EachObs,
            DataSubset,
            InfiniteObs,
        BatchIterator,      # each iteration is a ObsIterator{T}
            EachBatch,
            Batches,
            InfiniteBatches,
        BatchesIterator,    # each iteration is a BatchIterator{T}
            KFolds

abstract DataIterator{T} <: AbstractVector{T}
    abstract ObsIterator{T} <: DataIterator{T}
    abstract BatchIterator{T} <: DataIterator{T}
    abstract BatchesIterator{T} <: DataIterator{T}
Evizero commented 7 years ago

so infinite_obs -> InfiniteObs, and infinite_batches -> InfiniteBatches ?

tbreloff commented 7 years ago

FYI... subtyping from AbstractVector has the unexpected nicety of pretty printing:

julia> using StochasticOptimization.Iteration

julia> train,test = split_obs(rand(10))
2-element StochasticOptimization.Iteration.Batches{Array{Float64,1},Float64,Array{UnitRange{Int64},1}}:
 [0.260174,0.772488,0.875043,0.605563,0.685558,0.923593,0.630983]
 [0.737278,0.963126,0.394369]                                    

julia> train
7-element StochasticOptimization.Iteration.SubsetObs{Array{Float64,1},Float64,UnitRange{Int64}}:
 0.260174
 0.772488
 0.875043
 0.605563
 0.685558
 0.923593
 0.630983
Evizero commented 7 years ago

That does have some appeal. Certainly worth investigating. I wonder if there are any nuances to be aware of or contracts to obey. I think a long while ago I wanted to make something an abstract vector. Oh yea it was in ValueHistories here in an experimental branch. I remember having issues because of that. But that may have just been me doing something stupid

coveralls commented 7 years ago

Coverage Status

Coverage increased (+1.6%) to 99.083% when pulling f28e179795251c0385c34837486a1038214de3e3 on refactor0.5 into d5503fccb665b3ad37b588536896d203695ae3c8 on master.

Evizero commented 7 years ago

It seems setindex! is not considered optional for abstract arrays: http://docs.julialang.org/en/release-0.5/manual/interfaces/#abstract-arrays. I suspect for some good reason. Not sure how we would implement that

Evizero commented 7 years ago

I am sold on AbstractArray by now. Main reason being that it allows using StatsBase.sample

julia> sample(eachobs(X,y), 5)
5-element Array{Tuple{SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true},String},1}:
 ([6.3,3.3,6.0,2.5],"virginica") 
 ([6.6,2.9,4.6,1.3],"versicolor")
 ([6.1,3.0,4.6,1.4],"versicolor")
 ([6.3,2.3,4.4,1.3],"versicolor")
 ([5.0,2.0,3.5,1.0],"versicolor")
Evizero commented 7 years ago

AbstractArray as base-class has it's issues with infinite iterators though. I am experimenting with reproducible indexing for that.