Closed Evizero closed 7 years ago
The problem is in the detail though. I could make a flag for the BatchIterator
to be infinite, but for consistency it should then loop over the batches in their true order. The inifite one you have in mind does random subsamples every time. It is conceptually different.
Might need to add an extra abstraction for this to work without complicating the other types:
abstract DataIterator
abstract AbstractObsIterator <: DataIterator
type ObsIterator <: AbstractObsIterator
type InfObsIterator <: AbstractObsIterator
# same for batches
On that note, can you think of a better name than ObsIterator
? I like the type, but the name could be better
What if the types reflect the function?
abstract DataIterator
abstract BatchIterator <: DataIterator
immutable Eachbatch <: BatchIterator end
immutable Infbatches <: BatchIterator end
abstract ObsIterator <: DataIterator
immutable Eachobs <: ObsIterator end
immutable Infobs <: Obsiterator end
What if the types reflect the function?
Yes I like that. Nitpick... I'd prefer camelcase: EachObs.
I'm wondering if DataSubset can be:
type SubsetObs <: ObsIterator
I'm still not convinced that shuffled
and similar should return viewobs
and not datasubset
.
It just really bothers me that DataSubset is a total separate type...
I'm wondering if DataSubset can be:
I completely removed the ability to iterate over a DataSubset
. you don't iterate over a SubArray
either, so it doesn't quite add up that it might work when using custom types but not for plain arrays. DataSubset
is now just a lazy view and serves as a fall back for types that can't offer such.
Sure you can iterate over a subarray just like any other array:
julia> x = rand(10)
10-element Array{Float64,1}:
0.879267
0.7712
0.60451
0.998378
0.674976
0.0372644
0.150353
0.460558
0.0960781
0.490523
julia> y = view(x,2:4)
3-element SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}:
0.7712
0.60451
0.998378
julia> for yi in y
@show yi
end
yi = 0.7711995499255015
yi = 0.604509881730505
yi = 0.998377795508409
my point is this:
julia> for x in rand(2,5)
@show x
end
x = 0.48582956183939885
x = 0.8691764010130829
x = 0.15545875870878745
x = 0.09107139756381688
x = 0.09622235915747268
x = 0.7638308964988003
x = 0.5252940680537728
x = 0.4871337374237992
x = 0.7661749720385271
x = 0.18993652421902274
code that expects that the given data
iterates by default over its "observations", will not work for plain arrays. We should enforce eachobs(data)
for such cases
I don't understand your point...
To be clear, I view data iterators/subsets as roughly AbstractVector{Observation}
... regardless of what the Observation type is. So if you pass a 3D array to a ObsIterator, each Observation will be a 2D array.
To be clear, I view data iterators/subsets as roughly AbstractVector{Observation}
I disagree for subsets. For some type I may not know what it represents, just that it is a subset of it. A SubArray isn't a vector of observations either. If I subset some data I expect something that conceptually represents the subset, and not an iterator into the observations of it. If I subset some data I am not asking about an iterator into each observation of that, I am asking about a subset placeholder that at some point down the line some code will know how to deal with what it represents.
So if you pass a 3D array to a ObsIterator, each Observation will be a 2D array.
Yes but then you are asking for an iterator into each observation
What purpose do you see for subsets if you won't iterate through them (or otherwise access them like a vector of observations) in some way? Why are they there? I think I'm missing your perspective on when you would ever use a subset.
One use-case I have in mind is for Augmentor. I will have a type DirImageSource
which represents a number of images in some directory and it's subdirectories. As an example: I would like to split it into a training and testset, and then maybe further subset 10 random images from the trainingset and then load that subset with getobs
to plot them as examples images in the training test.
or maybe, to move a bit further away from "you could use eachobs for the last step", I want to compute the mean-image from that last subset
That doesn't mean that any other code will know how to deal with what my special type returns when getobs
is called, by the way, there is no such contract. The patterns should just be an easy tool for people to solve their own problems more conveniently, and all they have to do to use these goodies is implement getobs
and nobs
One use-case I have in mind is for Augmentor.
I'm clearly missing something, because that example seems to perfectly fit the model:
train, test = splitobs(data) # splitobs returns a length-2 batch iterator
subbatch = datasubset(train, rand(1:nobs(train), 10) # could add a randbatch method to do this?
mean(subbatch)
which by default would call https://github.com/JuliaLang/julia/blob/master/base/statistics.jl#L10 which depends on an iteration inteface and +
//
being defined for your image types. If you don't have the iteration, then you don't get all this machinery for free.
No, I don't want to work with the subset abstraction, I want to work with the data that is underneath, whatever it is. Otherwise I can always wrap into eachobs
to get these goodies
No, I don't want to work with the subset abstraction, I want to work with the data that is underneath,
Can you go into more detail on your Augmentor example, and why you think the subset abstraction will make things more complicated? What (precisely) are the types at each step of your example, and why?
It's a conceptual thing. map
doesn't return an iterator either. view
doesn't return an iterator. They could, but they don't
Not the perfect examples, I grant you though.
map doesn't return an iterator either. view doesn't return an iterator.
Ummm... yes they do:
julia> x = rand(5)
5-element Array{Float64,1}:
0.38471
0.159309
0.999277
0.996982
0.109125
julia> for xi in map(sin,x)
@show xi
end
xi = 0.3752900649518715
xi = 0.15863594848736212
xi = 0.8410802667167983
xi = 0.8398366067265394
xi = 0.10890871498222936
julia> for xi in view(x,2:3)
@show xi
end
xi = 0.15930895331619754
xi = 0.9992772595118062
with that logic so does my implementation of batches
, and splitobs
, which return a Vector
By the way I am not arguing that DataSubset
must not implement the iterator interface. I removed it to force the use of eachobs
because otherwise code might easily be brittle when called with SubArrays
that are suddenly expected to also iterate of observations
More and more I'm wanting everything to return a DataIterator, for consistency. Now that we (might) have distinct ObsIterator and BatchIterator abstractions, I think the api will be much less confusing if a "Batch" or "DataSubset" is really just an ObsIterator:
shuffled(data) --> ObsIterator
eachobs(data) --> ObsIterator
randbatch(data, size=10) --> ObsIterator
bs = batches(data) --> BatchIterator
for batch in bs --> batch is ObsIterator
train,test = splitobs(data) --> train/test are ObsIterators
infbat = infinite_batches(train) --> BatchIterator
for minibatch in infbat --> minibatch is ObsIterator
Now, you don't want this... I get it. But it greatly simplifies the confusion that a user might have. "I call this function and I get one of 2 types of lazy iterators... ok got it!"
The part you really dislike (I think) is that you have to "extract" your data into a non-MLDataUtils container to do something useful with it. Well, I don't think that's necessarily true. And we could make this one step easier if DataIterator <: AbstractVector
. If DataIterator really was a vector, and you really do have a AbstractVector{AugmentedImage}
, then maybe you wouldn't be so desperate to avoid the DataIterator type?
shuffled(data) --> ObsIterator
No reason whatsover to do that for an array which has a perfectly good SubArray
One thing I dislike is imposing unnecessary Types on potential users that might not need it. Especially not those working with arrays.
I'd like to keep the user experience as native as possible. Not the developer experience. One might never encounter DataSubset if one is working with arrays, or one may never encounter the iterators when working with splitobs
and batches
which return plain vectors which is a typical thing julia functions do
One thing I dislike is imposing unnecessary Types on potential users that might not need it. Especially not those working with arrays.
But conceptually, anyone using this API is stating implicitly "Look, this is a 3D Array, but really it's not, it's a vector of 2D arrays, and I want ways to treat this 3D array like a vector of 2D arrays without copying the data around."
So giving them back an AbstractVector{Matrix}
is exactly what they'd be hoping for in most cases.
Who knows, maybe there will be TableViews as well, which we could then support and then those working with DataFrames would also never see unfamiliar types
I'd like to keep the user experience as native as possible. Not the developer experience.
Me too... which is why I'm pushing for this. I think doing developer magic so that you can return SubArrays will make everything more confusing for newcomers.
Look, this is a 3D Array, but really it's not, it's a vector of 2D arrays
I disagree there. If I have a typical image array of let's say size (64,64,3,100) and i want a subset of 20 out of that, then I don't think of that as a vector of 20 distinct (64,64,3) arrays, I think of it as one (64,64,3,20) array
I think doing developer magic so that you can return SubArrays will make everything more confusing for newcomers
I see it the other way around, it returns familiar types by default. Probably most users working with images or numeric data work with plain arrays. It can support special user types if a user wants to opt into that ecosystem. For such it has a fallback-subset
You really think returning different type structures depending on the specifics of the input data as being more intuitive? I think a user will be forever confused as to what they'll get back from a method.
returning different type structures depending on the specifics of the input data
sounds like Julia to me
Also, it is not that convoluted. The subset returns a view, either native for the type or a fallback. Pretty consistent I think
The subset returns a view
Except that it doesn't... for the case of Array, it returns a view(data, :, :, 3:7)
, not a view(data, 3:7)
which is what a "data subset" is. This distinction seems natural to you, but I'm pretty confident it's going to be really confusing for your average data scientist that's not also a serious Julia hacker.
It does enforce the convention that the last dimension denotes the observation in arrays, that is true. But your proposed solution does the same thing
I think we should try to get some outside opinions on this... could start a discussion on julia-users?
Sure. It won't change much for this PR though, as I am running out of time I can spend on it and I would like to not leave it half done
Ok, well I think you should do what you want to do, and I'm going to go back to using what I have in StochasticOptimization for now because I can't use what you're proposing without refactoring other code. Maybe we should just have 2 alternate APIs and move on with our lives...
I suppose it was just a matter of time before we encounter some decision that is not so easily settled. I am sure we will solve this down the line, given some time. Note, that I don't see my way as final, or this discussion as resolved.
Also, even if we undo the DataSubset
in the near or far future, I still think we made good progress in the right direction for a lot of things. And thanks for having this discussion in the first place.
Yes agreed on all points... I think we made progress in design, even if we ended up on different sides. :)
I'm going to re-do the iteration inside StochasticOptimization as a sub-module that you have to explicitly import, so that a user could choose which approach to use. And maybe someday we can unify the API more. I'm going to use the ideas we discussed here to do:
export
DataIterator,
ObsIterator, # each iteration is an observation T
EachObs,
DataSubset,
InfiniteObs,
BatchIterator, # each iteration is a ObsIterator{T}
EachBatch,
Batches,
InfiniteBatches,
BatchesIterator, # each iteration is a BatchIterator{T}
KFolds
abstract DataIterator{T} <: AbstractVector{T}
abstract ObsIterator{T} <: DataIterator{T}
abstract BatchIterator{T} <: DataIterator{T}
abstract BatchesIterator{T} <: DataIterator{T}
so infinite_obs
-> InfiniteObs
, and infinite_batches
-> InfiniteBatches
?
FYI... subtyping from AbstractVector has the unexpected nicety of pretty printing:
julia> using StochasticOptimization.Iteration
julia> train,test = split_obs(rand(10))
2-element StochasticOptimization.Iteration.Batches{Array{Float64,1},Float64,Array{UnitRange{Int64},1}}:
[0.260174,0.772488,0.875043,0.605563,0.685558,0.923593,0.630983]
[0.737278,0.963126,0.394369]
julia> train
7-element StochasticOptimization.Iteration.SubsetObs{Array{Float64,1},Float64,UnitRange{Int64}}:
0.260174
0.772488
0.875043
0.605563
0.685558
0.923593
0.630983
That does have some appeal. Certainly worth investigating. I wonder if there are any nuances to be aware of or contracts to obey. I think a long while ago I wanted to make something an abstract vector. Oh yea it was in ValueHistories here in an experimental branch. I remember having issues because of that. But that may have just been me doing something stupid
It seems setindex!
is not considered optional for abstract arrays: http://docs.julialang.org/en/release-0.5/manual/interfaces/#abstract-arrays. I suspect for some good reason.
Not sure how we would implement that
I am sold on AbstractArray by now. Main reason being that it allows using StatsBase.sample
julia> sample(eachobs(X,y), 5)
5-element Array{Tuple{SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true},String},1}:
([6.3,3.3,6.0,2.5],"virginica")
([6.6,2.9,4.6,1.3],"versicolor")
([6.1,3.0,4.6,1.4],"versicolor")
([6.3,2.3,4.4,1.3],"versicolor")
([5.0,2.0,3.5,1.0],"versicolor")
AbstractArray
as base-class has it's issues with infinite iterators though. I am experimenting with reproducible indexing for that.
WIP implementation for #13
I have updated
DataSubset
and its tests to 0.5 so far.datasubset
Next I'll port Tom's new verbs (in this PR)
eachobs
andshuffled
implementation and testseachbatch
documentation and testsDataIterator
documentationsplitobs
andbatches
implementation and testsviewobs
implementation and testsDataIterator
intoBatchIterator
andObsIterator
filterobs
implementation and testsinfinite_batches
andinfinite_obs
KFolds
,kfolds
,leaveout
implementation and testsAlso I shall try to update the package documentation a bit and try avoid breaking changes if I can
getobs
andnobs