Evizero commented 8 years ago

outdated!! take a look at #15

List of functionality to be implemented / adapted. A continuation of #3 and based on the proposed changes of Tom in datasubsets.jl at StochasticOptimization.jl.

General design decisions

All functions should work on tuples of supported types as well and return the individual results in tuples of the same size and ordering. This allows for labelled data and deprecates the distinction between unlabeled and labeled versions of our types.
- e.g. viewobs((X,Y), 1:5) returns a tuple (viewobs(X,1:5), viewobs(Y,1:5))
For new user-types to opt-in, all they need to implement is getobs(::MyType, [idx]) and nobs(::MyType)
I currently do not see a reason to have an abstract supertype or a DataSubsets type. We should discuss this here. Edit: solved
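As a rough illustration of the two decisions above (tuples map element-wise, and a new type opts in with two methods), here is a minimal sketch. The names my_nobs, my_getobs and Corpus are hypothetical stand-ins chosen for this example and are not the package API; the array methods assume that the last dimension indexes observations.

```julia
# Hypothetical names (my_nobs, my_getobs, Corpus); a sketch, not the MLDataUtils API.

# Arrays: assume the last dimension indexes observations.
my_nobs(A::AbstractArray) = size(A, ndims(A))
my_getobs(A::AbstractArray, idx) = A[ntuple(_ -> Colon(), ndims(A) - 1)..., idx]

# Tuples: apply the function to every element, keeping size and ordering.
function my_nobs(t::Tuple)
    n = my_nobs(first(t))
    all(x -> my_nobs(x) == n, t) ||
        throw(DimensionMismatch("all elements need the same number of observations"))
    return n
end
my_getobs(t::Tuple, idx) = map(x -> my_getobs(x, idx), t)

# A user type opts in by implementing just these two methods.
struct Corpus
    docs::Vector{String}
end
my_nobs(c::Corpus) = length(c.docs)
my_getobs(c::Corpus, idx) = c.docs[idx]

X = reshape(collect(1.0:20.0), 2, 10)   # 2 features, 10 observations
y = collect(1:10)                       # 10 labels
c = Corpus(["doc$i" for i in 1:10])

# One call subsets all three sources consistently.
x5, y5, c5 = my_getobs((X, y, c), 1:5)
```

Because the tuple method only delegates, any mix of arrays and opted-in types can be passed together without a separate labeled variant of each function.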
Types
[x] DataSubset Lazy subsetting of source data, tracking the indices of observations in the source data. This serves as an ML-specific generalization of SubArrays for any kind of data source.
- datasubset serves as a smart constructor which returns a type-native view of the data (e.g. SubArray for Array)
- it can be indexed into which returns a new flat DataSubset
- which means that it can be nested, in which case the indices are simply fused
- getobs performs the actual subsetting when needed
- collect creates a copy? Should we use copy instead? Why not?
[x] DataIterator Abstract base class for all iterators that iterate through data using datasubset, getobs, and nobs
- [x] EachObs For eachobs. iterates through a dataset one observation at a time
- [x] EachBatch For eachbatch. iterates through a dataset one batch at a time, with all batches equally sized
[x] KFolds Iterator over k pairs of (train,test) splits, where each split becomes the testset once
- the sizes of the folds may differ by up to 1 observation, depending on whether the total number of observations is divisible by k
- fusion of current KFolds and LabeledKFolds
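The index fusion described for DataSubset can be sketched roughly as follows. LazySubset and materialize are hypothetical names for this example only; the sketch assumes a vector source and subsetting by a collection of indices.

```julia
# Hypothetical LazySubset type; a sketch of the idea, not the actual DataSubset.
struct LazySubset{T,I<:AbstractVector{Int}}
    data::T        # the untouched source
    indices::I     # positions of the selected observations in `data`
end

# Indexing a subset composes the indices, so nesting stays flat:
# the result points at the original source, not at the outer subset.
Base.getindex(s::LazySubset, idx::AbstractVector) = LazySubset(s.data, s.indices[idx])
Base.length(s::LazySubset) = length(s.indices)

# Only here is data actually copied out of the source.
materialize(s::LazySubset{<:AbstractVector}) = s.data[s.indices]

source = collect(10:10:100)          # 10 observations
outer  = LazySubset(source, 3:8)     # observations 3 to 8
inner  = outer[[1, 6]]               # first and last of those

inner.indices       # [3, 8], expressed relative to `source`
materialize(inner)  # [30, 80]
```

Keeping the subset flat means that repeated subsetting (for example a batch of a fold of a shuffled dataset) costs one index lookup per observation rather than one per nesting level.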
Functions
[x] getobs(data, [idx]) Returns the observations in their native form
- drop trailing dimensions for arrays
- does NOT return a view. SubArrays will be copied.
- native types, such as Array and SubArray should support this as well
- needs documentation
[x] viewobs(data, [idx]) Returns a view into the observations
- same as the function datasubset
- returns a SubArray for data::(Sub)Array
- returns a DataSubset by default for other types
[x] nobs Returns the number of observations in the data structure
- for arrays this is the size of the last dimension.
- stays pretty much the same, just needs updating
- native types, such as Array and SubArray should support this as well
[x] eachobs(data...) Returns an EachObs of the given data, which will allow iteration through all observations
- The current ordering of the observations in data should be obeyed
[x] eachbatch(data...; size, count) Returns an EachBatch of the given data, which will allow iteration through all observations in count batches of size size
- The current ordering of the observations in data should be obeyed
[x] shuffled(data...) Returns a DataSubset where the ordering of the underlying observations is randomized
- This should only permute the indices.
- for data::(Sub)Array this returns a SubArray with randomized indices into the last dimension
- This will replace RandomSamples
[ ] shuffled!(data...) Returns data, but with the observations shuffled.
- for data::DataSubset this shuffles the indices inplace.
- for data::Array this moves around observations and returns the same Array (no boxing)
- Should we use a different name? shuffleobs! maybe?
[ ] infinite_obs(data...) Returns an InfiniteObs that samples one observation at a time from data randomly and indefinitely
[ ] infinite_batches(data...) Returns an InfiniteBatches that when iterated over samples one batch at a time from data randomly and indefinitely
- we should distinguish between randomly choosing some existing batch and creating a new batch of randomly chosen observations
[ ] filterobs(f, data...) TODO: description
[x] batches(data...; size, count) Returns a vector of datasubset results, each of which denotes a distinct, equally sized subset of the observations in data
- size denotes the number of observations in each batch
- count denotes the number of batches
- we should again allow for either size or count to be specified by making use of https://github.com/JuliaML/MLDataUtils.jl/blob/master/src/dataiterators/minibatches.jl#L7-L34
- This will replace MiniBatches
[x] splitobs(data...; at) Returns a vector of datasubset results, each of which denotes a distinct subset of the observations in data
- similar to batches, but makes the function less convoluted
- at denotes a fraction of data in the first split.
- at can be an NTuple, which would result in N + 1 splits
[x] kfolds(data...; k) Lowercase API for KFolds
[x] leaveout(data...; size) Leave N out API for KFolds
- defaults to leave one out
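The index arithmetic behind batches, splitobs and kfolds can be sketched on plain ranges. All three helpers below (batch_ranges, split_ranges, fold_ranges) are hypothetical names for this example. The sketch assumes that leftover observations are dropped when batches must be equally sized, and that split boundaries are rounded from cumulative fractions; the actual implementation may choose differently.

```julia
# Hypothetical helpers that only compute index ranges; not the package API.

# batches: either `size` or `count` may be given. Assumption: batches are
# equally sized and leftover observations are dropped.
function batch_ranges(n::Int; size::Int = 0, count::Int = 0)
    (size > 0 || count > 0) || throw(ArgumentError("specify size or count"))
    size > 0 || (size = n ÷ count)
    count > 0 || (count = n ÷ size)
    (size > 0 && size * count <= n) ||
        throw(ArgumentError("size and count do not fit n"))
    return [((i - 1) * size + 1):(i * size) for i in 1:count]
end

# splitobs: N fractions in `at` give N + 1 consecutive splits.
function split_ranges(n::Int, at::NTuple{N,Float64}) where {N}
    (all(a -> 0 < a < 1, at) && sum(at) < 1) ||
        throw(ArgumentError("fractions must lie in (0, 1) and sum to less than 1"))
    ranges = UnitRange{Int}[]
    acc, start = 0.0, 1
    for a in at
        acc += a
        stop = min(n, round(Int, acc * n))   # boundary from the cumulative fraction
        push!(ranges, start:stop)
        start = stop + 1
    end
    push!(ranges, start:n)                   # the remainder forms the last split
    return ranges
end
split_ranges(n::Int, at::Real) = split_ranges(n, (Float64(at),))

# kfolds: k test folds whose sizes differ by at most one observation.
function fold_ranges(n::Int, k::Int)
    1 < k <= n || throw(ArgumentError("need 1 < k <= n"))
    base, extra = divrem(n, k)               # the first `extra` folds get one more
    ranges = UnitRange{Int}[]
    start = 1
    for i in 1:k
        len = base + (i <= extra ? 1 : 0)
        push!(ranges, start:(start + len - 1))
        start += len
    end
    return ranges
end

batch_ranges(10; size = 3)     # [1:3, 4:6, 7:9]
split_ranges(10, 0.7)          # [1:7, 8:10]
split_ranges(10, (0.5, 0.3))   # [1:5, 6:8, 9:10]
fold_ranges(10, 3)             # [1:4, 5:7, 8:10]
```

Leave-one-out then corresponds to fold_ranges(n, n), where every test fold holds exactly one observation.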

Evizero commented 8 years ago

I think we should take a similar approach to DataSubset as the new ImageCore does with ColorView (see https://github.com/JuliaImages/ImageCore.jl/blob/master/src/colorchannels.jl#L233-L236)

Concretely I mean that

DataSubset(A::Union{Array,SubArray}, idx) returns a DataSubset, just like for any other type
datasubset(A::Union{Array,SubArray}, idx) tries to be smart and returns a SubArray for these types instead.

What that means is that functions such as batches and splitobs avoid the need to return a DataSubset for plain arrays, which for many use cases is not really needed. eachobs, on the other hand, would always return a DataSubset.

To make this work we need to put Base.get(::DataSubset) aside in favour of getobs(::DataSubset) (note no second parameter), because this can then be the identity function for plain arrays.
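A minimal sketch of this split between the type constructor and the lowercase smart constructor, together with a single-argument accessor. BoxedSubset, smart_subset, lastdim_view and fetch_obs are hypothetical names standing in for DataSubset, datasubset and getobs; the sketch assumes the last array dimension indexes observations.

```julia
# Hypothetical names throughout; BoxedSubset stands in for the issue's DataSubset.
struct BoxedSubset{T,I}
    data::T
    indices::I
end

# View along the last dimension (assumed to index observations).
lastdim_view(A, idx) = view(A, ntuple(_ -> Colon(), ndims(A) - 1)..., idx)

# Lowercase smart constructor: plain arrays get a native SubArray,
# every other type falls back to the box.
smart_subset(A::Union{Array,SubArray}, idx) = lastdim_view(A, idx)
smart_subset(data, idx) = BoxedSubset(data, idx)

# Single-argument accessor: identity for arrays, a copy for views,
# and the deferred indexing for boxed subsets.
fetch_obs(A::Array) = A
fetch_obs(A::SubArray) = copy(A)
fetch_obs(s::BoxedSubset{<:AbstractArray}) = copy(lastdim_view(s.data, s.indices))
fetch_obs(s::BoxedSubset) = s.data[s.indices]

X = rand(2, 10)
boxed  = BoxedSubset(X, 1:5)            # always a BoxedSubset, as for any other type
native = smart_subset(X, 1:5)           # a SubArray
text   = smart_subset("abcdefgh", 1:3)  # no native view for a String, so it is boxed
```

With the accessor taking no index argument, calling code can treat all three results the same way: fetch_obs(boxed), fetch_obs(native) and fetch_obs(text) each return the materialized observations.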

Evizero commented 8 years ago

One small issue to be aware of, which is a consequence of dropping trailing dimensions, is that once one is down to a single observation, nobs breaks:

X = rand(2,10)
y = rand(10)

s = DataSubset((X,y))

print(rand(s)) # => ([0.0757387,0.54726],0.08468408493279833)
print(nobs(rand(s))) # => 2

This is because the single observation with two features is now interpreted as two observations of one feature each.
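The same effect can be reproduced with plain arrays. In this sketch nobs_lastdim is a hypothetical helper that applies the "last dimension is the observation dimension" rule; indexing with a length-1 range is shown only as one possible way to keep the dimension, not as the fix adopted in this thread.

```julia
# Hypothetical helper: the last dimension is taken as the observation dimension.
nobs_lastdim(A::AbstractArray) = size(A, ndims(A))

X = rand(2, 10)      # 2 features, 10 observations

x = X[:, 3]          # one observation; the observation dimension is dropped
nobs_lastdim(X)      # 10
nobs_lastdim(x)      # 2: the two features now look like two observations

xk = X[:, 3:3]       # indexing with a length-1 range keeps the dimension
nobs_lastdim(xk)     # 1
```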

Evizero commented 8 years ago

Currently shuffled seems a bit inconsistent. For now I won't force a DataSubset on shuffled

since we can't do

for (x,y) in (X,Y)
    #...
end

but call eachobs instead like this:

for (x,y) in eachobs(X,Y) # which calls eachobs((X,Y))
    #...
end

it makes sense to do the same for shuffled:

for (x,y) in eachobs(shuffled(X,Y))
   #... 
end

This way we can allow shuffled to return a SubArray when the parameters are plain Arrays.
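A rough sketch of what such a shuffled could look like for plain arrays. shuffled_sketch and obsview are hypothetical names; the sketch assumes the last dimension indexes observations and that all elements of the tuple must share one permutation so that features and labels stay aligned.

```julia
using Random

# Hypothetical helpers; a sketch, not the package implementation.
obsview(A::AbstractArray, idx) = view(A, ntuple(_ -> Colon(), ndims(A) - 1)..., idx)

function shuffled_sketch(data::Tuple; rng = Random.default_rng())
    n = size(first(data), ndims(first(data)))
    perm = randperm(rng, n)                  # one permutation shared by all elements
    return map(A -> obsview(A, perm), data)  # only indices are permuted, no data moves
end

X = reshape(collect(1:20), 2, 10)   # 2 features, 10 observations
Y = X[1, :]                         # labels that match the first feature row

Xs, Ys = shuffled_sketch((X, Y); rng = MersenneTwister(1))

# Xs and Ys are SubArrays into X and Y; each pair (Xs[:, i], Ys[i]) still
# belongs together, and X and Y themselves are untouched.
```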

tbreloff commented 8 years ago

Sounds fine

(Emailed reply to Christof Stocker's comment of Monday, October 17, 2016, directly above.)

Evizero commented 8 years ago

I encountered some issues with DataSubset(::Tuple), which were probably the reason you introduced DataSubsets. Otherwise for (x,y) in batches(X,Y) ... would not work properly, since next doesn't return a tuple, but a DataSubset.

I solved this in a nice way, I think. Now tuples are always unrolled. For example:

X = rand(5,10) # Array will NOT get boxed into DataSubset
y = sprand(10, .5) # SparseVector does get boxed

typeof(datasubset((X,y), 1:5))

# => Tuple{SubArray{Float64,...},MLDataUtils.DataSubset{SparseVector,...}}

This way DataSubset serves as a special fallback for SubArray.

So far this seems to solve all problems
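The unrolling can be sketched as one extra method on the smart constructor. BoxedSubset and unrolled_subset are hypothetical names standing in for DataSubset and datasubset; the sketch assumes the last array dimension indexes observations.

```julia
using SparseArrays

# Hypothetical names; BoxedSubset stands in for the issue's DataSubset.
struct BoxedSubset{T,I}
    data::T
    indices::I
end

# Arrays get a native view, anything else is boxed, and tuples are unrolled
# so that the result is again a tuple of per-element subsets.
unrolled_subset(A::Union{Array,SubArray}, idx) =
    view(A, ntuple(_ -> Colon(), ndims(A) - 1)..., idx)
unrolled_subset(data, idx) = BoxedSubset(data, idx)
unrolled_subset(t::Tuple, idx) = map(x -> unrolled_subset(x, idx), t)

X = rand(5, 10)        # Array: not boxed
y = sprand(10, 0.5)    # SparseVector: boxed

xs, ys = unrolled_subset((X, y), 1:5)

# Because each batch is a plain tuple, destructuring in a loop works directly.
for (xb, yb) in [unrolled_subset((X, y), r) for r in (1:5, 6:10)]
    @assert size(xb) == (5, 5)
end
```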

JuliaML / MLDataUtils.jl

Data Access Pattern in 0.5 #13
