Closed Evizero closed 8 years ago
I think we should take a similar approach to `DataSubset` as the new ImageCore does with `ColorView` (see https://github.com/JuliaImages/ImageCore.jl/blob/master/src/colorchannels.jl#L233-L236)
Concretely I mean that
- `DataSubset(A::Union{Array,SubArray}, idx)` returns a `DataSubset`, just like for any other type
- `datasubset(A::Union{Array,SubArray}, idx)` tries to be smart and returns a `SubArray` for these types instead.

What that means is that functions such as `batches` and `splitobs` avoid the need to return a `DataSubset` for plain arrays, which for many use cases is not really needed. `eachobs`, on the other hand, would always return a `DataSubset`.
To make this work we need to put `Base.get(::DataSubset)` aside in favour of `getobs(::DataSubset)` (note no second parameter), because this can then be the identity function for plain arrays.
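To make the distinction concrete, here is a rough sketch of how I picture the two entry points. Everything in it is a stand-in written for this comment, not the package code: the `DataSubset` struct layout, the helper `obsview`, and the convention that the last array dimension indexes observations are all assumptions.

```julia
# Hypothetical sketch, not the MLDataUtils implementation.
struct DataSubset{T,I}
    data::T
    indices::I
end

# Assumption: the last array dimension indexes observations.
nobs(A::AbstractArray) = size(A, ndims(A))
nobs(s::DataSubset) = length(s.indices)

# Hypothetical helper: view of the observations `idx` along the last dimension.
obsview(A::AbstractArray, idx) = view(A, ntuple(_ -> Colon(), ndims(A) - 1)..., idx)

# Smart constructor: plain arrays get a SubArray, everything else gets boxed.
datasubset(A::Union{Array,SubArray}, idx) = obsview(A, idx)
datasubset(data, idx) = DataSubset(data, idx)

# getobs with an index copies the selected observations ...
getobs(A::AbstractArray, idx) = A[ntuple(_ -> Colon(), ndims(A) - 1)..., idx]
# ... and getobs without an index is the identity for a plain Array,
# a copy for a SubArray, and the actual subsetting for a DataSubset.
getobs(A::Array) = A
getobs(A::SubArray) = copy(A)
getobs(s::DataSubset) = getobs(s.data, s.indices)

X = rand(2, 10)
a = datasubset(X, 1:5)     # SubArray, because X is a plain Array
b = DataSubset(X, 1:5)     # the type constructor always boxes
c = datasubset(1:10, 2:4)  # a range is neither Array nor SubArray, so it is boxed
```

With this split, `getobs(X) === X` holds for a plain `Array`, which is the reason for dropping the second parameter.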
One small issue to be aware of, which is a consequence of dropping trailing dimensions, is that once one is down to a single observation, `nobs` breaks:
X = rand(2,10)
y = rand(10)
s = DataSubset((X,y))
print(rand(s)) # => ([0.0757387,0.54726],0.08468408493279833)
print(nobs(rand(s))) # => 2
This is because the single observation with two features is now interpreted as two observations of one feature.
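The same effect can be reproduced with plain arrays. This is only an illustration: the `nobs` definition below is an assumed stand-in that counts along the last dimension, not the package function.

```julia
# Assumed stand-in: the last dimension indexes observations.
nobs(A::AbstractArray) = size(A, ndims(A))

X = rand(2, 10)

# Indexing with a scalar drops the observation dimension, so the two
# features of one observation look like two observations.
x = X[:, 3]
n_dropped = nobs(x)

# Indexing with a range of length one keeps the dimension.
xv = view(X, :, 3:3)
n_kept = nobs(xv)
```

Here `n_dropped` is 2 and `n_kept` is 1, so keeping the trailing dimension is one way to keep `nobs` meaningful for a single observation.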
Currently `shuffled` seems a bit inconsistent. For now I won't force a `DataSubset` on `shuffled`. Since we can't do
for (x,y) in (X,Y)
    #...
end
but call `eachobs` instead, like this:
for (x,y) in eachobs(X,Y) # which calls eachobs((X,Y))
    #...
end
it makes sense to do the same for `shuffled`:
for (x,y) in eachobs(shuffled(X,Y))
    #...
end
This way we can allow `shuffled` to return a `SubArray` when the parameters are plain `Array`s.
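A minimal sketch of that pattern, assuming the last dimension indexes observations. The functions `nobs`, `obsview`, `shuffled`, and `eachobs` below are simplified stand-ins written for this comment and only handle plain arrays.

```julia
using Random

# Assumed stand-ins, not the package functions.
nobs(A::AbstractArray) = size(A, ndims(A))
obsview(A::AbstractArray, idx) = view(A, ntuple(_ -> Colon(), ndims(A) - 1)..., idx)

# Returns one SubArray per argument, all sharing the same random permutation,
# so corresponding observations stay linked.
function shuffled(data::AbstractArray...)
    n = nobs(first(data))
    all(nobs(d) == n for d in data) || throw(DimensionMismatch("all data must have the same number of observations"))
    perm = randperm(n)
    return map(d -> obsview(d, perm), data)
end

# Lazily yields a tuple with one observation view per element of `data`.
eachobs(data::Tuple) = (map(d -> obsview(d, i), data) for i in 1:nobs(first(data)))
eachobs(data::AbstractArray...) = eachobs(data)

X = reshape(collect(1.0:20.0), 2, 10)  # column i holds (2i - 1, 2i)
Y = collect(1.0:10.0)                  # label i is i

for (x, y) in eachobs(shuffled(X, Y))
    # x is a view of one column of X, y a zero-dimensional view of the matching label
    @assert x[1] == 2 * y[] - 1
end
```

The loop sees the observations in random order, but each `x` still arrives with its own `y`, and `shuffled` itself never has to box plain arrays.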
Sounds fine
I encountered some issues with `DataSubset(::Tuple)` that were probably the reason you introduced `DataSubsets`. Otherwise `for (x,y) in batches(X,Y) ...` would not work properly, since `next` doesn't return a tuple, but a `DataSubset`.
I solved this in a nice way, I think. Now tuples are always unrolled. For example:
X = rand(5,10) # Array will NOT get boxed into DataSubset
y = sprand(10, .5) # SparseVector does get boxed
typeof(datasubset((X,y), 1:5))
Tuple{SubArray{Float64,...},MLDataUtils.DataSubset{SparseVector,...}}
This way `DataSubset` serves as a special fallback for `SubArray`.
So far this seems to solve all problems
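For reference, the unrolling can be sketched roughly like this. It is a simplified stand-in rather than the code in the package: the struct layout and the helper `obsview` are assumptions, and only the tuple method is the point.

```julia
using SparseArrays

# Hypothetical sketch, not the MLDataUtils implementation.
struct DataSubset{T,I}
    data::T
    indices::I
end

# Assumption: the last array dimension indexes observations.
obsview(A::AbstractArray, idx) = view(A, ntuple(_ -> Colon(), ndims(A) - 1)..., idx)

datasubset(A::Union{Array,SubArray}, idx) = obsview(A, idx)  # native view
datasubset(data, idx) = DataSubset(data, idx)                # fallback: box
# Tuples are unrolled: each element is subset on its own with the same indices.
datasubset(t::Tuple, idx) = map(d -> datasubset(d, idx), t)

X = rand(5, 10)
y = sprand(10, 0.5)
sub = datasubset((X, y), 1:5)
```

Because the result is an ordinary tuple, `for (x, y) in ...` style destructuring works without any special iteration support on `DataSubset`.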
outdated!! take a look at #15
List of functionality to be implemented / adapted. A continuation of #3 and based on the proposed changes of Tom in datasubsets.jl at StochasticOptimization.jl.
General design decisions
- `viewobs((X,Y), 1:5)` returns a tuple `(viewobs(X,1:5), viewobs(Y,1:5))`
- `getobs(::MyType, [idx])` and `nobs(::MyType)`
- I currently do not see a reason to have an abstract supertype or a `DataSubsets`. We should discuss this here. Edit: solved

Types
DataSubset
Lazy subsetting of source data, tracking the indices of observations in the source data. This serves as an ML-specific generalization of SubArrays for any kind of data source.
- `datasubset` serves as a smart constructor which returns a type-native view of the data (e.g. `SubArray` for `Array`)
- `DataSubset`
- `getobs` performs the actual subsetting when needed
- `collect` creates a copy? Should we use `copy` instead? Why not?

DataIterator
Abstract base class for all iterators that iterate through data using `datasubset`, `getobs`, and `nobs`

EachObs
For `eachobs`. Iterates through a dataset one observation at a time.

EachBatch
For `eachbatch`. Iterates through a dataset a batch at a time, all batches equally sized.

KFolds
Iterator over k pairs of (train,test) splits, where each split becomes the test set once.
- `k`
- `KFolds` and `LabeledKFolds`
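To illustrate the KFolds description, here is a small sketch of the index bookkeeping only. The function name `kfold_indices` is hypothetical and the contiguous, unshuffled fold layout is an assumption; the iterator type itself is not shown.

```julia
# Hypothetical helper: returns k pairs of (train indices, test indices)
# for n observations. Each observation lands in exactly one test fold.
function kfold_indices(n::Int, k::Int)
    2 <= k <= n || throw(ArgumentError("need 2 <= k <= n"))
    # Fold boundaries from integer arithmetic; fold sizes differ by at most one.
    bounds = [(i * n) ÷ k for i in 0:k]
    return [(vcat(1:bounds[i], bounds[i+1]+1:n), bounds[i]+1:bounds[i+1]) for i in 1:k]
end

folds = kfold_indices(10, 3)
for (train, test) in folds
    # train and test never overlap and together cover all observations
    @assert isempty(intersect(train, test))
    @assert length(train) + length(test) == 10
end
```

The index pairs could then be handed to `datasubset` to obtain the actual train and test views.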
Functions

getobs(data, [idx])
Returns the observations in their native form.
- `SubArray`s will be copied.
- `Array` and `SubArray` should support this as well

viewobs(data, [idx])
Returns a view into the observations.
- `datasubset`
- `SubArray` for `data::(Sub)Array`
- `DataSubset` by default for other types

nobs
Returns the number of observations in the data structure.
- `Array` and `SubArray` should support this as well

eachobs(data...)
Returns an `EachObs` of the given `data`, which will allow iteration through all observations.
- `data` should be obeyed

eachbatch(data...; size, count)
Returns an `EachBatch` of the given `data`, which will allow iteration through all observations in `count` batches of size `size`.
- `data` should be obeyed

shuffled(data...)
Returns a `DataSubset` where the ordering of the underlying observations is randomized.
- `data::(Sub)Array` this returns a `SubArray` with randomized indices into the last dimension
- `RandomSamples`

shuffled!(data...)
Returns `data`, but with the observations shuffled.
- `data::DataSubset` this shuffles the indices inplace.
- `data::Array` this moves around observations and returns the same Array (no boxing)
- `shuffleobs!` maybe?

infinite_obs(data...)
Returns an `InfiniteObs` that samples one observation at a time from `data` randomly and indefinitely.

infinite_batches(data...)
Returns an `InfiniteBatches` that, when iterated over, samples one batch at a time from `data` randomly and indefinitely.

filterobs(f, data...)
TODO: description

batches(data...; size, count)
Returns a vector of the results of `datasubset`, each of which denotes a distinct, equally sized subset of the observations in `data`.
- `size` denotes the number of observations in each batch
- `count` denotes the number of batches
- `MiniBatches`

splitobs(data...; at)
Returns a vector of the results of `datasubset`, which denote distinct subsets of the observations in `data`.
- `batches` but makes the function less convoluted
- `at` denotes a fraction of data in the first split.
- `at` can be an NTuple, which would result in N + 1 splits

kfolds(data...; k)
Lowercase API for `KFolds`

leaveout(data...; size)
Leave N out API for `KFolds`
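
As a rough illustration of `splitobs` and `batches` for plain arrays, here is a simplified sketch. It handles a single array only, supports only a scalar `at` and only the `size` keyword, and drops leftover observations that do not fill a batch; all of those are assumptions made for this example, as is the helper `obsview`.

```julia
# Assumed stand-ins, not the package functions.
nobs(A::AbstractArray) = size(A, ndims(A))
obsview(A::AbstractArray, idx) = view(A, ntuple(_ -> Colon(), ndims(A) - 1)..., idx)

# Two views; `at` is the fraction of observations in the first split.
# Assumes at least two observations.
function splitobs(A::AbstractArray; at::Real = 0.7)
    0 < at < 1 || throw(ArgumentError("at must be in (0, 1)"))
    n = nobs(A)
    n1 = clamp(round(Int, at * n), 1, n - 1)
    return [obsview(A, 1:n1), obsview(A, n1+1:n)]
end

# Equally sized batches of `size` observations each.
# Assumption: observations that do not fill a whole batch are dropped.
function batches(A::AbstractArray; size::Int = 2)
    size >= 1 || throw(ArgumentError("size must be positive"))
    count = nobs(A) ÷ size
    return [obsview(A, (i - 1) * size + 1:i * size) for i in 1:count]
end

X = rand(2, 10)
train, test = splitobs(X; at = 0.7)  # 7 and 3 observations
bs = batches(X; size = 3)            # 3 batches of 3 observations
```

Both functions return `SubArray`s here because the input is a plain `Array`, which matches the intent that `datasubset` picks the type-native view.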