A continuation of #3 and #13. Implementation is happening in the refactor0.5 branch.
Main goals
Dispatch on batch- or observation iteration
Use AbstractVector where sensible for its benefits
Allow working with native types where sensible (especially arrays)
Copy on demand
Issues so far
Using AbstractVector as a base type is not without problems. The main issue arises when we want an infinite observation iterator or an infinite batch iterator that samples randomly: such iterators have no natural size, and indexing into them is ill-defined (i.e. my_infinite_batches[11] is not deterministic, so providing it would be a mummer's farce).
Working with tuples should be convenient. It is not that useful if tuples are always decorated within an iterator, although at times it might be.
Observation Dimension
First of all I would like to once and for all dodge the "which array dimension denotes an observation" debate by allowing any to be specified (while defaulting to last).
To do so we can introduce types in a submodule that we can dispatch on:
[x] ObsDim.First(): Use the first dimension to denote the observations
[x] ObsDim.Last(): Use the last dimension to denote the observations
[x] ObsDim.Constant(dim): Use the given dimension dim to denote the observations
Note that these may not always make sense for types other than AbstractArray, in which case the parameter is ignored when specified (maybe with a warning).
All functions that in one way or another need to know which dimension to use for the observation have an optional parameter allowing a ::ObsDimension to be specified. As mentioned before, if nothing is specified explicitly Last() is assumed.
This functionality, once implemented and proven to work, will most likely move to LearnBase.
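The dispatch idea above can be sketched as follows. This is only an illustration in current Julia 1.x syntax, not the code in the refactor0.5 branch: the names `ObsFirst`, `ObsLast`, `ObsConstant`, `obsdim_index` and `count_obs` are hypothetical stand-ins for `ObsDim.First()`, `ObsDim.Last()`, `ObsDim.Constant(dim)` and whatever helpers the package ends up using.

```julia
# Hypothetical stand-ins for the proposed ObsDim marker types.
abstract type ObsDimension end
struct ObsFirst <: ObsDimension end
struct ObsLast <: ObsDimension end
struct ObsConstant <: ObsDimension
    dim::Int
end

# Resolve a marker to a concrete array dimension via dispatch.
obsdim_index(::ObsFirst, A::AbstractArray) = 1
obsdim_index(::ObsLast, A::AbstractArray) = ndims(A)
obsdim_index(d::ObsConstant, A::AbstractArray) = d.dim

# Number of observations; the last dimension is the default, as described above.
count_obs(A::AbstractArray, d::ObsDimension = ObsLast()) = size(A, obsdim_index(d, A))

X = rand(4, 10)                    # 4 features, 10 observations by default
n_default = count_obs(X)           # uses the last dimension
n_first = count_obs(X, ObsFirst()) # uses the first dimension
n_const = count_obs(X, ObsConstant(2))
```

Because the dimension is carried in the type, functions that do not care about it can simply omit the argument.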
Proposed Solution
To allow the use of the access pattern, the following functions have to be implemented for the desired type:
getobs(::MyType, indices, [::ObsDimension]): Returns the observations for the given indices in their native form
indices can be an Int or an AbstractVector
Optionally an ObsDimension parameter can be used. If no such method is implemented for your type, then the parameter is discarded.
nobs(::MyType, [::ObsDimension]): Returns the total number of observations in the data structure
Optionally an ObsDimension parameter can be used. If no such method is implemented for your type, then the parameter is discarded.
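As a rough illustration of what a user would write for a custom type, consider the sketch below (Julia 1.x syntax). The container `FileList` is hypothetical, and the two generic fallback methods are only one assumed way of "discarding" the ObsDimension parameter; the package may do this differently.

```julia
# Minimal marker types so the sketch is self-contained (hypothetical names).
abstract type ObsDimension end
struct ObsLast <: ObsDimension end

# Hypothetical custom container: a list of file names standing in for data.
struct FileList
    paths::Vector{String}
end

# The two functions the access pattern requires.
nobs(data::FileList) = length(data.paths)
getobs(data::FileList, idx::Int) = data.paths[idx]
getobs(data::FileList, idx::AbstractVector) = data.paths[idx]

# Assumed fallbacks: if a type has no dimension-aware method,
# the ObsDimension argument is simply dropped.
nobs(data, ::ObsDimension) = nobs(data)
getobs(data, idx, ::ObsDimension) = getobs(data, idx)

files = FileList(["a.png", "b.png", "c.png"])
one = getobs(files, 2)
some = getobs(files, [1, 3], ObsLast())
```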
Data Subset
Subsets are really just a placeholder for what is underneath, which is a subset of some data, and should be treated like any other type, such as SubArray. A DataSubset is not iterable and is unrolled on tuples, i.e. @assert typeof(datasubset(my_tuple)) <: Tuple. Where possible, native types are used instead of DataSubset, for example @assert typeof(datasubset(my_array)) <: SubArray
The following functions make use of just DataSubset without other fancy things:
[x] datasubset: Smart constructor for DataSubset
for data::(Sub)Array this returns a SubArray with randomized indices into the last dimension
internal functions usually work with this smart constructor instead of DataSubset directly
[x] shuffleobs: Returns a datasubset where the ordering of the underlying observations is randomized
This should only permute the indices.
[x] splitobs: Returns a vector of datasubset results, each of which denotes a distinct subset of the observations in data
e.g. used to split data into a training set and a test set
at denotes the fraction of data in the first split.
at can be an NTuple, which would result in N + 1 splits
When working with plain arrays, no custom types will appear anywhere.
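For plain matrices, the behaviour described above could look roughly like the following sketch (Julia 1.x syntax). The helpers `obsview`, `split_sketch` and `shuffle_sketch` are hypothetical names and not the package's `datasubset`, `splitobs` or `shuffleobs`; the sketch also assumes the last dimension holds the observations and rounds split boundaries to the nearest observation.

```julia
using Random

# Hypothetical helper: lazy view of the observations `idx` (last dimension).
obsview(X::AbstractMatrix, idx) = view(X, :, idx)

# N fractions give N + 1 consecutive, non-overlapping views; no data is copied.
function split_sketch(X::AbstractMatrix, at::NTuple{N,Real}) where N
    n = size(X, 2)
    stops = [round(Int, f * n) for f in cumsum(collect(at))]
    starts = [1; stops .+ 1]
    stops = [stops; n]
    return [obsview(X, a:b) for (a, b) in zip(starts, stops)]
end
split_sketch(X::AbstractMatrix, at::Real) = split_sketch(X, (at,))

# Shuffling only permutes the indices of the view.
shuffle_sketch(X::AbstractMatrix, rng = Random.default_rng()) =
    obsview(X, randperm(rng, size(X, 2)))

X = reshape(collect(1.0:20.0), 2, 10)
train, test = split_sketch(X, 0.7)
three = split_sketch(X, (0.5, 0.3))
shuffled = shuffle_sketch(X, MersenneTwister(1))
```

All results are ordinary `SubArray`s, which matches the goal that no custom types appear when working with plain arrays.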
Data View
Similar to the new Images effort, I think it makes sense to provide special view types that allow treating data as a vector of observations or a vector of batches. These views are also lazy, meaning the subsets are created when indexed into.
These do not treat tuples in a special way, i.e. @assert typeof(ObsView(some_tuple)) <: ObsView
[x] eachobs: alias for the BufferGetObs(ObsView(...)) constructor
[x] eachbatch: alias for the BufferGetObs(BatchView(...)) constructor
Both iterate over the dataset once, but reuse a buffer on each iteration to provide the actual data (not a lazy subset), i.e. for memory-efficient iteration.
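To make the "lazy vector of observations" and the buffer reuse concrete, here is a small sketch in Julia 1.x syntax. `ObsViewSketch` and `eachobs_sketch` are hypothetical and restricted to matrices with observations in the last dimension; they are not the proposed `ObsView` or `BufferGetObs` implementations.

```julia
# Hypothetical lazy view: element i is a view of column i of the matrix.
struct ObsViewSketch{M<:AbstractMatrix} <: AbstractVector{AbstractVector}
    data::M
end

Base.size(v::ObsViewSketch) = (size(v.data, 2),)
Base.IndexStyle(::Type{<:ObsViewSketch}) = IndexLinear()
# The subset is only created when the view is indexed.
Base.getindex(v::ObsViewSketch, i::Int) = view(v.data, :, i)

# Iterate once over all observations, copying each into one reused buffer.
function eachobs_sketch(f, X::AbstractMatrix)
    buffer = similar(X, size(X, 1))
    for obs in ObsViewSketch(X)
        copyto!(buffer, obs)   # actual data, not a lazy subset
        f(buffer)
    end
end

X = [1.0 2.0 3.0; 4.0 5.0 6.0]
v = ObsViewSketch(X)
sums = Float64[]
eachobs_sketch(X) do buf
    push!(sums, sum(buf))
end
```

Since the buffer is overwritten on every iteration, a caller that wants to keep an observation has to copy it.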
Data Iterator
Yet we would like to allow other observation providers and batch providers that do not lend themselves to being subtypes of AbstractVector (e.g. infinite obs):
DataIterator
    ObsIterator <: DataIterator
        RandomObs <: ObsIterator # Randomly sampled obs as datasubset (can be used for infinite iteration)
    BatchIterator <: DataIterator
        RandomBatches <: BatchIterator # Randomly sampled batches as datasubset (can be used for infinite iteration)
This way it is extensible for users, who are able to subtype from ObsIterator etc. These make no guarantees of being abstract vectors. This way we can later extend it to provide stratified iterators.
The remaining issue is that we don't have a common supertype for RandomObs and ObsView.
Without multiple inheritance this seems like the cleanest solution at the moment. Given that ObsIterator can be subtyped, this should not have any negative consequences concerning extensibility.
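A sketch of how such an iterator could sit in the hierarchy without being an AbstractVector is shown below (Julia 1.x syntax). The abstract type names follow the tree above; `RandomObsSketch` is a hypothetical, matrix-only stand-in for `RandomObs`, and the explicit RNG field is an assumption made so the example is reproducible.

```julia
using Random

abstract type DataIterator end
abstract type ObsIterator <: DataIterator end
abstract type BatchIterator <: DataIterator end

# Hypothetical infinite iterator: each step yields a randomly chosen observation.
struct RandomObsSketch{M<:AbstractMatrix,R<:AbstractRNG} <: ObsIterator
    data::M
    rng::R
end

# No natural size and no indexing: only the iteration protocol is implemented.
Base.IteratorSize(::Type{<:RandomObsSketch}) = Base.IsInfinite()
Base.IteratorEltype(::Type{<:RandomObsSketch}) = Base.EltypeUnknown()

function Base.iterate(it::RandomObsSketch, state = nothing)
    i = rand(it.rng, 1:size(it.data, 2))
    return view(it.data, :, i), nothing   # never returns `nothing`, so never ends
end

X = [1.0 2.0 3.0; 4.0 5.0 6.0]
it = RandomObsSketch(X, MersenneTwister(0))
first_five = collect(Iterators.take(it, 5))
```

Code that accepts either kind of provider can still dispatch on a `Union` of the view type and `ObsIterator`, which is one way to live without a common supertype.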
Other patterns
Down the road we may come up with a generalization of resampling patterns (for example when introducing a stratified version of k-folds), but for now it seems sufficient not to give KFolds a supertype
[x] KFolds: Iterator over k pairs of (train, test) splits, where each split becomes the test set once
the sizes of the folds may differ by up to 1 observation, depending on whether the total number of observations is divisible by k
fusion of current KFolds and LabeledKFolds
[x] kfolds(data, k, obsdim): Lowercase API for KFolds
[x] leaveout(data, size, obsdim): Leave-N-out API for KFolds
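The fold arithmetic can be sketched on plain indices as follows (Julia 1.x syntax). `fold_ranges`, `kfold_indices` and `leaveout_indices` are hypothetical helpers, not the package's `KFolds`; in particular, deriving k as `fld(n, testsize)` for the leave-N-out case is an assumption made for this example.

```julia
# Contiguous index ranges for k folds over n observations.
# The first `n % k` folds get one extra observation, so sizes differ by at most 1.
function fold_ranges(n::Int, k::Int)
    1 < k <= n || throw(ArgumentError("k must satisfy 1 < k <= n"))
    base, extra = divrem(n, k)
    sizes = [base + (i <= extra ? 1 : 0) for i in 1:k]
    stops = cumsum(sizes)
    starts = stops .- sizes .+ 1
    return [a:b for (a, b) in zip(starts, stops)]
end

# k pairs of (train, test) indices; each fold is the test set exactly once.
kfold_indices(n::Int, k::Int) =
    [(setdiff(1:n, test), test) for test in fold_ranges(n, k)]

# Assumption: leave-N-out picks k so that each test fold has about `testsize` obs.
leaveout_indices(n::Int, testsize::Int) = kfold_indices(n, fld(n, testsize))

folds = kfold_indices(10, 3)
test_sizes = [length(test) for (_, test) in folds]
loo = leaveout_indices(10, 1)
```

The resulting index pairs could then be passed to something like `getobs` or `datasubset` to obtain the actual train and test subsets.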