A continuation of #3 and #13. Implementation is happening in the refactor0.5 branch.

Main goals

Dispatch on batch- or observation iteration
Use AbstractVector where sensible for its benefits
Allow working with native types where sensible (especially arrays)
Copy on demand

Issues so far

Using AbstractVector as a base class is not without problems. The main issue arises when we want infinite obs iterators and infinite batch iterators that sample randomly: they don't have a natural size, and indexing into them is ill-defined (i.e. my_infinite_batches[11] is not deterministic, and thus providing it is a mummer's farce).
Working with tuples should be convenient. Always decorating tuples within an iterator is not that useful, although at times it might be.

Observation Dimension

First of all I would like to once and for all dodge the "which array dimension denotes an observation" debate by allowing any to be specified (while defaulting to last).

To do so we can introduce types in a submodule that we can dispatch on

[x] ObsDim.First() Use first dimension to denote the observations
[x] ObsDim.Last() Use last dimension to denote the observations
[x] ObsDim.Constant(dim) Use given dimension dim to denote the observations

Note that these may not always make sense for types other than AbstractArray, in which case they are ignored when specified (maybe with a warning). All functions that in one way or another need to know which dimension denotes the observations take an optional parameter allowing an ::ObsDimension to be specified. As mentioned before, if nothing is specified explicitly, Last() is assumed.

This functionality, once implemented and proven to be working, will most likely move to LearnBase.
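
To make the dispatch idea concrete, here is a minimal sketch of how such a submodule could look. It is written in Julia 1.x syntax (so keywords differ from what the 0.5 branch would use), it only covers nobs for arrays, and it is an illustration under those assumptions, not the code in the branch.

```julia
# Sketch only: a submodule with types that functions can dispatch on.
module ObsDim
    abstract type ObsDimension end
    struct First <: ObsDimension end           # first dimension denotes the observations
    struct Last <: ObsDimension end            # last dimension denotes the observations
    struct Constant{DIM} <: ObsDimension end   # dimension DIM denotes the observations
    Constant(dim::Int) = Constant{dim}()
end

# If nothing is specified explicitly, Last() is assumed.
nobs(A::AbstractArray) = nobs(A, ObsDim.Last())
nobs(A::AbstractArray, ::ObsDim.Last) = size(A, ndims(A))
nobs(A::AbstractArray, ::ObsDim.First) = size(A, 1)
nobs(A::AbstractArray, ::ObsDim.Constant{DIM}) where {DIM} = size(A, DIM)

X = rand(4, 10, 3)
nobs(X)                        # 3
nobs(X, ObsDim.First())        # 4
nobs(X, ObsDim.Constant(2))    # 10
```

Encoding the dimension as a type parameter of Constant lets the compiler specialize on it, which is the main reason to prefer types over a plain integer argument here.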

Proposed Solution

To allow the use of the access pattern, the following functions have to be implemented for the desired type:

getobs(::MyType, indices, [::ObsDimension]) Returns the observations for the given indices in their native form
- indices can be an Int or an AbstractVector
- Optionally an ObsDimension parameter can be used. If no such method is implemented for your type then the parameter is discarded.
nobs(::MyType, [::ObsDimension]) Returns the total number of observations in the data structure
- Optionally an ObsDimension parameter can be used. If no such method is implemented for your type then the parameter is discarded.
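
As an illustration of the two functions above, the following sketch implements them for a made-up container. TextCorpus, LastDim and the stand-in ObsDimension type are hypothetical names introduced only for this example; the generic fallbacks at the end show one possible way the extra parameter could be discarded for types that do not care about it. Julia 1.x syntax is assumed.

```julia
# Stand-in for the real ObsDimension type (assumption for this example).
abstract type ObsDimension end
struct LastDim <: ObsDimension end

# Hypothetical custom data container.
struct TextCorpus
    docs::Vector{String}
end

# Total number of observations in the data structure.
nobs(c::TextCorpus) = length(c.docs)

# Observations in their native form; indices may be an Int or an AbstractVector.
getobs(c::TextCorpus, i::Int) = c.docs[i]
getobs(c::TextCorpus, idx::AbstractVector) = c.docs[idx]

# Generic fallbacks: if a type has no method that accepts an ObsDimension,
# the parameter is simply discarded.
nobs(data, ::ObsDimension) = nobs(data)
getobs(data, idx, ::ObsDimension) = getobs(data, idx)

corpus = TextCorpus(["a b", "c d e", "f"])
nobs(corpus)                       # 3
getobs(corpus, 2)                  # "c d e"
getobs(corpus, [1, 3])             # ["a b", "f"]
nobs(corpus, LastDim())            # 3, the ObsDimension is ignored
```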

Data Subset

A subset is really just a placeholder for what is underneath, which is a subset of some data, and should be treated like any other type, such as SubArray. It is not iterable and is also unrolled on tuples, i.e. @assert typeof(datasubset(my_tuple)) <: Tuple. Where possible, native types are used instead of DataSubset, for example @assert typeof(datasubset(my_array)) <: SubArray

The following functions make use of just DataSubset without other fancy things:

[x] datasubset Smart constructor for DataSubset
- for data::(Sub)Array this returns a SubArray with randomized indices into the last dimension
- internal functions usually work with this smart constructor instead of DataSubset directly
[x] shuffleobs Returns a datasubset where the ordering of the underlying observations is randomized
- This should only permute the indices.
[x] splitobs Returns a vector of datasubset results, each of which denotes a distinct subset of the observations in data
- e.g. used to split data into training- and test set
- at denotes a fraction of data in the first split.
- at can be an NTuple, which would result in N + 1 splits

When working with plain arrays, no custom types will appear anywhere.
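
The behaviour described above for plain arrays can be sketched as follows. The my_ prefixed functions are hypothetical stand-ins, not the package's functions; the sketch assumes Julia 1.x, observations along the last dimension, and that each entry of at is the fraction of observations going into the corresponding split.

```julia
using Random

# Lazy subset of the observations (last dimension); returns a SubArray.
my_datasubset(A::AbstractArray, idx) = selectdim(A, ndims(A), idx)

# Only the indices are permuted; the underlying data is not copied.
my_shuffleobs(A::AbstractArray) = my_datasubset(A, randperm(size(A, ndims(A))))

# N fractions in `at` result in N + 1 lazy splits.
function my_splitobs(A::AbstractArray, at::NTuple{N,Float64}) where {N}
    n = size(A, ndims(A))
    cuts = (0, round.(Int, cumsum(collect(at)) .* n)..., n)
    return [my_datasubset(A, (cuts[i] + 1):cuts[i + 1]) for i in 1:(N + 1)]
end
my_splitobs(A::AbstractArray, at::Float64) = my_splitobs(A, (at,))

X = reshape(collect(1.0:20.0), 2, 10)      # 10 observations with 2 features each
train, test = my_splitobs(X, 0.7)
size(train), size(test)                    # ((2, 7), (2, 3))
parts = my_splitobs(X, (0.5, 0.3))         # three splits with 5, 3 and 2 observations
shuffled = my_shuffleobs(X)                # a SubArray over X with permuted columns
```

Every result here is a SubArray into X, so no custom type shows up and nothing is copied until the caller asks for it.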

Data View

Similar to the new Images effort, I think it makes sense to provide special view types that allow treating data as a vector of observations or a vector of batches. These views are also lazy, meaning the subsets are created when indexed into.

DataView <: AbstractVector
    ObsView <: DataView
    BatchView <: DataView

These do not treat tuples in a special way, i.e. @assert typeof(ObsView(some_tuple)) <: ObsView

[x] eachobs alias for BufferGetObs(ObsView(...)) constructor
[x] eachbatch alias for BufferGetObs(BatchView(...)) constructor

Both iterate over the dataset once, but reuse a buffer in each iteration to provide the actual data (not a lazy subset), i.e. they allow memory-efficient iteration.
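
A rough sketch of the view idea, restricted to matrices whose columns are the observations. MyObsView and eachobs_buffered are hypothetical names; in particular, the buffered iteration is shown as a plain function taking a callback rather than as the BufferGetObs iterator mentioned above. Julia 1.x syntax is assumed.

```julia
# Lazy vector-of-observations view over a matrix (observations are columns).
struct MyObsView{E,M<:AbstractMatrix} <: AbstractVector{E}
    data::M
end
# The element type is taken from a sample view, so this sketch assumes
# that the matrix holds at least one observation.
MyObsView(A::AbstractMatrix) = MyObsView{typeof(view(A, :, 1)),typeof(A)}(A)

Base.size(v::MyObsView) = (size(v.data, 2),)
Base.getindex(v::MyObsView, i::Int) = view(v.data, :, i)   # subset is created when indexed

# Iterate over the dataset once, copying each observation into one reused buffer.
function eachobs_buffered(f, v::MyObsView)
    buf = similar(v.data, size(v.data, 1))
    for obs in v
        copyto!(buf, obs)
        f(buf)
    end
    return nothing
end

X = [1.0 2.0 3.0; 4.0 5.0 6.0]
ov = MyObsView(X)
length(ov)                                        # 3
ov[2]                                             # lazy view of column 2, equal to [2.0, 5.0]
sums = Float64[]
eachobs_buffered(b -> push!(sums, sum(b)), ov)    # sums == [5.0, 7.0, 9.0]
```

Because the view is an AbstractVector, things like ov[2:3], length and reverse come for free from Base.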

Data Iterator

Yet we would like to allow other obs providers and batch providers that do not lend themselves to being subtypes of AbstractVector (e.g. infinite obs).

DataIterator
    ObsIterator <: DataIterator
        RandomObs <: ObsIterator # Randomly sampled obs as datasubset (can be used for infinite iteration)
    BatchIterator <: DataIterator
        RandomBatches <: BatchIterator # Randomly sampled batches as datasubset (can be used for infinite iteration)

This way it is extensible for users, who are able to subtype ObsIterator etc. These types make no guarantees of being abstract vectors. This way we can later extend the hierarchy to provide stratified iterators. The remaining issue is that we don't have a common supertype for RandomObs and ObsView.

typealias AbstractObsIterator Union{ObsIterator, ObsView}
typealias AbstractBatchIterator Union{BatchIterator, BatchView}

Without multiple inheritance this seems like the cleanest solution at the moment. Given that ObsIterator can be subtyped, this should not have any negative consequences concerning extensibility.
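
For illustration, here is how the hierarchy and the union could look in Julia 1.x syntax, where const replaces typealias. MyRandomObs and describe are hypothetical names, ObsView is only stubbed as an abstract type to keep the example short, and the sampling is restricted to matrix columns.

```julia
abstract type DataIterator end
abstract type ObsIterator <: DataIterator end

# Stub of the view side of the hierarchy (a real ObsView would be concrete).
abstract type DataView <: AbstractVector{Any} end
abstract type ObsView <: DataView end

# Infinite iterator that yields randomly sampled observations as lazy views.
struct MyRandomObs{M<:AbstractMatrix} <: ObsIterator
    data::M
end
Base.IteratorSize(::Type{<:MyRandomObs}) = Base.IsInfinite()
Base.iterate(r::MyRandomObs, state = nothing) =
    (view(r.data, :, rand(1:size(r.data, 2))), nothing)

# Common "supertype" expressed as a union.
const AbstractObsIterator = Union{ObsIterator,ObsView}
describe(::AbstractObsIterator) = "iterates over observations"

X = rand(2, 10)
sample = collect(Iterators.take(MyRandomObs(X), 5))   # 5 randomly chosen columns
describe(MyRandomObs(X))                              # "iterates over observations"
```

Note that MyRandomObs defines neither length nor getindex, which sidesteps the ill-defined indexing problem raised under "Issues so far".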

Other patterns

Down the road we may come up with a generalization of resampling patterns (for example when introducing a stratified version of k-folds), but for now it seems sufficient to not give KFolds a supertype.

[x] KFolds Iterator over k pairs of (train,test) splits, where each split becomes the testset once
- the sizes of the folds may differ by up to 1 observation depending on whether the total number of observations is divisible by k
- fusion of current KFolds and LabeledKFolds
[x] kfolds(data, k, obsdim) Lowercase API for KFolds
[x] leaveout(data, size, obsdim) Leave N out API for KFolds
- defaults to leave one out
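
The fold-size rule above can be sketched on bare indices. kfold_indices and leaveout_indices are hypothetical helpers that work on the number of observations n instead of on data, and the mapping from the leave-out size to k via cld is an assumption of this sketch, not necessarily what the branch does. Julia 1.x syntax is assumed.

```julia
# (train, test) index pairs for k folds over n observations;
# each fold becomes the test set exactly once.
function kfold_indices(n::Int, k::Int)
    1 < k <= n || throw(ArgumentError("k must satisfy 1 < k <= n"))
    base, extra = divrem(n, k)       # the first `extra` folds hold one more observation
    folds = Vector{Tuple{Vector{Int},UnitRange{Int}}}()
    start = 1
    for i in 1:k
        stop = start + base + (i <= extra ? 1 : 0) - 1
        test = start:stop
        push!(folds, (setdiff(1:n, test), test))
        start = stop + 1
    end
    return folds
end

# Leave-`sz`-out built on top of the same helper; defaults to leave-one-out.
leaveout_indices(n::Int, sz::Int = 1) = kfold_indices(n, cld(n, sz))

folds = kfold_indices(10, 3)
[length(test) for (train, test) in folds]    # [4, 3, 3]
length(leaveout_indices(5))                  # 5 folds with one test observation each
```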

JuliaML / MLDataUtils.jl

Data Access Pattern in 0.5 [second iteration] #15