JuliaML / MLDataUtils.jl

Utility package for generating, loading, splitting, and processing Machine Learning datasets
http://mldatautilsjl.readthedocs.io/

Data Access Pattern in 0.5 [second iteration] #15

Closed Evizero closed 7 years ago

Evizero commented 8 years ago

A continuation of #3 and #13. Implementation is happening in the refactor0.5 branch.

Main goals

Issues so far

Observation Dimension

First of all, I would like to dodge the "which array dimension denotes an observation" debate once and for all by allowing any dimension to be specified (while defaulting to the last).

To do so we can introduce types in a submodule that we can dispatch on

Note that those may not always make sense for types other than AbstractArray, in which case the argument is ignored when specified (maybe with a warning). All functions that in one way or another need to know which dimension denotes the observations take an optional parameter that allows an ::ObsDimension to be specified. As mentioned before, if nothing is specified explicitly, Last() is assumed.
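To make the dispatch idea concrete, here is a minimal sketch written for current Julia (1.x). Only ObsDimension and Last() are named in this issue; the module name ObsDimSketch, the First and Constant types, and the helpers obsdim_index and nobs_sketch are assumptions made up for illustration and may not match what the branch actually contains.

```julia
# Hypothetical sketch; apart from ObsDimension and Last, all names are assumptions.
module ObsDimSketch
    abstract type ObsDimension end
    struct First <: ObsDimension end          # observations along dimension 1
    struct Last <: ObsDimension end           # observations along the last dimension
    struct Constant{DIM} <: ObsDimension end  # observations along a fixed dimension DIM
end

# Resolve the marker type to a concrete dimension index for a given array.
obsdim_index(::AbstractArray, ::ObsDimSketch.First) = 1
obsdim_index(A::AbstractArray, ::ObsDimSketch.Last) = ndims(A)
obsdim_index(::AbstractArray, ::ObsDimSketch.Constant{DIM}) where {DIM} = DIM

# Number of observations; the dimension argument is optional and defaults to Last().
function nobs_sketch(A::AbstractArray,
                     obsdim::ObsDimSketch.ObsDimension = ObsDimSketch.Last())
    return size(A, obsdim_index(A, obsdim))
end

X = rand(4, 150)                          # 4 features, 150 observations
nobs_sketch(X)                            # uses the last dimension
nobs_sketch(X, ObsDimSketch.First())      # treats rows as observations
```

Because the dimension is encoded in a type, the choice is resolved by dispatch rather than by a runtime branch on an integer.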

This functionality, once implemented and proven to work, will most likely move to LearnBase.

Proposed Solution

To allow the use of the access pattern the following functions have to be implemented for the desired type

Data Subset

Subsets are really just a placeholder for what is underneath, which is a subset of some data, and should be treated like any other type, such as SubArray. A DataSubset is not iterable, and datasubset is unrolled on tuples, i.e. @assert typeof(datasubset(my_tuple)) <: Tuple. Where possible, native types are used instead of DataSubset, for example @assert typeof(datasubset(my_array)) <: SubArray

The following functions make use of just DataSubset without other fancy things:

When working with plain arrays, no custom types will appear anywhere.
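The array and tuple behaviour described above could look roughly like the following sketch. The function name datasubset_sketch is hypothetical, the observation dimension is hard-coded to the last one, and the real datasubset may handle indices and other container types differently.

```julia
# Hypothetical sketch; `datasubset_sketch` is an illustrative name, not the package function.

# Arrays: return a native SubArray over the selected observations (last dimension).
function datasubset_sketch(A::AbstractArray, idx = 1:size(A, ndims(A)))
    leading = ntuple(_ -> Colon(), ndims(A) - 1)  # keep all non-observation dimensions
    return view(A, leading..., idx)
end

# Tuples: unroll, applying the same indices to every element.
datasubset_sketch(t::Tuple) = map(datasubset_sketch, t)
datasubset_sketch(t::Tuple, idx) = map(a -> datasubset_sketch(a, idx), t)

X = rand(4, 150)        # features
y = collect(1:150)      # targets
sub = datasubset_sketch((X, y), 1:10)
```

Here sub is a plain Tuple of SubArray objects, so no custom type shows up when the data consists of arrays only.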

Data View

Similar to the new Images effort, I think it makes sense to provide special view types that allow treating data as a vector of observations or a vector of batches. These views are also lazy, meaning the subsets are created when indexed into.

DataView <: AbstractVector
    ObsView <: DataView
    BatchView <: DataView

These do not treat tuples in a special way, i.e. @assert typeof(ObsView(some_tuple)) <: ObsView

Both iterate over the dataset once, but reuse a buffer on each iteration to provide the actual data (not a lazy subset), i.e. they allow for memory-efficient iteration.
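As an illustration of the lazy view idea from this section, the sketch below wraps a matrix in an AbstractVector whose elements are created only when indexed. The type ObsViewSketch is hypothetical and restricted to matrices with observations in the last dimension; the proposed ObsView is meant to be more general.

```julia
# Hypothetical sketch; the proposed ObsView may differ in fields and type parameters.
struct ObsViewSketch{T<:AbstractMatrix} <: AbstractVector{AbstractVector}
    data::T
end

# One element per observation (column).
Base.size(v::ObsViewSketch) = (size(v.data, 2),)
Base.IndexStyle(::Type{<:ObsViewSketch}) = IndexLinear()

# The subset is created lazily, at the moment the view is indexed.
Base.getindex(v::ObsViewSketch, i::Int) = view(v.data, :, i)

X = reshape(collect(1.0:12.0), 3, 4)   # 3 features, 4 observations
ov = ObsViewSketch(X)

total = 0.0
for obs in ov                          # iteration comes for free from AbstractVector
    global total += sum(obs)
end
```

Subtyping AbstractVector means length, indexing, and iteration all work through the standard array interface, which is the main appeal of the view types.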

Data Iterator

Yet we would also like to allow other observation providers and batch providers that do not lend themselves to being subtypes of AbstractVector (e.g. an infinite stream of observations).

DataIterator
    ObsIterator <: DataIterator
        RandomObs <: ObsIterator # Randomly sampled obs as datasubset (can be used for infinite iteration)
    BatchIterator <: DataIterator
        RandomBatches <: BatchIterator # Randomly sampled batches as datasubset (can be used for infinite iteration)

This way it is extensible for users, who are able to subtype from ObsIterator etc. These types make no guarantee of being abstract vectors. This way we can later extend the hierarchy to provide stratified iterators. The remaining issue is that we don't have a common supertype for RandomObs and ObsView.
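A rough sketch of what an infinite, randomly sampling observation iterator might look like in current Julia follows. ObsIteratorSketch and RandomObsSketch are hypothetical stand-ins for the types in the tree above, and the explicit rng field is an assumption added so the example is reproducible.

```julia
using Random

# Hypothetical stand-ins for ObsIterator and RandomObs; not the package's definitions.
abstract type ObsIteratorSketch end

struct RandomObsSketch{T<:AbstractMatrix,R<:AbstractRNG} <: ObsIteratorSketch
    data::T
    rng::R
end

# The iterator never terminates, so it cannot be an AbstractVector.
Base.IteratorSize(::Type{<:RandomObsSketch}) = Base.IsInfinite()
Base.IteratorEltype(::Type{<:RandomObsSketch}) = Base.EltypeUnknown()

# Each step draws one random observation (column) as a lazy view.
function Base.iterate(it::RandomObsSketch, state = nothing)
    i = rand(it.rng, 1:size(it.data, 2))
    return view(it.data, :, i), nothing
end

X = rand(3, 10)
it = RandomObsSketch(X, MersenneTwister(42))
samples = collect(Iterators.take(it, 5))   # bound the infinite iterator explicitly
```

A user-defined stratified iterator would subtype the same abstract type and provide its own iterate method.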

typealias AbstractObsIterator Union{ObsIterator, ObsView}
typealias AbstractBatchIterator Union{BatchIterator, BatchView}

Without multiple inheritance this seems like the cleanest solution at the moment. Given that ObsIterator can be subtyped, this should not have any negative consequences concerning extensibility.
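The typealias keyword is Julia 0.5 syntax; from Julia 0.6 onward the same alias is written as a const assignment. The sketch below shows how such a Union alias dispatches and why user subtypes are still covered. All Demo* names, UserStratifiedObs, and describe are hypothetical stand-ins so the snippet runs on its own.

```julia
# Stand-in types so the snippet is self-contained; all names here are hypothetical.
abstract type DemoObsIterator end
struct DemoRandomObs <: DemoObsIterator end

struct DemoObsView <: AbstractVector{Int}
    n::Int
end
Base.size(v::DemoObsView) = (v.n,)
Base.getindex(v::DemoObsView, i::Int) = i

# Julia 0.6+ spelling of `typealias AbstractObsIterator Union{...}`.
const DemoAbstractObsIterator = Union{DemoObsIterator, DemoObsView}

# One method covers both branches of the hierarchy.
describe(::DemoAbstractObsIterator) = "provides observations"
describe(::Any) = "something else"

# A user-defined iterator is covered by the alias without touching the Union.
struct UserStratifiedObs <: DemoObsIterator end
```

Since the abstract iterator type is one member of the Union, any later subtype of it matches the alias automatically; only new AbstractVector-based views would require changing the alias itself.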

Other pattern

Down the road we may come up with a generalization of resampling patterns (for example when introducing a stratified version of k-folds), but for now it seems sufficient not to give KFolds a supertype.

Evizero commented 7 years ago

moved to and implemented in https://github.com/JuliaML/MLDataPattern.jl