JuliaML / MLDataUtils.jl

Utility package for generating, loading, splitting, and processing Machine Learning datasets
http://mldatautilsjl.readthedocs.io/

WIP: Outline Data Access Patterns such as KFolds and MiniBatches #4

Closed Evizero closed 8 years ago

Evizero commented 8 years ago

After a lot of experimenting I found a way to realize #3 that I think works.

Basically, the way this would work is that a subtype of DataIterator is just a parameter-specification container, with type parameters for the types of the features and targets, respectively. The actual magic is in the implementation of the Base.next method, which can be completely customized for specific types of features and targets. It works for both labeled and unlabeled data.

Example:

for batch_X in MiniBatches(X; size = 10)
     # ... train unsupervised model on batch here ...
end

for (batch_X, batch_y) in MiniBatches(X, y; size = 10)
     # ... train supervised model on batch here ...
end
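To make the mechanics concrete, here is a minimal sketch of what such an iterator could look like under the hood (the names, fields, and the nobs helper are illustrative guesses for a Julia-0.4-style iterator, not the actual code of this PR):

abstract DataIterator

# The iterator itself is just a parameter container; all the actual
# work happens in the iteration protocol below.
immutable MiniBatches{TFeatures} <: DataIterator
    features::TFeatures
    size::Int
    count::Int   # number of full batches in one pass over the data
end

# hypothetical helper: number of observations (here: columns) in the data
nobs(X::AbstractMatrix) = size(X, 2)

MiniBatches(X; size = 10) = MiniBatches(X, size, div(nobs(X), size))

Base.start(::MiniBatches) = 1
Base.done(iter::MiniBatches, i) = i > iter.count
function Base.next(iter::MiniBatches, i)
    r = (i - 1) * iter.size + 1 : i * iter.size
    iter.features[:, r], i + 1   # generic fallback: copies the batch
end

Since all the logic lives in the iteration methods, specializing Base.next on the type parameters is all it takes to support new kinds of features and targets.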

So far I have implemented a MiniBatches sampler such that it works out-of-the-box for AbstractArrays by copying the elements, and for concrete Arrays by using sub/slice. (Note that, in my opinion, the MiniBatches iterator should not do any shuffling by itself.)
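The dispatch idea could be sketched like so (illustrative signatures; getobs is the accessor discussed further below):

getobs(X::AbstractMatrix, r) = X[:, r]    # generic fallback: copies
getobs(X::Matrix, r) = sub(X, :, r)       # concrete Array: no-copy view
getobs(X::AbstractVector, r) = X[r]       # generic fallback: copies
getobs(X::Vector, r) = slice(X, r)        # concrete Array: no-copy view

Because dispatch picks the most specific applicable method, custom types automatically fall back to copying unless they provide a cheaper specialization.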

Now, to reiterate my point: the really cool thing about this approach is that I could now implement a concrete Base.next in my image augmentation package for a DirImageSource or any other custom data container, and the MiniBatches iterator would work the same way, seamlessly. The approach is thus easily extensible by any user package.
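As a hypothetical sketch (reusing the illustrative definitions from above; load_image stands in for whatever the user package provides), plugging in such a custom container might look like this:

type DirImageSource
    dir::UTF8String
    paths::Vector{UTF8String}
end

nobs(src::DirImageSource) = length(src.paths)

function Base.next(iter::MiniBatches{DirImageSource}, i)
    r = (i - 1) * iter.size + 1 : i * iter.size
    # load (and potentially augment) only the images of this batch
    batch = [load_image(joinpath(iter.features.dir, p)) for p in iter.features.paths[r]]
    batch, i + 1
end

MiniBatches itself needs no changes; iteration simply delegates to the specialized method.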

This PR is not complete, but it is far enough to gather feedback. Thoughts?

tbreloff commented 8 years ago

I think this is a step in the right direction. I need to think through some of the access patterns a little more to understand how they fit in.

I really like the idea that you can implement one method to get an observation from a custom data structure, and all the iteration patterns come for free.

Thanks for taking this on... I'll review more thoroughly tomorrow I think.


Evizero commented 8 years ago

One thing that could use a better name is the function getobs, which should return the specified observations in the most efficient manner possible.

Evizero commented 8 years ago

I included basic docstring documentation for MiniBatches, as well as the option to process the batches in a random order (without a performance penalty). However, observations within a batch will in general still be adjacent to each other (for vectors and arrays), which is the desired behaviour. For cases in which even the batch content should be random, I will implement a RandomSampler, which will of course come with a performance penalty.

for (batch_X, batch_y) in MiniBatches(X, y; random_order = true)
     # ... train supervised model on batch here ...
end

Evizero commented 8 years ago

Also note that MiniBatches will go through the data just once, effectively denoting one epoch. In other words, its purpose is to conveniently iterate over a dataset in equally sized blocks, where the order in which those blocks are returned can be randomized.
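A training loop over several epochs would thus simply wrap it, along these lines (sketch, assuming the keywords shown earlier can be combined):

for epoch in 1:10
    for (batch_X, batch_y) in MiniBatches(X, y; size = 10, random_order = true)
        # ... update the supervised model on this batch ...
    end
end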

Evizero commented 8 years ago

So at this point I have the implementations for MiniBatches and RandomSamples in place, both of which are documented in detail in the code and in the README.
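For reference, usage of RandomSamples could look something like the following (the count keyword is my guess at the parameter name; see the README for the actual signature):

for (x, y) in RandomSamples(X, y; count = 100)
    # ... train on a single randomly drawn observation ...
end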

I would be interested in opinions on what is there so far.

tbreloff commented 8 years ago

💯 This looks great!

Evizero commented 8 years ago

I am getting close to a consistent state. Unless there are any complaints or wishes, all that is left for me to do before I feel comfortable merging and tagging this PR is some code optimization.

The first two sections of the README (which describe this PR in detail) are pretty much done at this point: https://github.com/JuliaML/MLDataUtils.jl/tree/datasampler

tbreloff commented 8 years ago

Documentation looks great. Very nice work! I think it looks merge-ready.