Closed Evizero closed 8 years ago
I think this is a step in the right direction. I need to think through some of the access patterns a little more to understand how they fit in.
I really like the idea that you can implement one method to get an observation from a custom data structure, and all the iteration patterns come for free.
Thanks for taking this on... I'll review more thoroughly tomorrow I think.
On Apr 10, 2016, at 10:01 AM, Christof Stocker notifications@github.com wrote:
After a lot of experimenting I found a way to realize #3 that I think works.
Basically, the way this would work is that a subtype of DataSampler is just a parameter specification container that has type-parameters for the types of the features and targets respectively. The actual magic is then generally in the implementation of the Base.next method, which can be completely customized for specific types of features and targets. It works for labeled and unlabelled data.
Example:
for batch_X in MiniBatches(X; batchsize = 10)
# ... train unsupervised model on batch here ...
end
for (batch_X, batch_y) in MiniBatches(X, y; batchsize = 10)
# ... train supervised model on batch here ...
end

Thus far I implemented a MiniBatches sampler such that it works out of the box for AbstractArrays by copying the elements, and for concrete Arrays by using sub/slice. (Note that I think the MiniBatches iterator by itself should not do any shuffling.)
Now to reiterate my point, the really cool thing about this approach is that I could now implement a concrete Base.next in my image augmentation package for a DirImageSource or any other custom data container, and the MiniBatches iterator would seamlessly work the same way. Thus the approach is easily extensible by any user package.
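To illustrate the extension idea, here is a hypothetical, self-contained sketch (not the PR's actual code, and all names are made up). The PR itself targets the old Base.next iteration protocol; this version uses the modern Base.iterate equivalent so it runs as written. The point it demonstrates is the same: a custom container only implements observation access, and batch iteration comes for free.

```julia
# Hypothetical sketch (names are illustrative, not the PR's API).
struct LazyRangeSource            # stand-in for e.g. a DirImageSource
    n::Int
end

nobs(s::LazyRangeSource) = s.n
getobs(s::LazyRangeSource, idx) = [10i for i in idx]   # "load" observations

struct SimpleMiniBatches{T}
    source::T
    batchsize::Int
end

# Implementing the two access methods above is all a custom container
# needs; the batch iteration logic below never touches the container
# internals directly.
function Base.iterate(m::SimpleMiniBatches, start = 1)
    start > nobs(m.source) && return nothing
    stop = min(start + m.batchsize - 1, nobs(m.source))
    return getobs(m.source, start:stop), stop + 1
end

for batch in SimpleMiniBatches(LazyRangeSource(5), 2)
    println(batch)   # [10, 20], then [30, 40], then [50]
end
```

Any other container (a directory of images, a database cursor, ...) would plug in the same way by defining its own nobs/getobs-style methods.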
This PR is not complete, but it is far enough to gather feedback. Thoughts?
You can view, comment on, or merge this pull request online at:
https://github.com/JuliaML/MLDataUtils.jl/pull/4
Commit Summary
refactor tests for BaseTestNext
Outline DataSampler on example of MiniBatches
File Changes
M src/MLDataUtils.jl (6)
A src/samplers.jl (137)
M test/REQUIRE (1)
M test/runtests.jl (23)
M test/tst_datasets.jl (73)
M test/tst_feature_scaling.jl (65)
M test/tst_noisy_function.jl (33)
A test/tst_samplers.jl (190)
Patch Links:
https://github.com/JuliaML/MLDataUtils.jl/pull/4.patch
https://github.com/JuliaML/MLDataUtils.jl/pull/4.diff
One thing that could use a better name is the function getobs, which should return the specified observations in the most efficient manner possible.
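The dispatch idea behind such a getobs function could look something like this — a sketch under my own assumptions, not the PR's implementation. It uses the modern `view` where the PR era would have used `sub`/`slice`, and it assumes observations are stored as columns:

```julia
# Sketch of getobs-style dispatch: a generic fallback that copies, plus
# a specialized no-copy method for dense arrays with contiguous indices.
getobs(A::AbstractVector, idx) = A[idx]               # generic: copies
getobs(A::AbstractMatrix, idx) = A[:, idx]            # observations = columns
getobs(A::Matrix, idx::UnitRange) = view(A, :, idx)   # contiguous: no copy

X = collect(reshape(1.0:12.0, 3, 4))   # 3 features × 4 observations
b = getobs(X, 2:3)                     # a view into columns 2 and 3
```

The "most efficient manner possible" then falls out of multiple dispatch: generic containers get the safe copying fallback, while concrete dense arrays with contiguous index ranges get a zero-copy view.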
Included basic docstring documentation for MiniBatches and the option to process the batches in a random order (without performance penalty). However, observations within a batch will in general still be adjacent to each other (for vectors and arrays), which is the desired behaviour. For cases in which even the batch content should be random I will implement a RandomSampler, which will of course carry a performance penalty.
for (batch_X, batch_y) in MiniBatches(X, y; random_order = true)
# ... train supervised model on batch here ...
end
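A minimal way to get a random batch order without a performance penalty is to permute only the batch start indices, so each batch itself stays a contiguous block. The following is a sketch under that assumption (batch_starts is a hypothetical helper, not the package's code):

```julia
using Random

# Shuffle only the order of the batches; observations inside each batch
# remain adjacent, so contiguous no-copy views stay possible.
function batch_starts(n::Int, batchsize::Int; random_order::Bool = false)
    starts = collect(1:batchsize:n)
    random_order && shuffle!(starts)
    return starts
end

X = collect(1:10)
for s in batch_starts(length(X), 3; random_order = true)
    batch = X[s:min(s + 2, length(X))]   # each batch is still contiguous
end
```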
Also note that MiniBatches will go through the data just once, effectively denoting an epoch. In other words, its purpose is to conveniently iterate over some dataset in equally sized blocks, where the order in which those blocks are returned can be randomized.
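Since one full pass equals one epoch, training for several epochs is simply an outer loop around the batch iterator. A self-contained sketch (eachbatch here is a hypothetical stand-in for a batch iterator, not the package's API):

```julia
# Hypothetical stand-in: yields contiguous batches of size bs.
eachbatch(X, bs) = (X[s:min(s + bs - 1, length(X))] for s in 1:bs:length(X))

# One inner pass = one epoch; repeat the pass nepochs times.
function run_epochs(X, bs, nepochs)
    nseen = 0
    for epoch in 1:nepochs
        for batch in eachbatch(X, bs)
            nseen += length(batch)   # stand-in for a training step
        end
    end
    return nseen
end
```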
So at this point I have the implementations for MiniBatches and RandomSamples in place. Both have detailed documentation in the code and also in the README.
I would be interested in opinions on what's there so far.
:100: This looks great!
I am getting close to a consistent state. Unless there are any complaints or wishes, all that is left for me to do before I feel comfortable merging and tagging this PR is some code optimization.
The first two sections of the README (which describe this PR in detail) are pretty much done at this point: https://github.com/JuliaML/MLDataUtils.jl/tree/datasampler
Documentation looks great. Very nice work! I think it looks merge-ready.