JuliaML / MLDataUtils.jl

Utility package for generating, loading, splitting, and processing Machine Learning datasets
http://mldatautilsjl.readthedocs.io/

Data Access Pattern in 0.5 #13

Closed Evizero closed 8 years ago

Evizero commented 8 years ago

Outdated! Take a look at #15.

List of functionality to be implemented or adapted. This is a continuation of #3 and is based on the changes Tom proposed in datasubsets.jl at StochasticOptimization.jl.

General design decisions

Evizero commented 8 years ago

I think we should take a similar approach to DataSubset as the new ImageCore does with ColorView (see https://github.com/JuliaImages/ImageCore.jl/blob/master/src/colorchannels.jl#L233-L236)

Concretely I mean that

What that means is that functions such as batches and splitobs no longer need to return a DataSubset for plain arrays, which many use cases don't really need. eachobs, on the other hand, would always return a DataSubset.

To make this work we need to put Base.get(::DataSubset) aside in favour of getobs(::DataSubset) (note: no second parameter), because getobs can then be the identity function for plain arrays.
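A minimal sketch of this idea, written for current Julia (1.x). The wrapper type `LazySubset` and the local `getobs` definitions are hypothetical stand-ins for illustration, not the package's actual code, and the sketch assumes the last array dimension indexes observations.

```julia
# Hypothetical stand-in for DataSubset; NOT the package's definition.
struct LazySubset{T,I}
    data::T
    indices::I
end

# Plain arrays are already materialized, so getobs is the identity.
getobs(A::AbstractArray) = A

# A lazy subset materializes its observations on request
# (assumption: the last dimension indexes observations).
getobs(s::LazySubset{<:AbstractMatrix}) = s.data[:, s.indices]
getobs(s::LazySubset{<:AbstractVector}) = s.data[s.indices]

X = rand(2, 10)
s = LazySubset(X, 3:5)

getobs(X) === X     # true: no copy and no wrapper for a plain array
size(getobs(s))     # (2, 3)
```

With a one-argument getobs, calling code can treat plain arrays and wrapped subsets the same way, which is what lets batches and splitobs hand back plain arrays.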

Evizero commented 8 years ago

One small issue to be aware of: because trailing dimensions are dropped, nobs breaks once we are down to a single observation.

X = rand(2,10)
y = rand(10)

s = DataSubset((X,y))

print(rand(s)) # => ([0.0757387,0.54726],0.08468408493279833)
print(nobs(rand(s))) # => 2 

This is because the single observation with two features is now interpreted as two observations of one feature.
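The ambiguity can be reproduced without the package. The sketch below uses a hypothetical helper `count_obs` that follows the convention assumed in this thread (the last array dimension indexes observations). It is not the package's nobs.

```julia
# Hypothetical helper (NOT the package's nobs): count observations
# along the last dimension of the array.
count_obs(A::AbstractArray) = size(A, ndims(A))

X  = rand(2, 10)        # 2 features, 10 observations
x  = X[:, 3]            # one observation; the trailing dimension is dropped
xv = view(X, :, 3:3)    # the same observation with the trailing dimension kept

count_obs(X)    # 10
count_obs(x)    # 2  (the two features are counted as observations)
count_obs(xv)   # 1
```

Indexing with a range such as 3:3 keeps the observation dimension, so the count stays correct. Whether that is the right fix for the package is a separate design question.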

Evizero commented 8 years ago

Currently shuffled seems a bit inconsistent. For now I won't force a DataSubset on shuffled.

Since we can't write

for (x,y) in (X,Y)
    #...
end

but have to call eachobs instead, like this:

for (x,y) in eachobs(X,Y) # which calls eachobs((X,Y))
    #...
end

it makes sense to do the same for shuffled:

for (x,y) in eachobs(shuffled(X,Y))
    #...
end

This way we can allow shuffled to return a SubArray when the arguments are plain Arrays.
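A minimal sketch of what returning a SubArray could look like for plain arrays, written for current Julia (1.x). The function `shuffled_view` is hypothetical and is not the package's shuffled. It assumes the last dimension indexes observations and uses one shared permutation so that features and targets stay aligned.

```julia
using Random  # shuffle is in the Random standard library on Julia 1.x

# Hypothetical helper (NOT the package's shuffled): return lazy views that
# visit the observations of X and y in the same random order.
function shuffled_view(X::AbstractMatrix, y::AbstractVector)
    perm = shuffle(1:length(y))           # one permutation shared by both
    return view(X, :, perm), view(y, perm)
end

X = rand(2, 10)
y = X[1, :]                    # targets tied to the first feature, to check alignment
Xs, ys = shuffled_view(X, y)

Xs isa SubArray                # true: a view, not a copy or custom wrapper
ys == Xs[1, :]                 # true: observations and targets still match
```

Since both results are ordinary views, they can be passed on to something like eachobs without any special casing for shuffled data.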

tbreloff commented 8 years ago

Sounds fine

On Monday, October 17, 2016, Christof Stocker wrote:

> Currently shuffled seems a bit inconsistent. For now I won't force a DataSubset on shuffled [...]

Evizero commented 8 years ago

I encountered some issues with DataSubset(::Tuple), which were probably the reason you introduced DataSubsets. Otherwise for (x,y) in batches(X,Y) ... would not work properly, since next returns a DataSubset rather than a tuple.

I solved this in a nice way, I think. Tuples are now always unrolled. For example:

X = rand(5,10) # an Array will NOT get boxed into a DataSubset
y = sprand(10, .5) # a SparseVector does get boxed

typeof(datasubset((X,y), 1:5))
# => Tuple{SubArray{Float64,...},MLDataUtils.DataSubset{SparseVector,...}}

This way DataSubset serves as a special fallback for SubArray.

So far this seems to solve all problems.
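The unrolling can be sketched with plain multiple dispatch, written for current Julia (1.x). Everything here is a hypothetical stand-in: `BoxedSubset` plays the role of DataSubset and `subset` the role of datasubset. The real implementation may differ.

```julia
using SparseArrays  # sprand is in the SparseArrays standard library on Julia 1.x

# Hypothetical stand-in for DataSubset; NOT the package's definition.
struct BoxedSubset{T,I}
    data::T
    indices::I
end

# Plain Arrays become SubArrays (assumption: last dimension = observations).
subset(A::Array{T,2}, idx) where {T} = view(A, :, idx)
subset(A::Array{T,1}, idx) where {T} = view(A, idx)

# Anything else falls back to the boxed wrapper.
subset(data, idx) = BoxedSubset(data, idx)

# Tuples are unrolled: each element is subset on its own.
subset(t::Tuple, idx) = map(d -> subset(d, idx), t)

X = rand(5, 10)
y = sprand(10, 0.5)
parts = subset((X, y), 1:5)

parts isa Tuple             # true, so `for (x, y) in ...` can destructure it
parts[1] isa SubArray       # true
parts[2] isa BoxedSubset    # true
```

Because the result is a plain tuple rather than a wrapper around a tuple, destructuring in a for loop works without a separate container type.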