JuliaML / MLDataUtils.jl

Utility package for generating, loading, splitting, and processing Machine Learning datasets
http://mldatautilsjl.readthedocs.io/

WIP: Data Access Pattern in 0.5 #14

Closed · Evizero closed this 8 years ago

Evizero commented 8 years ago

WIP implementation for #13


I have updated DataSubset and its tests to 0.5 so far.

Next I'll port Tom's new verbs (in this PR)

Also I shall try to update the package documentation a bit and avoid breaking changes if I can

tbreloff commented 8 years ago

@Evizero overall this is coming along really nicely... thanks for putting in the time

Evizero commented 8 years ago

I had to introduce an immutable type DataIterator to make eachobs work. This is a consequence of not boxing Tuple into DataSubset. The alternative would be to call batches with batch size 1, which seems like a waste, since it allocates an array of N DataSubsets

tbreloff commented 8 years ago

I saw... certainly better than using DataSubsets!

Evizero commented 8 years ago

I think I could twist this one type DataIterator so that it could also serve as a batch stream. Then we could offer eachbatch as well, which would be a true iterator, while batches would stay eager and remain as it is now

coveralls commented 8 years ago

Coverage Status

Coverage decreased (-57.8%) to 39.706% when pulling 7a33ed582bfad9131e740d67e049e69f1e55c9ae on refactor0.5 into d5503fccb665b3ad37b588536896d203695ae3c8 on master.

Evizero commented 8 years ago

shoo coveralls. come back when I am done here. You make me look bad!

shoo

Evizero commented 8 years ago

Ok, so the stuff I have implemented so far works pretty well and is type stable. KFolds might be a little tricky to port, but other than that it should be smooth sailing now.

tbreloff commented 8 years ago

Nice!


tbreloff commented 8 years ago

Allowing both getobs and viewobs is a nice idea (as long as viewobs is the default!)

Evizero commented 8 years ago

I changed how getobs works a bit.

Evizero commented 8 years ago

as long as viewobs is the default!

They are not really used internally except by the function eachobs, whose name suggests it would getobs every index. Edit: no longer. So it is just user-facing code. All the functions such as batches and shuffled remain as lazy as ever and call datasubset directly (which, to be fair, is the same as viewobs at the moment, but that may change).

tbreloff commented 8 years ago

remain lazy as ever and call datasubset directly

That's what I mean... I want to make sure everything is lazy unless you explicitly ask for copies through collect/getobs.

tbreloff commented 8 years ago

I checked out refactor0.5, and commented out the iterators in StochasticOptimization. I'll try to get the surrounding ecosystem up to speed and tested.

Evizero commented 8 years ago

nice! There is still some new stuff missing and I will need to work on a different project for a few hours soon. But I will resume this PR later today.

tbreloff commented 8 years ago

No worries... I think I'm the only one using this stuff for any real projects. (I mean... I hope I'm the only one :smile: )

tbreloff commented 8 years ago

Question... can DataIterator and DataSubset at least subtype a common abstract type, say AbstractSubset? They seem to have very similar behavior... maybe I don't quite understand their distinction yet

Evizero commented 8 years ago

Well, to be fair, DataIterator doesn't need getobs and nobs, and DataSubset doesn't need to support the iterator pattern. These things are only implemented because they can be. DataSubset alone is not enough, since there would then be no reliable way to implement eachobs for tuples, or eachbatch for that matter.

tbreloff commented 8 years ago

DataSubset alone is not enough since there is no reliable way to implement eachobs then for tuples, or eachbatch for that matter.

I'm still confused here... I thought my implementation in StochasticOptimization handled any type just fine.

Evizero commented 8 years ago

yea, but you boxed the Tuple into a DataSubset, while this doesn't. This is a consequence of allowing types to offer their own views if they have them, which is why, if one works with plain Array variables, one never has to deal with DataSubset at all

tbreloff commented 8 years ago

This does a copy, which I was really hoping to avoid:

julia> using MLDataUtils
INFO: Recompiling stale cache file /home/tom/.julia/lib/v0.5/MLDataUtils.ji for module MLDataUtils.

julia> x,y = rand(10,2), rand(10)
(
[0.523045 0.975809; 0.963106 0.955312; … ; 0.215673 0.988101; 0.555086 0.376569],

[0.582727,0.160317,0.983288,0.0741725,0.789932,0.841186,0.119106,0.534921,0.0328525,0.0659002])

julia> for o in eachobs(x)
           @show typeof(o), o
       end
(typeof(o),o) = (Array{Float64,1},[0.523045,0.963106,0.0167039,0.928771,0.726239,0.656165,0.976564,0.350933,0.215673,0.555086])
(typeof(o),o) = (Array{Float64,1},[0.975809,0.955312,0.85547,0.578021,0.959107,0.769584,0.114884,0.31311,0.988101,0.376569])
Evizero commented 8 years ago

this is exactly our discussion above, which I will now change to not copy

tbreloff commented 8 years ago

I had to introduce a immutable type DataIterator to make eachobs work. This is a consequence of not boxing Tuple into DataSubset.

Reading back through the conversation, I think this is what I don't yet understand. My implementation worked well (I thought) by boxing the tuple inside the DataSubset... I don't see why a tuple of DataSubsets would be more performant. What functionality didn't work?

Evizero commented 8 years ago

I would need to either box everything into a DataSubset again, or overload functions for tuples, which would interfere with outside code when MLDataUtils is loaded

tbreloff commented 8 years ago

Are you talking specifically about the case of eachbatch-style iteration? You mean that each batch would need to be re-wrapped with a DataSubset, but you want to return a tuple of views?

Evizero commented 8 years ago

I am saying that I would like the following

X, Y = load_iris()

# tuple in, tuple out
tup = datasubset((X,Y), 1:10)
@assert typeof(tup) <: Tuple

x_sub, y_sub = datasubset((X,Y), 1:10)
@assert typeof(x_sub) <: SubArray
@assert typeof(y_sub) <: SubArray

train, test = splitobs(X,Y)
# train and test are Tuple of SubArrays 

Now the problem arises for this kind of use:

for (x,y) in eachobs(X,Y)
    # ...
end

if eachobs calls datasubset, it would return a Tuple. The code above would then call start, next, etc. on the Tuple. I would like to avoid overloading the iterator functions for Tuple; hence eachobs returns a DataIterator
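For illustration, here is a rough sketch (0.5-era syntax; the names EachObs and the field layout are hypothetical, not the actual PR code) of how such a wrapper keeps the iteration protocol off Tuple itself:

```julia
# Hypothetical sketch, not the PR implementation. The wrapper owns the
# iteration state, so Base.start/next/done are defined for EachObs only,
# never for Tuple.
immutable EachObs{T}
    data::T      # an Array, or a Tuple of data containers
    count::Int   # number of observations, e.g. from nobs(data)
end

Base.start(::EachObs) = 1
Base.done(iter::EachObs, i) = i > iter.count
# each step yields a lazy view of observation i, e.g. via datasubset
Base.next(iter::EachObs, i) = (datasubset(iter.data, i), i + 1)
```

With this shape, `for (x,y) in eachobs(X,Y)` iterates the wrapper, and only the yielded element is a Tuple of views.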

tbreloff commented 8 years ago

Ok I'm playing around with it and warming up to the idea of always returning tuples/views. I guess I'm wondering if we can merge the DataIterator and DataSubset types. Do they really need to be separate?

Evizero commented 8 years ago

If I find a way to remove one cleanly, I will

tbreloff commented 8 years ago

So the first problem I'm up against... in StochasticOptimization I dispatch the search_direction method on whether I have a single observation or an AbstractSubset (which implies it's a batch). The problem is, I have no idea what a "single observation" looks like... only that it's not an AbstractSubset. A single observation might be a Float64, or it might be a tuple of 3D arrays. So if my loop was essentially:

for batch in eachbatch(data, size=10)
    ...
    sd = search_direction(batch)
    ...
end

then I have no idea whether I'm processing one observation or one batch purely on the types.

tbreloff commented 8 years ago

What if batches/splitobs returned a DataIterator as well, and we just expand the parameters a bit to include the DataSubset functionality at the same time? The goal would be one type that represents "something to be iterated over", and a type parameter from which we could say whether we are going to get observations or batches. I'll play with that idea and see if I can come up with something.
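One way the "single type plus a tag parameter" idea could look (a hedged sketch; ObsTag, BatchTag, and step_kind are made-up names for illustration, in 0.5-era syntax):

```julia
# Hypothetical sketch of dispatching on a tag type parameter.
immutable ObsTag end
immutable BatchTag end

immutable DataIter{TAG,T}
    data::T
end

# downstream code (e.g. a search_direction-like method) could then
# distinguish the two cases purely via dispatch:
step_kind{T}(::DataIter{ObsTag,T})   = :single_observation
step_kind{T}(::DataIter{BatchTag,T}) = :batch
```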

Evizero commented 8 years ago

Then we have the same situation again with the tuples. mhm. I'll think about the problem a bit after work.

Generally I would like to offer a user as little complexity as possible and make other things opt-in.

tbreloff commented 8 years ago

Well if I can dispatch at the MetaLearner call on what type of iterator is passed in (i.e. iterating over observations or batches), then I can somehow pass that information through myself. It's certainly more clunky and the implementation leaks, but it could work. And I agree that most people would prefer the behavior you're going for. I think I have an idea... I'll implement it and push it up as a separate branch so you can look.

Evizero commented 8 years ago

What if we offer a typed API that starts with an upper case? Like Eachobs and Eachbatch, which return special types to dispatch on, but internally use the lower-case methods to perform their duties

tbreloff commented 8 years ago

What if we offer a typed API that starts with an upper case? Like Eachobs and Eachbatch, which return special types to dispatch on, but internally use the lower-case methods to perform their duties

Then we'd be forcing the users to always use an uppercase version when dispatch is needed? Seems confusing and fragile. I'm working on fixing this and merging DataSubset into DataIterator... give me an hour or so.

tbreloff commented 8 years ago

I think this is working out well. Here's a demo:

julia> using MLDataUtils; n=5;  x, y = rand(10,n),rand(n);

julia> datasubset(x)
DataIterator{Array{Float64,2}}: 5 observations

julia> eachobs(x)
DataIterator{Array{Float64,2}}: 5 observations

julia> datasubset(x, [1,2])
DataIterator{Array{Float64,2}}: 2 observations

julia> datasubset(x, 1:2)
DataIterator{Array{Float64,2}}: 2 batches with 2 obs each

julia> eachbatch(x)
DataIterator{Array{Float64,2}}: 5 batches with 1 obs each

julia> eachbatch(x,size=2)
INFO: The specified values for size and/or count will result in 1 unused data points
DataIterator{Array{Float64,2}}: 2 batches with 2 obs each

julia> typeof(shuffled(x))
SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}

julia> batches(x)
DataIterator{Array{Float64,2}}: 5 batches with indices: UnitRange{Int64}[1:1,2:2,3:3,4:4,5:5]

julia> batches(x, size=2)
INFO: The specified values for size and/or count will result in 1 unused data points
DataIterator{Array{Float64,2}}: 2 batches with indices: UnitRange{Int64}[1:2,3:4]

julia> MLDataUtils.isbatches(ans)
true

julia> eachbatch(x,size=2)
INFO: The specified values for size and/or count will result in 1 unused data points
DataIterator{Array{Float64,2}}: 2 batches with 2 obs each

julia> MLDataUtils.isbatches(ans)
true

julia> datasubset(x, [1,2])
DataIterator{Array{Float64,2}}: 2 observations

julia> MLDataUtils.isbatches(ans)
false

julia> splitobs(x, at=(0.2,0.3,0.2))
DataIterator{Array{Float64,2}}: 4 batches with indices: UnitRange{Int64}[1:1,2:3,4:4,5:5]

julia> MLDataUtils.isbatches(ans)
true

julia> for b in splitobs(x, at=0.3)
           @show typeof(b) b
       end
typeof(b) = SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}
b = [0.84198 0.512395; 0.447355 0.324533; 0.671367 0.305329; 0.606857 0.418458; 0.727636 0.738322; 0.836541 0.302326; 0.040053 0.254213; 0.572415 0.874371; 0.884836 0.00895098; 0.652987 0.658197]
typeof(b) = SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}
b = [0.849606 0.0897501 0.256851; 0.192671 0.466011 0.737302; 0.121008 0.00902418 0.125703; 0.18939 0.0111741 0.397602; 0.295394 0.00677543 0.854273; 0.946626 0.967494 0.0430847; 0.213562 0.760838 0.468581; 0.00534368 0.40582 0.650375; 0.50318 0.987332 0.0669072; 0.79452 0.0297965 0.73686]
tbreloff commented 8 years ago

f68fd32

Evizero commented 8 years ago

julia> splitobs(x, at=(0.2,0.3,0.2))
DataIterator{Array{Float64,2}}: 4 batches with indices: UnitRange{Int64}[1:1,2:3,4:4,5:5]

This is exactly what I want to avoid though

tbreloff commented 8 years ago

This is exactly what I want to avoid though

Can you explain why? This seems to be an ideal result:

julia> test,train = splitobs(x, at=0.6)
DataIterator{Array{Float64,2}}: 2 batches with indices: UnitRange{Int64}[1:3,4:5]

julia> typeof(test)
SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}

julia> typeof(train)
SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}
Evizero commented 8 years ago

I'd rather not have everything be an iterator. Especially not when the user decides to do something like sub = viewobs(X, 1:4) if X is an Array

tbreloff commented 8 years ago
julia> viewobs(x) |> typeof
Array{Float64,2}

julia> viewobs(x, 1:4) |> typeof
SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}
Evizero commented 8 years ago

Ok, I obviously need to take a look at the actual code and not just the few examples you posted. My conference call will end soon and I shall give it an honest joyride. Sorry for half-informed pre-emptive judgement.

tbreloff commented 8 years ago

And I think we could take this one step further and do:

abstract DataIterator
type ObsIterator <: DataIterator ... end
type BatchIterator <: DataIterator ... end

and keep most of the same code, except that some of the bonus methods like infinite_obs and infinite_batches would have cleaner implementations.
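To sketch how the split hierarchy could make the infinite variants thin wrappers (hypothetical names and 0.5-era syntax, not the actual code):

```julia
abstract DataIterator
immutable ObsIterator{T}   <: DataIterator; data::T; end
immutable BatchIterator{T} <: DataIterator; data::T; end

# an infinite variant just cycles whichever concrete iterator it wraps
immutable InfiniteIter{I<:DataIterator}
    inner::I
end

Base.start(it::InfiniteIter) = start(it.inner)
function Base.next(it::InfiniteIter, s)
    done(it.inner, s) && (s = start(it.inner))  # wrap around at the end
    next(it.inner, s)
end
Base.done(::InfiniteIter, s) = false  # never finishes
```

This assumes the concrete iterators implement the normal start/next/done protocol; the infinite versions then need no per-type logic.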

tbreloff commented 8 years ago

I added another param and an abstraction (6593e12), and now the infinite methods are working:

julia> for (i,b) in enumerate(infinite_batches(x,y,size=2))
           @show i nobs(b) typeof(b)
           if i>5
               break
           end
       end
i = 1
nobs(b) = 2
typeof(b) = Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false},SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}}
i = 2
nobs(b) = 2
typeof(b) = Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false},SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}}
i = 3
nobs(b) = 2
typeof(b) = Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false},SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}}
i = 4
nobs(b) = 2
typeof(b) = Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false},SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}}
i = 5
nobs(b) = 2
typeof(b) = Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false},SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}}
i = 6
nobs(b) = 2
typeof(b) = Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false},SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}}

julia> for (i,b) in enumerate(infinite_obs(x,y))
           @show i typeof(b)
           if i>5
               break
           end
       end
i = 1
typeof(b) = Tuple{SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true},Float64}
i = 2
typeof(b) = Tuple{SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true},Float64}
i = 3
typeof(b) = Tuple{SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true},Float64}
i = 4
typeof(b) = Tuple{SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true},Float64}
i = 5
typeof(b) = Tuple{SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true},Float64}
i = 6
typeof(b) = Tuple{SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true},Float64}
tbreloff commented 8 years ago

With the tb branch:

INFO: StochasticOptimization tests passed
tbreloff commented 8 years ago

Also, my Reinforce.jl code works as well. So I'm going to go back to working on that, and we can discuss how to move forward with MLDataUtils later when you're ready.

Evizero commented 8 years ago

A few thoughts

The more I play with it and think about it, the more I want a special "view"/"subset" type that is only concerned with representing a lazy subset of the data, nothing more. Having one type that behaves like multiple types doesn't really reduce complexity either; it just makes things more convoluted.

I would go the other way even, introducing a 3rd type by splitting the DataIterator that I have up into a "batch" and "obs" version.

Concerning your dispatch problem: if a user is allowed to call your code with plain arrays that could be either batches or single observations, then you need to be able to handle native types anyway. If this sub-setting is done within your code, then you have the control to use eachbatch and eachobs for the dispatchable types.

I really, really don't want everything to be a true iterator. It feels off and non-Julian to me. Using iterators should be a conscious decision, I think.

tbreloff commented 8 years ago

How do you propose implementing the "infinite" iterators?

Evizero commented 8 years ago

pretty much the same as you

tbreloff commented 8 years ago

It would be nice if:

infinite_obs --> ObsIterator
infinite_batches --> BatchIterator
Evizero commented 8 years ago

I agree. That thought crossed my mind as well