@Evizero overall this is coming along really nicely... thanks for putting in the time
I had to introduce an immutable type DataIterator to make eachobs work. This is a consequence of not boxing Tuple into DataSubset. The alternative would be to call batches with batch-size 1, which seems like a waste, since it allocates an array of N DataSubsets.
I saw... certainly better than using DataSubsets!
I think I could adapt this one type DataIterator so that it could also serve as a batch stream. Then we could offer eachbatch as well, which would be a true iterator, while batches is computed eagerly and remains as it is now.
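Roughly, that eager/lazy distinction could be sketched like this (hypothetical helper names, not the package code): batches materializes a Vector of views up front, while an eachbatch-style iterator produces each batch view only when the loop asks for it.

```julia
# Hedged sketch of the eager/lazy distinction, not the package code:
batch_ranges(n, size) = [i:min(i + size - 1, n) for i in 1:size:n]

# eager: allocate a Vector with one view per batch up front
batches_sketch(data, size)   = [datasubset(data, r) for r in batch_ranges(nobs(data), size)]

# lazy: a generator that produces each batch view only when iterated
eachbatch_sketch(data, size) = (datasubset(data, r) for r in batch_ranges(nobs(data), size))
```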
shoo coveralls. come back when I am done here. You make me look bad!
Ok, so the stuff I have implemented so far works pretty well and is type stable. KFolds might be a little tricky to port, but other than that it should be smooth sailing now.
Nice!
Allowing both getobs and viewobs is a nice idea (as long as viewobs is the default!)
I changed how getobs works a bit. getobs(A::Any) will return A itself, but calling getobs(A::SubArray) will return copy(A) (and getobs(s::DataSubset) results in getobs(s.data, s.indices)). This way the user can just call getobs(..) in the innermost loop before using the data/batch in order to benefit from cache locality etc. There is also viewobs, which is just a different name for datasubset.
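As a minimal sketch of the behaviour just described (paraphrased; the bodies are illustrative, not the actual implementation, and assume the DataSubset type and two-argument getobs from this PR):

```julia
# Paraphrased sketch of the described fallbacks, not the actual code:
getobs(A) = A                                      # plain data is passed through untouched
getobs(A::SubArray) = copy(A)                      # materialize a view into a dense copy
getobs(s::DataSubset) = getobs(s.data, s.indices)  # delegate to the wrapped data and indices

# so the copy happens only where the user asks for it, e.g. in the innermost loop:
# for batch in eachbatch(X, size = 10)
#     x = getobs(batch)   # dense copy right before the expensive computation
#     ...
# end
```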
as long as viewobs is the default!
They are not really used internally except from one function (Edit: no longer). So it is just user-facing code. All the functions such as eachobs (which sounds like one would expect it to getobs every index), batches, and shuffled remain lazy as ever and call datasubset directly (which to be fair is the same as viewobs atm, but that may change).
remain lazy as ever and call datasubset directly
That's what I mean... I want to make sure everything is lazy unless you explicitly ask for copies through collect/getobs.
I checked out refactor0.5, and commented out the iterators in StochasticOptimization. I'll try to get the surrounding ecosystem up to speed and tested.
nice! There is still some new stuff missing and I will need to work on a different project for a few hours soon. But I will resume this PR later today.
No worries... I think I'm the only one using this stuff for any real projects. (I mean... I hope I'm the only one :smile: )
Question... can DataIterator and DataSubset at least subtype from a common abstract? AbstractSubset? They seem to have very similar behavior... maybe I don't quite understand their distinction yet
Well to be fair, DataIterator doesn't need getobs and nobs, and DataSubset doesn't need to support the iterator pattern. These things are only implemented because they can be. DataSubset alone is not enough since there is no reliable way to implement eachobs then for tuples, or eachbatch for that matter.
DataSubset alone is not enough since there is no reliable way to implement eachobs then for tuples, or eachbatch for that matter.
I'm still confused here... I thought my implementation in StochasticOptimization handled any type just fine.
yea, but you boxed Tuple into a DataSubset, while this doesn't. This is a consequence of allowing types to offer their own views if they have them, which is the reason that one never has to deal with DataSubset when working with Array variables.
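To illustrate the two approaches, here is a simplified sketch (helper names are mine, not the actual implementations): mapping datasubset over the tuple elements yields a Tuple of SubArrays, while the boxing approach hides the whole tuple behind one wrapper type.

```julia
# Simplified sketch of the two approaches, not the actual implementations:

# (a) tuple in, tuple out: each element provides its own view type
datasubset_tuple(tup::Tuple, idx) = map(data -> datasubset(data, idx), tup)

# (b) boxing: the whole tuple is wrapped in a single subset type
immutable BoxedSubsetSketch{T,I}
    data::T       # e.g. (X, Y)
    indices::I
end

# With (a), a user working with plain Arrays only ever sees SubArrays:
# x_sub, y_sub = datasubset_tuple((X, Y), 1:10)
# typeof(x_sub) <: SubArray && typeof(y_sub) <: SubArray
```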
This does a copy, which I was really hoping to avoid:
julia> using MLDataUtils
INFO: Recompiling stale cache file /home/tom/.julia/lib/v0.5/MLDataUtils.ji for module MLDataUtils.
julia> x,y = rand(10,2), rand(10)
(
[0.523045 0.975809; 0.963106 0.955312; … ; 0.215673 0.988101; 0.555086 0.376569],
[0.582727,0.160317,0.983288,0.0741725,0.789932,0.841186,0.119106,0.534921,0.0328525,0.0659002])
julia> for o in eachobs(x)
@show typeof(o), o
end
(typeof(o),o) = (Array{Float64,1},[0.523045,0.963106,0.0167039,0.928771,0.726239,0.656165,0.976564,0.350933,0.215673,0.555086])
(typeof(o),o) = (Array{Float64,1},[0.975809,0.955312,0.85547,0.578021,0.959107,0.769584,0.114884,0.31311,0.988101,0.376569])
this is exactly our discussion above, which I will now change to not copy
I had to introduce an immutable type DataIterator to make eachobs work. This is a consequence of not boxing Tuple into DataSubset.
Reading back through the conversation, I think this is what I don't yet understand. My implementation worked well (I thought) by boxing the tuple inside the DataSubset... I don't see why a tuple of DataSubsets would be more performant. What functionality didn't work?
I would need to either box everything into a DataSubset again, or overload functions for tuples, which would interfere with outside code when MLDataUtils is loaded
Are you talking specifically about the case of eachbatch-style iteration? You mean that each batch would need to be re-wrapped with a DataSubset, but you want to return a tuple of views?
I am saying that I would like the following
X, Y = load_iris()
# tuple in, tuple out
tup = datasubset((X,Y), 1:10)
@assert typeof(tup) <: Tuple
x_sub, y_sub = datasubset((X,Y), 1:10)
@assert typeof(x_sub) <: SubArray
@assert typeof(y_sub) <: SubArray
train, test = splitobs(X,Y)
# train and test are Tuple of SubArrays
Now the problem arises for this kind of use:
for (x,y) in eachobs(X,Y)
# ...
end
If eachobs calls datasubset, it would return a Tuple. The code above would then call start, next etc. for the Tuple. I would like to avoid overloading the iterator functions for Tuple, hence eachobs returns a DataIterator.
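To make that concrete, here is a hypothetical sketch of such a wrapper (not the actual DataIterator; it assumes datasubset also accepts a single observation index): iteration is defined on the wrapper, so Base's iterator functions never need to be overloaded for Tuple.

```julia
# Hypothetical sketch, not the actual DataIterator implementation:
immutable EachObsSketch{T}
    data::T       # an Array, or a Tuple of data containers
    count::Int    # number of observations, e.g. nobs(data)
end

Base.start(::EachObsSketch) = 1
Base.done(iter::EachObsSketch, i) = i > iter.count
# each step yields a view (or a Tuple of views) for observation i
Base.next(iter::EachObsSketch, i) = (datasubset(iter.data, i), i + 1)
Base.length(iter::EachObsSketch) = iter.count

# for (x, y) in EachObsSketch((X, Y), nobs(X))
#     ...
# end
```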
Ok I'm playing around with it and warming up to the idea of always returning tuples/views. I guess I'm wondering if we can merge the DataIterator and DataSubset types. Do they really need to be separate?
If I find a way to remove one cleanly, I will
So the first problem I'm up against... in StochasticOptimization I dispatch the search_direction method on whether I have a single observation or an AbstractSubset (which implies it's a batch). The problem is, I have no idea what a "single observation" looks like... only that it's not an AbstractSubset. A single observation might be a Float64, or it might be a tuple of 3D arrays. So if my loop was essentially:
for batch in eachbatch(data, size=10)
...
sd = search_direction(batch)
...
end
then I have no idea whether I'm processing one observation or one batch purely on the types.
What if batches/splitobs returned a DataIterator as well, and we just expand the parameters a bit to include the DataSubset functionality at the same time? The goal would be one type that represents "something to be iterated over", and a type parameter from which we could say whether we are going to get observations or batches. I'll play with that idea and see if I can come up with something.
Then we have the same situation again with the tuples. mhm. I'll think about the problem a bit after work.
Generally I would like to offer a user as little complexity as possible and make other things opt-in.
Well if I can dispatch at the MetaLearner call on what type of iterator is passed in (i.e. iterating over observations or batches), then I can somehow pass that information through myself. It's certainly more clunky and the implementation leaks, but it could work. And I agree that most people would prefer the behavior you're going for. I think I have an idea... I'll implement it and push it up as a separate branch so you can look.
What if we offer a typed API that starts with an upper case? Like Eachobs and Eachbatch, which return special types to dispatch on, but internally use the lower case methods to perform their duties.
What if we offer a typed API that starts with an upper case? Like Eachobs and Eachbatch, which return special types to dispatch on, but internally use the lower case methods to perform their duties.
Then we'd be forcing the users to always use an uppercase version when dispatch is needed? Seems confusing and fragile. I'm working on fixing this and merging DataSubset into DataIterator... give me an hour or so.
I think this is working out well. The changes:
- The S parameter of DataIterator can be a vector of Int (old DataSubset) or a vector of Int-vectors (old DataSubsets)
- Base.show to more clearly show what type of iterator we have
- An isbatches method which tells us if we're getting an observation or a batch at each iteration

Here's a demo:
julia> using MLDataUtils; n=5; x, y = rand(10,n),rand(n);
julia> datasubset(x)
DataIterator{Array{Float64,2}}: 5 observations
julia> eachobs(x)
DataIterator{Array{Float64,2}}: 5 observations
julia> datasubset(x, [1,2])
DataIterator{Array{Float64,2}}: 2 observations
julia> datasubset(x, 1:2)
DataIterator{Array{Float64,2}}: 2 batches with 2 obs each
julia> eachbatch(x)
DataIterator{Array{Float64,2}}: 5 batches with 1 obs each
julia> eachbatch(x,size=2)
INFO: The specified values for size and/or count will result in 1 unused data points
DataIterator{Array{Float64,2}}: 2 batches with 2 obs each
julia> typeof(shuffled(x))
SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}
julia> batches(x)
DataIterator{Array{Float64,2}}: 5 batches with indices: UnitRange{Int64}[1:1,2:2,3:3,4:4,5:5]
julia> batches(x, size=2)
INFO: The specified values for size and/or count will result in 1 unused data points
DataIterator{Array{Float64,2}}: 2 batches with indices: UnitRange{Int64}[1:2,3:4]
julia> MLDataUtils.isbatches(ans)
true
julia> eachbatch(x,size=2)
INFO: The specified values for size and/or count will result in 1 unused data points
DataIterator{Array{Float64,2}}: 2 batches with 2 obs each
julia> MLDataUtils.isbatches(ans)
true
julia> datasubset(x, [1,2])
DataIterator{Array{Float64,2}}: 2 observations
julia> MLDataUtils.isbatches(ans)
false
julia> splitobs(x, at=(0.2,0.3,0.2))
DataIterator{Array{Float64,2}}: 4 batches with indices: UnitRange{Int64}[1:1,2:3,4:4,5:5]
julia> MLDataUtils.isbatches(ans)
true
julia> for b in splitobs(x, at=0.3)
@show typeof(b) b
end
typeof(b) = SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}
b = [0.84198 0.512395; 0.447355 0.324533; 0.671367 0.305329; 0.606857 0.418458; 0.727636 0.738322; 0.836541 0.302326; 0.040053 0.254213; 0.572415 0.874371; 0.884836 0.00895098; 0.652987 0.658197]
typeof(b) = SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}
b = [0.849606 0.0897501 0.256851; 0.192671 0.466011 0.737302; 0.121008 0.00902418 0.125703; 0.18939 0.0111741 0.397602; 0.295394 0.00677543 0.854273; 0.946626 0.967494 0.0430847; 0.213562 0.760838 0.468581; 0.00534368 0.40582 0.650375; 0.50318 0.987332 0.0669072; 0.79452 0.0297965 0.73686]
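A hedged sketch of how the isbatches distinction above might be expressed (paraphrasing the description of the S parameter; this is not the code from the branch):

```julia
# Paraphrased sketch, not the code from the branch:
immutable IterSketch{T,S}
    data::T
    indices::S    # Vector of Int -> observations, Vector of index-vectors -> batches
end

# we are iterating batches iff each element of `indices` is itself a collection of indices
isbatches(iter::IterSketch) = eltype(iter.indices) <: AbstractVector

# isbatches(IterSketch(x, [1, 2, 3]))     # false: one Int per observation
# isbatches(IterSketch(x, [1:2, 3:4]))    # true:  one index range per batch
```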
julia> splitobs(x, at=(0.2,0.3,0.2)) DataIterator{Array{Float64,2}}: 4 batches with indices: UnitRange{Int64}[1:1,2:3,4:4,5:5]
This is exactly what I want to avoid though
This is exactly what I want to avoid though
Can you explain why? This seems to be an ideal result:
julia> test,train = splitobs(x, at=0.6)
DataIterator{Array{Float64,2}}: 2 batches with indices: UnitRange{Int64}[1:3,4:5]
julia> typeof(test)
SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}
julia> typeof(train)
SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}
I'd rather not have everything be an iterator. Especially not when the user decides to do something like sub = viewobs(X, 1:4) if X is an Array:
julia> viewobs(x) |> typeof
Array{Float64,2}
julia> viewobs(x, 1:4) |> typeof
SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}
Ok, I obviously need to take a look at the actual code and not just the few examples you posted. My conference call will end soon and I shall give it an honest joyride. Sorry for half-informed pre-emptive judgement.
And I think we could take this one step further and do:
abstract DataIterator
type ObsIterator <: DataIterator ... end
type BatchIterator <: DataIterator ... end
and keep most of the same code, except that some of the bonus methods like infinite_obs and infinite_batches would have cleaner implementations.
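For instance, with such a split an infinite observation stream could be little more than the following (hypothetical sketch, not the actual infinite_obs; it assumes datasubset accepts a single observation index):

```julia
# Hypothetical sketch, not the actual infinite_obs implementation:
immutable InfiniteObsSketch{T}
    data::T
    count::Int    # nobs(data)
end

Base.start(::InfiniteObsSketch) = nothing
Base.done(::InfiniteObsSketch, state) = false     # never terminates on its own
Base.next(iter::InfiniteObsSketch, state) =
    (datasubset(iter.data, rand(1:iter.count)), nothing)
Base.iteratorsize{T}(::Type{InfiniteObsSketch{T}}) = Base.IsInfinite()

# for (i, obs) in enumerate(InfiniteObsSketch((x, y), nobs(x)))
#     i > 5 && break
# end
```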
I added another param and an abstraction (6593e12), and now the infinite methods are working:
julia> for (i,b) in enumerate(infinite_batches(x,y,size=2))
@show i nobs(b) typeof(b)
if i>5
break
end
end
i = 1
nobs(b) = 2
typeof(b) = Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false},SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}}
i = 2
nobs(b) = 2
typeof(b) = Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false},SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}}
i = 3
nobs(b) = 2
typeof(b) = Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false},SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}}
i = 4
nobs(b) = 2
typeof(b) = Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false},SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}}
i = 5
nobs(b) = 2
typeof(b) = Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false},SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}}
i = 6
nobs(b) = 2
typeof(b) = Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false},SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}}
julia> for (i,b) in enumerate(infinite_obs(x,y))
@show i typeof(b)
if i>5
break
end
end
i = 1
typeof(b) = Tuple{SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true},Float64}
i = 2
typeof(b) = Tuple{SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true},Float64}
i = 3
typeof(b) = Tuple{SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true},Float64}
i = 4
typeof(b) = Tuple{SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true},Float64}
i = 5
typeof(b) = Tuple{SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true},Float64}
i = 6
typeof(b) = Tuple{SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true},Float64}
With the tb branch:
INFO: StochasticOptimization tests passed
Also, my Reinforce.jl code works as well. So I'm going to go back to working on that, and we can discuss how to move forward with MLDataUtils later when you're ready.
A few thoughts:
- splitobs and batches returning an iterator just feels wrong (what's the difference to eachbatch then?), and I think I will make them behave as they are now, returning a Vector.
- viewobs(my_sparse_array, 1:5) (which atm throws an error) returning something called a DataIterator also doesn't seem right. That could be fixed by changing the name of the Type somewhat.
- What would eachbatch(datasubset(my_sparse_array)) do? Right now it nests DataIterator and querying it throws an error. I suppose that could be fixed though.

The more I play with it and think about it, the more I want a special "view"/"subset" type that is only concerned with representing a lazy subset of the data, nothing more. It doesn't really reduce complexity to have one type that behaves like multiple types either, it just makes things more convoluted. I would go the other way even, introducing a 3rd type by splitting the DataIterator that I have up into a "batch" and an "obs" version.
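To make that concrete, a hypothetical sketch of the three types being proposed (names and fields are illustrative, not the final design):

```julia
# Illustrative sketch only, not the final design:
immutable SubsetSketch{T,I}
    data::T
    indices::I          # purely a lazy "view"/"subset" of the data, nothing more
end

abstract AbstractIterSketch

immutable ObsIterSketch{T} <: AbstractIterSketch
    data::T             # iterate one observation at a time
end

immutable BatchIterSketch{T} <: AbstractIterSketch
    data::T
    size::Int           # iterate `size` observations at a time
end

# Downstream code (e.g. a learner's outer loop) can then dispatch on the kind
# of iterator it is handed:
# update!(learner, iter::ObsIterSketch)   = ...   # per-observation updates
# update!(learner, iter::BatchIterSketch) = ...   # mini-batch updates
```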
Concerning your dispatch problem: if a user is allowed to call your code with plain arrays that can either be batches or single observations, then you need to be able to handle native types anyways. If this sub-setting is done within your code, then you have the control to use eachbatch and eachobs for the dispatchable types.
I really really don't want everything to be a true iterator. It feels off and non-Julian to me. Using iterators should be a conscious decision, I think.
How do you propose implementing the "infinite" iterators?
pretty much the same as you
It would be nice if:
infinite_obs --> ObsIterator
infinite_batches --> BatchIterator
I agree. That thought crossed my mind as well
WIP implementation for #13
I have updated DataSubset and its tests to 0.5 so far. Next I'll port Tom's new verbs (in this PR):
- datasubset, getobs and nobs
- eachobs and shuffled: implementation and tests
- eachbatch: documentation and tests
- DataIterator: documentation
- splitobs and batches: implementation and tests
- viewobs: implementation and tests
- DataIterator into BatchIterator and ObsIterator
- filterobs: implementation and tests
- infinite_batches and infinite_obs
- KFolds, kfolds, leaveout: implementation and tests

Also I shall try to update the package documentation a bit and try to avoid breaking changes if I can.