JuliaML / MLDataUtils.jl

Utility package for generating, loading, splitting, and processing Machine Learning datasets
http://mldatautilsjl.readthedocs.io/
Other
102 stars 20 forks source link

On Dev `unique(eachobs(data))` is broken for the obvious reason #25

Closed oxinabox closed 5 years ago

oxinabox commented 7 years ago
data = [[1,2,3] [1,2,3] [10,20,30] [100,200,300] [10,20,30] ]
unique(eachobs(data))

output

3-element Array{Array{Int64,1},1}:
 [10,20,30]
 [10,20,30]
 [10,20,30]

For the obvious reasons.

Do you think it is reasonable to overload collect, unique and what ever other functions in Base break for BufferGetObs to give errors when you try and use them? Or maybe wrap the BufferGetObs in another view to make it safe?

Evizero commented 7 years ago

mhm, not a bad idea to overload collect and friends. In general, do you agree that by default it makes sense that eachobs is buffering?

Evizero commented 7 years ago

There is ObsView as the safe alternative

oxinabox commented 7 years ago

In general, do you agree that by default it makes sense that eachobs is buffering?

I'm not sure there is a right answer. Pros and Cons. Fast by default is good. Safe by default is good.

I think overloading collect etc to give errors would help a fair bit. Obviously can't overload everything from all packages. But Base, and idk anything from JuliaLang would be good. I wonder if anyone on discourse has thoughts on this.

I also use this buffer pattern in CorpusLoaders.jl