JuliaML / MLDataUtils.jl

Utility package for generating, loading, splitting, and processing Machine Learning datasets
http://mldatautilsjl.readthedocs.io/

FoldsView show method too complex? #36

Closed davidbp closed 7 years ago

davidbp commented 7 years ago

Right now a FoldsView object prints a lot of information when it is displayed. In my opinion it would make more sense to show only the relevant information.

Example

The output for the iris data, for instance, is not very understandable:

X_iris, Y_iris = MLDataUtils.load_iris()
folds = kfolds(X_iris, 10)
10-element FoldsView(::Array{Float64,2}, ::Array{Array{Int64,1},1}, ::Array{UnitRange{Int64},1}, ObsDim.Last()) with element type Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Array{Int64,1}},false},SubArray{Float64,2,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},UnitRange{Int64}},true}}:
 ([5.7 5.4 … 6.2 5.9; 4.4 3.9 … 3.4 3.0; 1.5 1.3 … 5.4 5.1; 0.4 0.4 … 2.3 1.8], [5.1 4.9 … 4.3 5.8; 3.5 3.0 … 3.0 4.0; 1.4 1.4 … 1.1 1.2; 0.2 0.2 … 0.1 0.2])
 ([5.1 4.9 … 6.2 5.9; 3.5 3.0 … 3.4 3.0; 1.4 1.4 … 5.4 5.1; 0.2 0.2 … 2.3 1.8], [5.7 5.4 … 5.2 4.7; 4.4 3.9 … 3.4 3.2; 1.5 1.3 … 1.4 1.6; 0.4 0.4 … 0.2 0.2])
 ([5.1 4.9 … 6.2 5.9; 3.5 3.0 … 3.4 3.0; 1.4 1.4 … 5.4 5.1; 0.2 0.2 … 2.3 1.8], [4.8 5.4 … 5.0 5.1; 3.1 3.4 … 3.5 3.8; 1.6 1.5 … 1.6 1.9; 0.2 0.4 … 0.6 0.4])
 ([5.1 4.9 … 6.2 5.9; 3.5 3.0 … 3.4 3.0; 1.4 1.4 … 5.4 5.1; 0.2 0.2 … 2.3 1.8], [4.8 5.1 … 6.6 5.2; 3.0 3.8 … 2.9 2.7; 1.4 1.6 … 4.6 3.9; 0.3 0.2 … 1.3 1.4])
 ([5.1 4.9 … 6.2 5.9; 3.5 3.0 … 3.4 3.0; 1.4 1.4 … 5.4 5.1; 0.2 0.2 … 2.3 1.8], [5.0 5.9 … 6.1 6.4; 2.0 3.0 … 2.8 2.9; 3.5 4.2 … 4.7 4.3; 1.0 1.5 … 1.2 1.3])
 ([5.1 4.9 … 6.2 5.9; 3.5 3.0 … 3.4 3.0; 1.4 1.4 … 5.4 5.1; 0.2 0.2 … 2.3 1.8], [6.6 6.8 … 5.6 5.5; 3.0 2.8 … 3.0 2.5; 4.4 4.8 … 4.1 4.0; 1.4 1.4 … 1.3 1.3])
 ([5.1 4.9 … 6.2 5.9; 3.5 3.0 … 3.4 3.0; 1.4 1.4 … 5.4 5.1; 0.2 0.2 … 2.3 1.8], [5.5 6.1 … 6.3 6.5; 2.6 3.0 … 2.9 3.0; 4.4 4.6 … 5.6 5.8; 1.2 1.4 … 1.8 2.2])
 ([5.1 4.9 … 6.2 5.9; 3.5 3.0 … 3.4 3.0; 1.4 1.4 … 5.4 5.1; 0.2 0.2 … 2.3 1.8], [7.6 4.9 … 7.7 6.0; 3.0 2.5 … 2.6 2.2; 6.6 4.5 … 6.9 5.0; 2.1 1.7 … 2.3 1.5])
 ([5.1 4.9 … 6.2 5.9; 3.5 3.0 … 3.4 3.0; 1.4 1.4 … 5.4 5.1; 0.2 0.2 … 2.3 1.8], [6.9 5.6 … 6.3 6.1; 3.2 2.8 … 2.8 2.6; 5.7 4.9 … 5.1 5.6; 2.3 2.0 … 1.5 1.4])
 ([5.1 4.9 … 6.3 6.1; 3.5 3.0 … 2.8 2.6; 1.4 1.4 … 5.1 5.6; 0.2 0.2 … 1.5 1.4], [7.7 6.3 … 6.2 5.9; 3.0 3.4 … 3.4 3.0; 6.1 5.6 … 5.4 5.1; 2.3 2.4 … 2.3 1.8])

Maybe if it printed something like...

MLDataPattern.FoldsView(data=X_iris, n_samples=150, n_folds=10, tr_sizes=(4,135), va_sizes=(4,15))

it would be easier to grasp what the type gives you.

Evizero commented 7 years ago

Mhm, I see your point. The main thing here is that a FoldsView is a subtype of AbstractVector, so we don't actually hijack the printing here; it's done by Base code.
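
A minimal sketch of that behaviour, for illustration only. LazyPairs is a hypothetical type, not anything from MLDataUtils. Because it subtypes AbstractVector and defines only size and getindex, the header and element listing it shows come from Base's generic array display:

```julia
# Hypothetical stand-in for a lazy container; this is not the FoldsView type.
struct LazyPairs <: AbstractVector{Tuple{Int,Int}}
    n::Int
end

# Minimal read-only vector interface: size and integer indexing.
Base.size(v::LazyPairs) = (v.n,)
Base.getindex(v::LazyPairs, i::Int) = (i, i^2)   # elements are computed on access

v = LazyPairs(3)

# No show method is defined above, so Base's generic AbstractArray
# display produces the summary line and the element listing.
show(stdout, MIME"text/plain"(), v)
```

The longer the element type's name, the longer that automatically generated summary line becomes, which is what happens with the nested SubArray tuples above.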

davidbp commented 7 years ago

Is it really intended to be a vector? When I think of a vector I think of operations in a vector space. It doesn't seem that this type will ever need any sort of algebra. I see it as a "placeholder" containing useful information.

At first I even thought there was no need for a type to contain the folds. I thought we could use an array (or array of pairs/triplets ...) of views. I now think that having a type can facilitate further abstractions, so I'm OK with it; it's just that I see too much stuff that is not meaningful to me when printing. In the example above, the following info is printed

 FoldsView(::Array{Float64,2}, ::Array{Array{Int64,1},1}, ::Array{UnitRange{Int64},1}, ObsDim.Last()) with element type Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Array{Int64,1}},false},SubArray{Float64,2,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},UnitRange{Int64}},true}}

which seems too much.

Evizero commented 7 years ago

> Is it really intended to be a vector?

Well,

> I thought we could use an array (or array of pairs/triplets ...) of views

it is exactly a lazy version of that.

> it's just that I see too much stuff that is not meaningful to me when printing.

I fully agree there.

Evizero commented 7 years ago

To be a little more concrete: I am in favour of hijacking show to print more meaningful info.

> MLDataPattern.FoldsView(data=X_iris, n_samples=150, n_folds=10, tr_sizes=(4,135), va_sizes=(4,15))

The main reason I don't like this specific version is that it looks like code with which one could construct the same object.

Maybe some multi-line summary instead:

10-element FoldsView of 150 observations:
  data: (4×150 Array{Float64,2}, 2-element Array{Float64,1})
  training: 135 observations
  validation: 15 observations
  obsdim: ObsDim.Last()

Keep in mind that the data need not be arrays.
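
A rough sketch of how such a summary could be produced, assuming a hypothetical DemoFolds type that stores only the number of observations and folds. It is not the FoldsView implementation, and the layout is just one possibility:

```julia
# Hypothetical fold container used only to illustrate a custom display.
struct DemoFolds <: AbstractVector{Tuple{Vector{Int},Vector{Int}}}
    nobs::Int
    k::Int
end

Base.size(f::DemoFolds) = (f.k,)

# Fold i holds out one contiguous block of indices for validation;
# all remaining indices form the training set.
function Base.getindex(f::DemoFolds, i::Int)
    1 <= i <= f.k || throw(BoundsError(f, i))
    sz = f.nobs ÷ f.k                      # assumes nobs is divisible by k
    val = collect(((i - 1) * sz + 1):(i * sz))
    train = setdiff(1:f.nobs, val)
    return (train, val)
end

# Overriding the three-argument show replaces the generic array display
# with a short multi-line summary.
function Base.show(io::IO, ::MIME"text/plain", f::DemoFolds)
    train, val = f[1]
    println(io, length(f), "-element DemoFolds of ", f.nobs, " observations:")
    println(io, "  training: ", length(train), " observations")
    print(io, "  validation: ", length(val), " observations")
end

folds = DemoFolds(150, 10)
summary_text = sprint(show, MIME"text/plain"(), folds)
print(summary_text)
```

Only the MIME"text/plain" method is overridden here, so indexing and iteration still behave like any other AbstractVector.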

davidbp commented 7 years ago

I never thought about

> The main reason I don't like this specific version is that it looks like code with which one could construct the same object.

It's a good point. I do like having a type's info on a single line when using it, but I guess that's a personal preference. Having the info as in

10-element FoldsView of 150 observations:
  data: (4×150 Array{Float64,2}, 2-element Array{Float64,1})
  training: 135 observations
  validation: 15 observations
  obsdim: ObsDim.Last()

is definitely an improvement for the user.

Could you expand on what

obsdim: ObsDim.Last()

means?

Evizero commented 7 years ago

ObsDim is a dispatchable way to allow for different conventions as to what denotes an observation (e.g. rows vs. columns). We want to support both.

See http://mldatapatternjl.readthedocs.io/en/latest/documentation/container.html#observation-dimension
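
As a loose illustration of what "dispatchable" means here: the marker types and functions below are hypothetical and only mimic the idea. The real types live in the ObsDim module and are described in the linked documentation.

```julia
# Hypothetical marker types; they only imitate the idea behind ObsDim.
abstract type DemoObsDim end
struct ObsFirst <: DemoObsDim end   # each row is an observation
struct ObsLast  <: DemoObsDim end   # each column is an observation

# Dispatch on the marker type selects which dimension counts observations.
count_obs(X::AbstractMatrix, ::ObsFirst) = size(X, 1)
count_obs(X::AbstractMatrix, ::ObsLast)  = size(X, 2)

# The same marker decides how a single observation is sliced out.
pick_obs(X::AbstractMatrix, i::Int, ::ObsFirst) = view(X, i, :)
pick_obs(X::AbstractMatrix, i::Int, ::ObsLast)  = view(X, :, i)

X = rand(4, 150)                    # 4 features, 150 observations as columns

n_cols_as_obs = count_obs(X, ObsLast())
n_rows_as_obs = count_obs(X, ObsFirst())
first_obs = pick_obs(X, 1, ObsLast())
```

With a 4×150 matrix like the iris features above, the column convention gives 150 observations of 4 features each, while the row convention would treat the same matrix as 4 observations.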

Evizero commented 7 years ago

Fixed with https://github.com/JuliaML/MLDataPattern.jl/pull/16