JuliaML / MLDataUtils.jl

Utility package for generating, loading, splitting, and processing Machine Learning datasets
http://mldatautilsjl.readthedocs.io/

Make folds function documentation more clear #35

Closed davidbp closed 6 years ago

davidbp commented 7 years ago

On the main page of MLDataUtils there is this example showing how to use the kfolds function:

folds = kfolds([1,2,3,4,5,6,7,8,9,10], k = 5)
# 5-element MLDataPattern.FoldsView{Tuple{SubArray{Int64,1,Array{Int64,1},Tuple{Array{Int64,1}},false},SubArray{Int64,1,Array{Int64,1},Tuple{UnitRange{Int64}},true}},Array{Int64,1},LearnBase.ObsDim.Last,Array{Array{Int64,1},1},Array{UnitRange{Int64},1}}:
#  ([3,4,5,6,7,8,9,10],[1,2])
#  ([1,2,5,6,7,8,9,10],[3,4])
#  ([1,2,3,4,7,8,9,10],[5,6])
#  ([1,2,3,4,5,6,9,10],[7,8])
#  ([1,2,3,4,5,6,7,8],[9,10])

I would replace this example with two examples, 1) and 2):

1) One taking a 2D Array (or n-dimensional array), showing that the method returns views of the array which can be used directly (fed to a model, for example). In fact, the source code suggests that data does not need to be an Array; what can we feed to the function and expect it to work properly? (Can we feed a DataFrame?)

2) One taking an integer value and returning the indices that the user can then use to make the partitions.

The current example seems to be a case 1) example. Nevertheless, it is quite misleading, since it can also be read as an example of getting the indices of a dataset with 10 samples.

How does kfolds know whether the columns of the Array are the samples or not? Columns-as-observations is the preferred convention in Julia, but DataFrames, for example, do not work this way.
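
On the observation-dimension point, the MLDataPattern functions generally take an obsdim argument; a minimal sketch, assuming kfolds accepts the obsdim keyword and that the rows hold the observations:

X = rand(100, 4)                    # hypothetical data: 100 observations stored as rows
folds = kfolds(X, 5, obsdim = 1)    # obsdim = 1: dimension 1 indexes the observations

X_train, X_val = folds[1]           # expected sizes: 80×4 and 20×4, if obsdim is respected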

Maybe the documentation should also mention the types inside the package: why does the function return a FoldsView instead of a plain view?

davidbp commented 7 years ago

I've written some examples; feel free to copy-paste them into the documentation. If you want, I can make a pull request. The examples below can be executed in this notebook: kfolds_examples

I think the printed information is quite informative about what the method is doing and what kfolds returns depending on the input it receives (a tuple or a FoldsView object).

The function MLDataUtils.kfolds has been implemented with three different input types in mind:

MLDataUtils.kfolds(Integer, K)

folds = MLDataUtils.kfolds(15, 5)

typeof(folds)
# Tuple{Array{Array{Int64,1},1},Array{UnitRange{Int64},1}}

folds[1]
#5-element Array{Array{Int64,1},1}:
# [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
# [1, 2, 3, 7, 8, 9, 10, 11, 12, 13, 14, 15]
# [1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15]
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 13, 14, 15]   
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]   

folds[2]
#5-element Array{UnitRange{Int64},1}:
#1:3  
#4:6  
#7:9  
#10:12
#13:15
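
These index vectors can then be used to slice any data container by hand; a minimal sketch (X and its 4×15 shape are just an illustrative placeholder):

X = rand(4, 15)                                # hypothetical data: 15 observations as columns
train_indices, val_indices = MLDataUtils.kfolds(15, 5)

for (tr, va) in zip(train_indices, val_indices)
    X_train = X[:, tr]                         # 4×12 training block
    X_val   = X[:, va]                         # 4×3 validation block
end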

MLDataUtils.kfolds(Array, K)

X, Y = MLDataUtils.load_iris()
folds = MLDataUtils.kfolds(X, 5)

typeof(folds)
# MLDataPattern.FoldsView{Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Array{Int64,1}},false},SubArray{Float64,2,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},UnitRange{Int64}},true}},Array{Float64,2},LearnBase.ObsDim.Last,Array{Array{Int64,1},1},Array{UnitRange{Int64},1}}

for f in folds
    println(size(f[1]), size(f[2]))
end

# (4, 120)(4, 30)
# (4, 120)(4, 30)
# (4, 120)(4, 30)
# (4, 120)(4, 30)
# (4, 120)(4, 30)
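
Each element of the FoldsView is a (train, validation) pair of views into X, so a single fold can be destructured and its views fed to a model directly, without copying; a minimal sketch:

X_train, X_val = folds[1]
size(X_train)    # (4, 120)
size(X_val)      # (4, 30)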

MLDataUtils.kfolds((Array, Array), K)

Summary

If folds = MLDataUtils.kfolds((X, Y), k), then folds[n_fold][n_tr_va][n_X_Y] returns the fold specified by n_fold, containing the training or validation split according to n_tr_va and the data or labels according to n_X_Y.

Example with iris data

X, Y = MLDataUtils.load_iris()
folds = MLDataUtils.kfolds((X, Y), 5)

typeof(folds)
# MLDataPattern.FoldsView{Tuple{Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Array{Int64,1}},false},SubArray{String,1,Array{String,1},Tuple{Array{Int64,1}},false}},Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},UnitRange{Int64}},true},SubArray{String,1,Array{String,1},Tuple{UnitRange{Int64}},true}}},Tuple{Array{Float64,2},Array{String,1}},Tuple{LearnBase.ObsDim.Last,LearnBase.ObsDim.Last},Array{Array{Int64,1},1},Array{UnitRange{Int64},1}}

for f in folds
    println(size(f[1][1]), size(f[2][1]))
end

# (4, 120)(4, 30)
# (4, 120)(4, 30)
# (4, 120)(4, 30)
# (4, 120)(4, 30)
# (4, 120)(4, 30)

# fold 1: train data
folds[1][1][1]
# 4×120 SubArray{Float64,2,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Array{Int64,1}},false}:
# 4.8  5.4  5.2  5.5  4.9  5.0  5.5  4.9  …  6.8  6.7  6.7  6.3  6.5  6.2  5.9
# 3.1  3.4  4.1  4.2  3.1  3.2  3.5  3.6     3.2  3.3  3.0  2.5  3.0  3.4  3.0
# 1.6  1.5  1.5  1.4  1.5  1.2  1.3  1.4     5.9  5.7  5.2  5.0  5.2  5.4  5.1
# 0.2  0.4  0.1  0.2  0.2  0.2  0.2  0.1     2.3  2.5  2.3  1.9  2.0  2.3  1.8

# fold 1: train labels
folds[1][1][2]
# 120-element SubArray{String,1,Array{String,1},Tuple{Array{Int64,1}},false}:
# "setosa"
# "setosa"
# "setosa"

# fold 1: validation data
folds[1][2][1]
# 4×30 SubArray{Float64,2,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},UnitRange{Int64}},true}:
# 5.1  4.9  4.7  4.6  5.0  5.4  4.6  5.0  …  5.1  4.8  5.0  5.0  5.2  5.2  4.7
# 3.5  3.0  3.2  3.1  3.6  3.9  3.4  3.4     3.3  3.4  3.0  3.4  3.5  3.4  3.2
# 1.4  1.4  1.3  1.5  1.4  1.7  1.4  1.5     1.7  1.9  1.6  1.6  1.5  1.4  1.6
# 0.2  0.2  0.2  0.2  0.2  0.4  0.3  0.2     0.5  0.2  0.2  0.4  0.2  0.2  0.2

# fold 1: validation labels
folds[1][2][2]
#30-element SubArray{String,1,Array{String,1},Tuple{UnitRange{Int64}},true}:
# "setosa"
# "setosa"

Evizero commented 7 years ago

Very nice. I will definitely think on this and get back to you.

Two little nuggets of information came to mind while looking through your notebook.

davidbp commented 7 years ago

About:

Throughout the MLDataPattern package you'll find that never is the case. This may not be that super common of a feature to be able to pass along three data objects instead of two, but it is possible.

If the library allows it, it might be interesting to have a usage example (I can't think of one right now).

Another feature request: allowing easy train/validation/test split cross-validation

In some cases there is no "official" test set. In such cases it makes sense to have the option to do train/validation/test splits (instead of a fixed test split that you do at the beginning), but I could not find an example of how to do this. Obviously we could always make a FoldsView of a FoldsView (as sketched below), but is there an option to do it automatically?

I mean a split of this sort:

[tr][tr][tr][tr][tr][va][va][te][te]
[te][te][tr][tr][tr][tr][tr][va][va]
[va][va][te][te][tr][tr][tr][tr][tr]
...
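
For reference, a nested use of kfolds (a FoldsView of a FoldsView) already gives this kind of rotation; a minimal sketch, reusing X and Y from the iris example above (the fold counts and the shuffleobs call are just illustrative choices):

for (train_val, test) in MLDataUtils.kfolds(MLDataUtils.shuffleobs((X, Y)), 5)
    for (train, val) in MLDataUtils.kfolds(train_val, 4)
        # fit on `train`, tune / early-stop on `val`,
        # and report the error of the chosen model on `test`
    end
end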

Evizero commented 7 years ago

In some cases there is no "official" test set. In such cases it makes sense to have the option to do train/validation/test splits (instead of a fixed test split that you do at the beginning)

I am torn on whether we should encourage this kind of setup. It seems rather easy to bias your experiment if the "test set" is chosen this way. The fixed test split at the beginning serves the purpose of holding out a test portion during the whole training phase of the experiment. This includes the iterative process of the user staring at the sub-optimal results and tweaking hyper-parameters / network architecture to improve the accuracy. Peeking at the test error over and over leaks information. The validation set should suffice for this.

Why do you want to do it this way? Convenience? "Disconnecting" early stopping from the error estimation? Is there literature doing it this way?

davidbp commented 7 years ago

The fixed test split at the beginning serves the purpose of holding out a test portion during the whole training phase of the experiment.

Well, that depends on how you use the K test/validation sets (and what you mean). My explanation was prone to misunderstanding anyway. Sometimes algorithms (like you pointed out in the case of early stopping) might benefit from splitting your training data into two parts, so that there is some data on which to decide when to stop learning. For example, some scikit-learn models have a validation_fraction, and when you call model.fit(X, Y) you divide X into two sets. When you then do a grid search, you are actually doing three splits like I wrote.
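
The MLDataUtils counterpart of such a validation_fraction would be a plain splitobs of the training portion; a minimal sketch, assuming X_train and Y_train hold the training data (the 0.9 ratio is arbitrary):

(X_fit, Y_fit), (X_stop, Y_stop) = MLDataUtils.splitobs(MLDataUtils.shuffleobs((X_train, Y_train)), at = 0.9)
# fit on (X_fit, Y_fit) and watch the error on (X_stop, Y_stop) to decide when to stop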

Maybe it's prone to misuse, and I have no special need for such a feature, so forget about it :). Saying "validation data" in my second post was probably misleading; each piece of validation data was meant to refer to a test set.

Evizero commented 6 years ago

I just updated the README to reflect how FoldsView is displayed now. I think this should avoid confusion.