Closed davidbp closed 6 years ago
I've written some examples; feel free to copy-paste them into the documentation. If you want, I can make a pull request. The examples below can be executed in this notebook: kfolds_examples
I think the printed information is quite informative of what the method is doing and what the kfolds method returns depending on the input it receives (tuple/FoldsView object).
The function MLDataUtils.kfolds has been implemented with 3 different input types in mind:

MLDataUtils.kfolds(Integer, K)
MLDataUtils.kfolds(Array, K)
MLDataUtils.kfolds((Array, Array), K)

MLDataUtils.kfolds(Integer, K)
If the first argument is an integer, it is assumed to be the number of observations. The method returns a tuple: the first element contains K different arrays with the indices for training, and the second element contains K different arrays with the indices for validation.
folds = MLDataUtils.kfolds(15, 5)
typeof(folds)
# Tuple{Array{Array{Int64,1},1},Array{UnitRange{Int64},1}}
folds[1]
# 5-element Array{Array{Int64,1},1}:
# [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
# [1, 2, 3, 7, 8, 9, 10, 11, 12, 13, 14, 15]
# [1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15]
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 13, 14, 15]
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
folds[2]
# 5-element Array{UnitRange{Int64},1}:
# 1:3
# 4:6
# 7:9
# 10:12
# 13:15
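The index folds shown above can be used directly to slice your own data. A minimal sketch in plain Base Julia (the index values are copied from fold 1 above; `X` here is just a stand-in array, not part of the original example):

```julia
# Slice a 4×15 data matrix using the train/validation indices of fold 1 above.
X = rand(4, 15)                               # 15 observations as columns
train_idx = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
val_idx   = 1:3
X_train = X[:, train_idx]                     # 4×12 training portion
X_val   = X[:, val_idx]                       # 4×3 validation portion
```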
MLDataUtils.kfolds(Array, K)
If the first argument is an AbstractArray, it is assumed to contain the observations as columns. The method returns an MLDataPattern.FoldsView containing views of the folds.
X, Y = MLDataUtils.load_iris()
folds = MLDataUtils.kfolds(X, 5)
typeof(folds)
# MLDataPattern.FoldsView{Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Array{Int64,1}},false},SubArray{Float64,2,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},UnitRange{Int64}},true}},Array{Float64,2},LearnBase.ObsDim.Last,Array{Array{Int64,1},1},Array{UnitRange{Int64},1}}
for f in folds
println(size(f[1]), size(f[2]))
end
# (4, 120)(4, 30)
# (4, 120)(4, 30)
# (4, 120)(4, 30)
# (4, 120)(4, 30)
# (4, 120)(4, 30)
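The printed sizes follow from simple arithmetic: the five output lines of `(4, 120)(4, 30)` correspond to 5 folds over the 150 iris observations, with each validation fold receiving `n ÷ k` of them and training getting the rest (plain Base Julia, no MLDataUtils needed):

```julia
# Expected fold sizes for the 150 iris observations split into 5 folds.
n, k = 150, 5
val_size   = n ÷ k          # 30 validation observations per fold
train_size = n - val_size   # 120 training observations per fold
(train_size, val_size)      # (120, 30), matching the sizes printed above
```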
MLDataUtils.kfolds((Array, Array), K)
This method also returns an MLDataPattern.FoldsView containing views of the folds. If folds = MLDataUtils.kfolds((X, Y), 5), then folds[n_fold][n_tr_va][n_X_Y] will return the fold specified by n_fold, containing the train or validation portion according to n_tr_va, and the data or labels according to n_X_Y.
Example with iris data:
X, Y = MLDataUtils.load_iris()
folds = MLDataUtils.kfolds((X, Y), 5)
typeof(folds)
# MLDataPattern.FoldsView{Tuple{Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Array{Int64,1}},false},SubArray{String,1,Array{String,1},Tuple{Array{Int64,1}},false}},Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},UnitRange{Int64}},true},SubArray{String,1,Array{String,1},Tuple{UnitRange{Int64}},true}}},Tuple{Array{Float64,2},Array{String,1}},Tuple{LearnBase.ObsDim.Last,LearnBase.ObsDim.Last},Array{Array{Int64,1},1},Array{UnitRange{Int64},1}}
for f in folds
println(size(f[1][1]), size(f[2][1]))
end
# (4, 120)(4, 30)
# (4, 120)(4, 30)
# (4, 120)(4, 30)
# (4, 120)(4, 30)
# (4, 120)(4, 30)
# fold 1: train data
folds[1][1][1]
# 4×120 SubArray{Float64,2,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Array{Int64,1}},false}:
# 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9 … 6.8 6.7 6.7 6.3 6.5 6.2 5.9
# 3.1 3.4 4.1 4.2 3.1 3.2 3.5 3.6 3.2 3.3 3.0 2.5 3.0 3.4 3.0
# 1.6 1.5 1.5 1.4 1.5 1.2 1.3 1.4 5.9 5.7 5.2 5.0 5.2 5.4 5.1
# 0.2 0.4 0.1 0.2 0.2 0.2 0.2 0.1 2.3 2.5 2.3 1.9 2.0 2.3 1.8
# fold 1: train labels
folds[1][1][2]
# 120-element SubArray{String,1,Array{String,1},Tuple{Array{Int64,1}},false}:
# "setosa"
# "setosa"
# "setosa"
# fold 1: validation data
folds[1][2][1]
# 4×30 SubArray{Float64,2,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},UnitRange{Int64}},true}:
# 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 … 5.1 4.8 5.0 5.0 5.2 5.2 4.7
# 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 3.3 3.4 3.0 3.4 3.5 3.4 3.2
# 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.7 1.9 1.6 1.6 1.5 1.4 1.6
# 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.5 0.2 0.2 0.4 0.2 0.2 0.2
# fold 1: validation labels
folds[1][2][2]
# 30-element SubArray{String,1,Array{String,1},Tuple{UnitRange{Int64}},true}:
# "setosa"
# "setosa"
Very nice. I will definitely think on this and get back to you.
Two little nuggets of information that came to mind while looking through your notebook:
1) kfolds(tuple, K): here the tuple isn't actually limited to two elements. Throughout the MLDataPattern package you'll find that this is never the case. Being able to pass along three data objects instead of two may not be a super common feature, but it is possible.
2) MLDatasets.MNIST has two little helper functions, one of which is called MNIST.convert2features. This function conveniently does the correct reshaping for you (see https://github.com/JuliaML/MLDatasets.jl/blob/master/src/MNIST/README.md).
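The "more than two data objects" point can be emulated in plain Base Julia: one set of fold indices slices any number of parallel data objects, which is essentially what passing a three-element tuple to kfolds would do with views. A sketch (W is a hypothetical per-observation weight vector, made up for illustration):

```julia
# One set of fold indices partitioning three parallel data objects.
X = rand(4, 9)                      # features, observations as columns
Y = rand(9)                         # labels
W = rand(9)                         # hypothetical per-observation weights
val_idx   = 1:3                     # validation indices of the first fold
train_idx = setdiff(1:9, val_idx)
fold1 = ((X[:, train_idx], Y[train_idx], W[train_idx]),   # train triple
         (X[:, val_idx],   Y[val_idx],   W[val_idx]))     # validation triple
```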
Throughout the MLDataPattern package you'll find that never is the case. This may not be that super common of a feature to be able to pass along three data objects instead of two, but it is possible.
If the library allows it, it might be interesting to have a usage example (I can't think of one right now).
In some cases, there are no "official test sets". In such cases it makes sense to have the option to do train/validation/test splits (instead of a fixed test split that you do at the beginning). I could not find an example of how to do this. Obviously we could always make a FoldsView of a FoldsView, but is there an option to do it automatically?
I mean, do a split of this sort:
[tr][tr][tr][tr][tr][va][va][te][te]
[te][te][tr][tr][tr][tr][tr][va][va]
[va][va][te][te][tr][tr][tr][tr][tr]
...
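The rotation drawn above can be sketched in plain Base Julia with circshift over a vector of block roles (this is just an illustration of the pattern, not an MLDataUtils API):

```julia
# Nine equally sized blocks: 5 train, 2 validation, 2 test, rotated by
# 2 blocks per row, reproducing the diagram above.
roles = [:tr, :tr, :tr, :tr, :tr, :va, :va, :te, :te]
for r in 0:2
    println(circshift(roles, 2r))
end
# The second row comes out as [:te, :te, :tr, :tr, :tr, :tr, :tr, :va, :va]
```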
In some cases, there are no "official test sets". In such cases it makes sense to have the option to do train/validation/test splits (instead of a fixed test split that you do at the beginning)
I am torn on whether we should encourage this kind of setup. It seems rather easy to bias your experiment if the "test set" is chosen this way. The fixed test split at the beginning serves the purpose of holding out a test portion during the whole training phase of the experiment. This includes the iterative process of the user staring at sub-optimal results and tweaking hyper-parameters / network architecture to improve the accuracy. Peeking at the test error over and over leaks information. The validation set should suffice for this.
Why do you want to do it this way? Convenience? "Disconnecting" early stopping from the error estimation? Is there literature doing it this way?
The fixed testsplit at the beginning serves the purpose of holding out a test portion during the whole training phase of the experiment.
Well, that depends on how you use the K test/validation sets (and what you mean). My explanation was prone to misunderstanding anyway. Sometimes algorithms (like you pointed out in the case of early stopping) might benefit from splitting your training data into 2 parts to have some data for deciding when to stop learning. For example, some scikit-learn models have a validation_fraction, and when you do model.fit(X, Y), you divide X into 2 sets. When you then do a grid search, you are actually doing 3 splits like I wrote.
Maybe it's prone to misuse, and I have no special need for such a feature, so forget about it :). Saying validation data in my second post was probably misleading; each piece of validation data was meant to refer to a test set.
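The validation_fraction idea mentioned above can be sketched in plain Base Julia: hold out a fraction of a fold's training indices for early stopping (the variable names here are made up for illustration, not a scikit-learn or MLDataUtils API):

```julia
# Hold out 10% of a fold's training indices for early stopping.
train_idx = collect(1:120)                    # training indices of one fold
validation_fraction = 0.1
n_val = round(Int, length(train_idx) * validation_fraction)   # 12
fit_idx        = train_idx[1:end-n_val]       # 108 indices used to fit
early_stop_idx = train_idx[end-n_val+1:end]   # 12 indices watched for stopping
```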
I just updated the README to reflect how FoldsView is displayed now. I think this should avoid confusion.
In the main page of MLDataUtils there is this example showing how to use the kfolds function:
I would erase this example and write two examples, 1) and 2):
1) One taking a 2D Array (or N-D array), showing that the method returns views of the array which can be used directly (fed to a model, for example). In fact, the source code suggests that the data does not need to be an Array; what can we feed to the function and expect it to work properly? (Can we feed a DataFrame?) 2) One taking an integer value and returning the indices that the user can then use to make the partitions.
The current example I see seems to be a case-1 example. Nevertheless, it is quite misleading, since it can be understood as an example of getting the indices of a dataset with 10 samples.
How does kfolds know whether the columns of the Array are the samples or not? Observations as columns is the preferred convention in Julia, but DataFrames, for example, do not work this way.
Maybe it should also mention the types inside the package: why does the function return a FoldsView instead of a view?
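Regarding the question of where the observations are assumed to live: the FoldsView type printed earlier includes LearnBase.ObsDim.Last, i.e. observations along the last dimension (columns); if I recall correctly, MLDataPattern functions also accept an obsdim argument to override this. Another option, which needs no package at all, is to take the index folds from kfolds(nobs, K) and slice rows yourself. A plain Base Julia sketch for row-wise data:

```julia
# Row-wise data: 150 observations as rows instead of columns.
Xrows = rand(150, 4)
val_idx   = 1:30                              # validation indices of one fold
train_idx = setdiff(1:150, val_idx)
train, val = Xrows[train_idx, :], Xrows[val_idx, :]
size(train), size(val)                        # ((120, 4), (30, 4))
```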