IBM / AutoMLPipeline.jl

A package that makes it trivial to create and evaluate machine learning pipeline architectures.

Control splitting of data in folds when calling crossvalidate #107

Closed ngiann closed 2 years ago

ngiann commented 2 years ago

I think that whenever crossvalidate gets called, the data are split into different folds each time:

using Random, DataFrames
using AutoMLPipeline

function test()

        X = randn(100, 30)        # fake inputs
        T = [ones(50); -ones(50)] # fake targets

        # call crossvalidate three times, reseeding the global RNG before each call

        Random.seed!(1)
        rf = SKLearner("RandomForestClassifier", Dict(:n_estimators => 30, :random_state => 0));
        m1, s, = crossvalidate(rf, DataFrame(X, :auto), T, "balanced_accuracy_score", 10)

        Random.seed!(1)
        rf = SKLearner("RandomForestClassifier", Dict(:n_estimators => 30, :random_state => 0));
        m2, s, = crossvalidate(rf, DataFrame(X, :auto), T, "balanced_accuracy_score", 10)

        Random.seed!(1)
        rf = SKLearner("RandomForestClassifier", Dict(:n_estimators => 30, :random_state => 0));
        m3, s, = crossvalidate(rf, DataFrame(X, :auto), T, "balanced_accuracy_score", 10)

        # return the three mean scores (they differ, even though seed and learner are identical)
        return m1, m2, m3

end

I looked in the code and saw that Kfold from MLBase is called at this line. Unfortunately, MLBase provides no control over the creation of the folds, i.e. no way to set a random seed that governs the generation of the random folds.

Obviously this is not a bug, but I just wanted to point this out.

If crossvalidate returns different folds every time it is called, then comparing different learners becomes problematic as they are not trained and tested on the same data subsets.

If this issue is worth addressing, then I will consider opening a PR.

Thanks.

ppalmes commented 2 years ago

try: using Random; Random.seed!(123)

Using a specific seed before you call crossvalidate should always return the same result. Let me know if this is not the case.

ngiann commented 2 years ago

I updated the code in the example above accordingly, but it seems the issue persists.

ppalmes commented 2 years ago

Interesting. I'm not at the computer right now. Can you try calling the same seed with the same rf object twice? Seed 1, then crossvalidate rf; seed 1 again, then crossvalidate the same rf. I suspect that if you start with a new initialization, sklearn creates different rf trees. Also, can you try the Julia random forest instead of sklearn? I just want to see the consistency of their implementations.

jrf = RandomForest()

ppalmes commented 2 years ago

By the way, you can still compare different algorithms with different splits, because cross-validation returns a mean and standard deviation. You can use ANOVA or a t-test to compare two algorithms using their mean and std. This is the whole point of cross-validation: it is a good estimate of the average performance of the algorithm, especially if you use leave-one-out (n-1) cross-validation.
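
A minimal sketch of that kind of comparison, assuming you keep the per-fold scores of each learner (HypothesisTests.jl is used here for the t-test; it is not part of AutoMLPipeline, and the scores below are synthetic placeholders):

using HypothesisTests

scores_a = 0.80 .+ 0.02 .* randn(10)           # per-fold scores of learner A (placeholder values)
scores_b = 0.76 .+ 0.02 .* randn(10)           # per-fold scores of learner B (placeholder values)

t = UnequalVarianceTTest(scores_a, scores_b)   # Welch's two-sample t-test on the fold scores
pvalue(t)                                      # a small p-value suggests a significant difference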

ppalmes commented 2 years ago

By the way, you are welcome to create a PR. I suggest you create a setseed function that, when called, sets the seed of the Julia random number generator as well as the sklearn seed, and I think the numpy seed too, because Python seeding is implemented separately in different libraries. Python doesn't have one global seeding mechanism similar to Julia's.
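
A minimal sketch of such a helper, assuming the sklearn wrapper goes through PyCall (the function name and the exact set of seeds to touch are open for discussion):

using Random, PyCall

function setseed(seed::Integer)
    Random.seed!(seed)                    # Julia's global RNG
    pyimport("numpy").random.seed(seed)   # numpy's global RNG, which sklearn falls back on
    pyimport("random").seed(seed)         # Python's own random module
    return nothing
end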

ngiann commented 2 years ago

Interesting. I'm not at the computer right now. Can you try calling the same seed with the same rf object twice?

I commented out the two subsequent calls to SKLearner("RandomForestClassifier",Dict(:n_estimators=>30,:random_state=>0)); but the results are still different.

When I try to use the Julia random forest I get some errors about "mixed types".

ngiann commented 2 years ago

By the way, you can still compare different algorithms with different splits, because cross-validation returns a mean and standard deviation. You can use ANOVA or a t-test to compare two algorithms using their mean and std. This is the whole point of cross-validation: it is a good estimate of the average performance of the algorithm, especially if you use leave-one-out (n-1) cross-validation.

True. However, in my case I am interested in optimising an objective function that is based on the performance of an algorithm as estimated by cross-validation. If cross-validation uses new folds every time it is called, then my objective function becomes "noisy".

ngiann commented 2 years ago

Culprit seems to be the following:

using MLBase

k = Kfold(10, 3)
collect(k)      # one random partition of 1:10 into 3 folds
k = Kfold(10, 3)
collect(k)      # a different partition: each Kfold call draws a new random permutation

Every time we call Kfold it returns different partitions (which is of course not unreasonable).

My current thinking revolves around the following (somewhat dirty) workaround for this line where Kfold is called. One could do something like:

using Random, MLBase

rg = MersenneTwister(seed)              # seed supplied by (or stored in) crossvalidate
aux = Kfold(length(Y), nfolds)          # Y and nfolds as already available at that line
aux.permseq .= randperm(rg, length(Y))  # overwrite the internal permutation with a seeded one
folds = collect(aux)

ppalmes commented 2 years ago

You don't want the same folds, I think, unless you want reproducibility. The idea of different folds is to cover different combinations and permutations of subsets of the data, because you are estimating an unknown parameter. If it is always the same folds, you won't cover the other combinations and your estimate will be biased.

ppalmes commented 2 years ago

Also, if you set the same seed before you call Kfold, the folds should be the same. But you don't want to stick to the same seed every time you create the folds: you want the sequence of random numbers starting from a seed to be as unique as possible and to generate unique folds.
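
As a quick check of that first point (assuming Kfold draws its permutation from Julia's global RNG, which is what the MLBase constructor appears to do):

using Random, MLBase

Random.seed!(1); a = collect(Kfold(10, 3))
Random.seed!(1); b = collect(Kfold(10, 3))
a == b   # true: reseeding before the call reproduces the same folds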

ppalmes commented 2 years ago

The idea of cross-validation is that the final estimate should not depend on the particular random splits. It should arrive at similar results, because you are estimating a population parameter from random resamples. The final estimate should not depend on the folds, because all cross-validation runs should arrive at similar estimates if the data/algorithm is well behaved and the average performance with its std follows a normal distribution.

ngiann commented 2 years ago

Performance should not depend on the folds, I agree. But you also introduce noise into your estimate because of the resampling. The more folds you include the better, I suppose, but practically you can't have them all, so you make a compromise (in my case I settle for 10 folds). Maximising objectives that are based on cross-validated estimates is not an unusual practice.

ppalmes commented 2 years ago

Yeah. If you are comparing two algorithms and their average performances are not much different, then the variability (std) needs to be tighter to check whether there is a significant difference. There is always a trade-off between bias and variance.

ngiann commented 2 years ago

But typically one does compare algorithms on the same folds, no? (Sorry for all this digression from the original issue.) Indeed, as you pointed out above, reproducibility is the main issue.

ppalmes commented 2 years ago

Statistically, you don't need the same folds. It's like tossing a coin: if the coin is balanced, the particular sequence of heads and tails won't matter, because if you measure enough times it comes out 50/50. Same with cross-validation: if the data is well behaved, the splits won't matter, because inferentially it will estimate the population average performance as long as you have enough splits. It is a single value that you want to estimate, and cross-validation tries to produce that single value with some standard deviation. If the estimate fluctuates a lot, as shown by the standard deviation, it is possible your algorithm is very greedy and cannot reach a global optimum. For example, if you use a neural network with a very small learning rate relative to the optimal one, gradient search, which is a greedy algorithm, will be trapped in local optima that depend on your weight and parameter starting points. It can be that the data is well behaved but the algorithm is unstable. Random forests and SVMs are quite stable algorithms, so you can use them as baselines. The data can also be imbalanced, which causes a big problem in estimating the performance of the algorithm. You need to address data bias as well as the stability of the algorithms.

ngiann commented 2 years ago

Indeed you don't need the same folds. Also, there is nothing wrong with the practice of comparing on the same folds. I think the way to increase the robustness of your statistics is to increase the number of folds rather than randomise them.

ppalmes commented 2 years ago

No. The whole of inferential statistics is based on random numbers. You need uniform random sampling because the central limit theorem starts from this assumption: every sample should be equally likely, and each sample should have an equal chance of being picked. Randomization is an important ingredient for avoiding biased sampling. If you use the same folds all the time, your comparison will be biased. It is better to rely on randomization and measure many times to reduce the standard deviation. The more random your choice, the more confident you are that you have covered all the possibilities for measuring the performance of your algorithm. If there is a dominating result, it will come out, and the poor performances, which might happen but much less often, will just end up in the tail of the performance distribution as outliers.

ppalmes commented 2 years ago

To measure performance, you should rely on randomization to avoid observation bias, because someone could pick a specific seed that makes their algorithm look good. In a real evaluation, the choice of starting seed should also be random to avoid observation bias. Measure many times and randomize the choices to make the measurement of performance as fair as possible, by making sure all possibilities have an equal chance of being picked, unless the data needs to be handled differently due to stratification, imbalanced classes, etc.

ppalmes commented 2 years ago

One thing I would suggest for measuring performance: run 30x 10-fold cross-validation and take the mean and std of the 30 means. Why 30? Because 30 is the magic number to ensure that the performance distribution is approximately normal, i.e. follows the bell-shaped distribution. The only thing to consider is whether the process takes a lot of time, but you can always use @distributed in Julia to run each cross-validation independently and in parallel.
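
A minimal sketch of that scheme, reusing the fake data and the crossvalidate call from the first example in this thread (serial here for clarity; since each run is independent, the loop body can be parallelized with Distributed's @distributed or pmap):

using Statistics

X = randn(100, 30)        # same fake data as in the first example
T = [ones(50); -ones(50)]

means = map(1:30) do _
    rf = SKLearner("RandomForestClassifier", Dict(:n_estimators => 30, :random_state => 0))
    m, s, = crossvalidate(rf, DataFrame(X, :auto), T, "balanced_accuracy_score", 10)
    m                     # keep the mean score of this 10-fold run
end

mean(means), std(means)   # grand mean and spread of the 30 run means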

ngiann commented 2 years ago

thanks for the advice!