UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Question about Cross-Validation for a downstream task #984

Open PaulForInvent opened 3 years ago

PaulForInvent commented 3 years ago

Hey,

do you think I should use cross-validation on my training data while fine-tuning a model for semantic search (and a similarity task)?

Surprisingly, I have always ignored this...

nreimers commented 3 years ago

If you perform an ablation (e.g. which model, which loss, or which parameters work best), then using CV can make sense if it is computationally feasible.

PaulForInvent commented 3 years ago

@nreimers Thanks.

Just tried to do it, but I saw that for k-fold you of course need a SubsetRandomSampler. In my case I use the SentencesLabelDataset, which is an IterableDataset and cannot be used with a sampler. That is bad.

Is it possible to have the SentencesLabelDataset as a normal Dataset?

nreimers commented 3 years ago

It would be better to first create the folds, and then re-init your SentencesLabelDataset.

PaulForInvent commented 3 years ago

It would be better to first create the folds, and then re-init your SentencesLabelDataset.

So you propose to create the folds without any PyTorch dataset? But isn't it possible to change the SentencesLabelDataset into a normal dataset by replacing yield with return, e.g.?

nreimers commented 3 years ago

I think it is easier to first create your different folds, and then create a new SentencesLabelDataset from it.
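A minimal sketch of that approach, assuming the data is a plain list of labeled InputExample objects and that SentenceLabelDataset from sentence_transformers.datasets is the class meant here (the toy examples and the StratifiedKFold choice are illustrative):

```python
from sklearn.model_selection import StratifiedKFold
from sentence_transformers import InputExample
from sentence_transformers.datasets import SentenceLabelDataset

# Toy labeled examples; in practice these are the real training sentences.
examples = [
    InputExample(texts=["a sentence about cats"], label=0),
    InputExample(texts=["another cat sentence"], label=0),
    InputExample(texts=["a sentence about dogs"], label=1),
    InputExample(texts=["another dog sentence"], label=1),
] * 5
labels = [ex.label for ex in examples]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(examples, labels)):
    train_examples = [examples[i] for i in train_idx]
    val_examples = [examples[i] for i in val_idx]

    # Re-init the label-aware dataset from the fold's raw examples instead of
    # trying to wrap the IterableDataset with an index-based sampler.
    train_dataset = SentenceLabelDataset(train_examples)
    # ... build a DataLoader, loss, and evaluator from train_dataset / val_examples
```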

PaulForInvent commented 3 years ago

I think it is easier to first create your different folds

For this, I would like to do it with a dataset and a SubsetRandomSampler to sample the folds in a PyTorch way. Or how would you create the folds?

PhilipMay commented 3 years ago

Maybe you want to have a look here: https://github.com/German-NLP-Group/xlsr

In this script: https://github.com/German-NLP-Group/xlsr/blob/main/xlsr/train_optuna_stsb.py

There I use cross-validation, as I think it is useful.
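The general pattern of that script (suggest hyperparameters once per trial, train and score each fold, return the mean) can be sketched roughly like this; the train_one_fold helper is a hypothetical placeholder for the actual training code in the linked script:

```python
import numpy as np
import optuna
from sklearn.model_selection import KFold

def train_one_fold(train_idx, val_idx, lr, epochs):
    """Hypothetical helper: fine-tune a fresh model on the train split and
    return the validation score of that fold (e.g. Spearman correlation)."""
    return 0.0  # placeholder; the linked script shows the real training code

def objective(trial):
    # Hyperparameters are suggested once per trial and reused for every fold.
    lr = trial.suggest_float("lr", 1e-5, 1e-4, log=True)
    epochs = trial.suggest_int("epochs", 1, 4)

    n_examples = 1000  # size of the training set
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = [
        train_one_fold(train_idx, val_idx, lr, epochs)
        for train_idx, val_idx in kf.split(np.arange(n_examples))
    ]
    return float(np.mean(scores))  # Optuna optimizes the mean CV score

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
```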

PhilipMay commented 3 years ago

I prefer to use cross-validation when I do automated hyperparameter search. The reasons are:

PaulForInvent commented 3 years ago

@PhilipMay Thanks. Maybe using simple arrays is better. I wanted to do it like here: https://www.machinecurve.com/index.php/2021/02/03/how-to-use-k-fold-cross-validation-with-pytorch/

But I think the SentencesLabelDataset could be rewritten as a simple map-style dataset.
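A minimal sketch of what such a map-style wrapper could look like (this is an illustration, not the library's class; it only stores the examples so that index-based samplers like SubsetRandomSampler work):

```python
from torch.utils.data import Dataset

class LabeledSentenceDataset(Dataset):
    """Illustrative map-style wrapper around a list of labeled InputExample
    objects, so index-based samplers such as SubsetRandomSampler can be used."""

    def __init__(self, examples):
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]
```

The catch is that batch-hard losses still need batches containing several examples per label, so the label-aware sampling the iterable dataset performs would have to be reproduced with a custom batch sampler; re-initializing the dataset per fold, as suggested above, avoids that.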

I saw you are tuning optimizer parameters like weight decay. Did you find any improvement from that? I found that tuning the learning rate is not very useful (at least in my case).

PhilipMay commented 3 years ago

@PaulForInvent

Here is the Optuna Importance Plot

[image: Optuna parameter importance plot]

PhilipMay commented 3 years ago

@PaulForInvent

and the slice plot

[image: Optuna slice plot]

PaulForInvent commented 3 years ago

I asked this myself too.

https://github.com/UKPLab/sentence-transformers/issues/791

PaulForInvent commented 3 years ago

Now I have a different issue. Since I mainly use batch-hard losses, I have examples with their class labels. Up to now, for evaluation I used a ranking metric on a separate validation set. Now I wonder how to evaluate my model on each fold, since both data sets are now structured the same way (i.e. they are just labeled examples). I could use an evaluation metric that checks whether the class label is predicted correctly (multi-class task), or a triplet evaluator...

My main task is actually ranking, so I would also like to do a ranking evaluation for each fold... but since my folds are fixed, I cannot set up a full ranking task and just have to use the available samples of each class (possibly with a ParaphraseMiningEvaluator)?

Oh, this just came to my mind: has someone used a combination of a ranking metric like MRR and a binary metric like precision for evaluation (and for parameter tuning)? @PhilipMay @nreimers
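One simple way to combine the two into a single tuning objective (purely illustrative; the harmonic-mean choice and the example numbers are assumptions, not something from this thread):

```python
def combined_score(mrr, precision, eps=1e-9):
    """Harmonic mean of a ranking metric (MRR) and a binary metric (precision).
    The harmonic mean penalises settings that do well on only one of the two
    metrics; a weighted sum would be the simpler alternative."""
    return 2 * mrr * precision / (mrr + precision + eps)

# Use the combined value as the single objective for parameter tuning, e.g.:
print(combined_score(0.62, 0.48))  # -> roughly 0.54
```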

PaulForInvent commented 3 years ago

@nreimers :

I wonder whether your ParaphraseMiningEvaluator or BinaryClassificationEvaluator ignores self-references when calculating the cosine scores of a list of sentences against itself?

nreimers commented 3 years ago

It computes whatever you pass as your data. The ParaphraseMiningEvaluator ignores self references.
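For reference, a hedged usage sketch; the constructor arguments shown (a sentence map plus a list of known duplicate id pairs) reflect my reading of the evaluator's API, and the model name is just an example:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import ParaphraseMiningEvaluator

# id -> sentence; every sentence is compared against all others, and pairs of
# a sentence with itself are not counted.
sentences = {
    "0": "A man is eating food.",
    "1": "A man is eating a piece of bread.",
    "2": "A woman is playing violin.",
}
gold_pairs = [("0", "1")]  # known paraphrase pairs among the ids above

evaluator = ParaphraseMiningEvaluator(sentences, duplicates_list=gold_pairs, name="fold-0")
model = SentenceTransformer("all-MiniLM-L6-v2")
score = evaluator(model)  # single score (average precision) usable for model selection
print(score)
```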

PaulForInvent commented 3 years ago

@PhilipMay I just saw that you are drawing the parameters anew in each fold. I did the same thing. But shouldn't the parameters be the same for all folds?

I am also trying to find out how to build the final model after using CV to find the best parameters. This seems to be a heavily discussed topic...

Should I then retrain the model using all training data? Also, despite setting a seed, you cannot guarantee that each model trained with the same parameters yields the same results... So should you save each model during CV and then continue fine-tuning on all the data? Is there any standard way that you have found to work well? @nreimers

PhilipMay commented 3 years ago

I just saw that you are drawing the parameters anew in each fold.

No, it just seems like that. When you draw them from Optuna multiple times, the second and all following calls return the same value until the trial is over.

I did the same thing. But shouldn't the parameters be the same for all folds?

They should (must) all be the same.
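That behaviour is easy to verify (small sketch using Optuna's public API):

```python
import optuna

def objective(trial):
    # Suggesting the same parameter name twice inside one trial returns the
    # identical value, so every fold of that trial sees the same HPs.
    lr_first = trial.suggest_float("lr", 1e-5, 1e-4, log=True)
    lr_second = trial.suggest_float("lr", 1e-5, 1e-4, log=True)
    assert lr_first == lr_second
    return 0.0

optuna.create_study().optimize(objective, n_trials=3)
```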

PhilipMay commented 3 years ago

Should I then retrain the model using all training data? Also, despite setting a seed, you cannot guarantee that each model trained with the same parameters yields the same results... So should you save each model during CV and then continue fine-tuning on all the data? Is there any standard way that you have found to work well? @nreimers

I hate seeds and do not use them when doing HP optimization with CV. I just do many CV steps and average them. CV is only about hyperparameter finding, not about model creation.

When I want to create the "best" final model, I train the model with the best HP set on the full dataset at the end.
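In code, that final step might look roughly like this, continuing the Optuna sketch further up; train_on_full_data and all_examples are hypothetical placeholders:

```python
# After the CV-based hyperparameter search has finished:
best_params = study.best_params            # e.g. {"lr": 3e-5, "epochs": 2}
print("best mean CV score:", study.best_value)

# Train one final model with the best hyperparameters on ALL training data.
# train_on_full_data is a hypothetical helper that mirrors train_one_fold,
# but without holding out a validation split.
final_model = train_on_full_data(all_examples, **best_params)
final_model.save("output/final-model")
```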

PaulForInvent commented 3 years ago

I hate seeds and do not use them when doing HP optimization with CV. I just do many CV steps and average them. CV is only about hyperparameter finding, not about model creation.

When I want to create the "best" final model, I train the model with the best HP set on the full dataset at the end.

Yes. That is straightforward if the model always behaves the same in every training run with the same hyperparameters. I found that with some loss types the results vary (sometimes strongly) when training the same model each time. But I feel I am the only one having this problem... That's why I sometimes set seeds and save each model for each parameter set. But then I can only take that trained model and continue training it on all the data... which is maybe different from training from scratch with the best found parameters.

PaulForInvent commented 3 years ago

@PhilipMay What is your experience with randomness? If I do an HP search and try to retrain the model with the found parameters, I always get different results. So just finding the HPs is not meaningful, as it is not reproducible...

PhilipMay commented 3 years ago

@PaulForInvent just because there is randomness and you get different results does not mean it is not useful.

For example: I have use cases with small data sets (6000 examples) where I do 10-fold cross-validation. The result is the mean of the folds. That helps to reduce the effect of randomness.
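scikit-learn's RepeatedKFold expresses that pattern directly; a small sketch with a placeholder scoring step:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

indices = np.arange(6000)                  # indices of the labeled examples
rkf = RepeatedKFold(n_splits=10, n_repeats=3, random_state=0)

scores = []
for train_idx, val_idx in rkf.split(indices):
    fold_score = 0.0                       # placeholder: train and evaluate one fold here
    scores.append(fold_score)

print(f"mean={np.mean(scores):.4f}  std={np.std(scores):.4f}")
```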

PhilipMay commented 1 year ago

By the way, I saw that the stsb dataset has duplicate sentences in the train set. So doing cross-validation might not be a good idea, since you might have information leakage from train to validation...

@PaulForInvent
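A way to guard against that kind of leakage is to split by sentence groups so identical sentences never land on both sides of a fold; a small sketch using scikit-learn's GroupKFold (the toy pairs are illustrative):

```python
from sklearn.model_selection import GroupKFold

# Toy STS-style pairs: (sentence1, sentence2, similarity score)
pairs = [
    ("A plane is taking off.", "An air plane is taking off.", 5.0),
    ("A plane is taking off.", "A man is playing a flute.", 0.5),
    ("A woman is eating.", "A woman is eating something.", 4.6),
    ("A woman is eating.", "A man is eating.", 3.0),
]

# Group by the first sentence so duplicated sentences stay inside one fold.
groups = [s1 for s1, _s2, _score in pairs]

gkf = GroupKFold(n_splits=2)
for train_idx, val_idx in gkf.split(pairs, groups=groups):
    train_sents = {pairs[i][0] for i in train_idx}
    val_sents = {pairs[i][0] for i in val_idx}
    assert not train_sents & val_sents  # no first-sentence overlap across the split
```

Grouping only on the first sentence is a simplification; to rule out leakage completely one would have to group on both sentences of each pair (for example via connected components of shared sentences), but the idea is the same.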