citp / fertility-prediction-challenge-2024

Fertility prediction challenge

Assess hyperparameter grid choices #24

Open · emilycantrell opened this issue 4 months ago

emilycantrell commented 4 months ago

I looked into whether certain hyperparameter values are consistently better than others, to understand whether the values that win do so reliably or whether it seems like just luck.

The attached PDF uses results that I generated with the code we submitted for the second round. For each hyperparameter, the PDF has two plots.

Here was my thought process:

I'm currently not convinced that the conclusions I drew from the first plots about which hyperparameter values to remove from the grid hold up, because I now realize that unstable performance across draws is not necessarily a problem; what we really care about is unstable performance across SAMPLES. However, I'll note that removing the grid values I suggested removing in the comments would reduce our winning F1 score from 0.766 to 0.758.
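For reference, here is a minimal sketch of the kind of per-value aggregation behind these plots. It assumes a hypothetical `results` data frame with one row per (draw, hyperparameter setting) and an `f1` column; the object and column names are illustrative, not our actual code:

```r
library(dplyr)
library(ggplot2)

# Hypothetical tuning output: one row per (draw, hyperparameter setting),
# with the F1 score that setting achieved on that draw.
# results <- data.frame(draw = ..., max_depth = ..., f1 = ...)

# Mean and spread of F1 for each value of one hyperparameter, across draws.
summary_by_value <- results |>
  group_by(max_depth) |>
  summarise(mean_f1 = mean(f1), sd_f1 = sd(f1), .groups = "drop")

ggplot(summary_by_value, aes(x = factor(max_depth), y = mean_f1)) +
  geom_pointrange(aes(ymin = mean_f1 - sd_f1, ymax = mean_f1 + sd_f1)) +
  labs(x = "max_depth", y = "Mean F1 across draws")
```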


examine_hyperparameter_values.pdf

@HanzhangRen Do you have any thoughts on this or suggestions for anything else we might want to investigate regarding the hyperparameter grid and/or the way we choose the winners?

Note: I don't think we should do a grid expansion using the rules you applied in Monkeys, given that in these plots there isn't an upward trend in mean F1 score at either edge of any plot.
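For concreteness, here's a sketch of one common boundary-expansion rule (not necessarily the exact rule used in Monkeys): expand a hyperparameter's grid when its winning value sits at the grid edge. The grid and winning values below are hypothetical:

```r
# Hypothetical grid and winning setting, for illustration only.
grid <- list(
  max_depth = c(2, 4, 6, 8),
  eta       = c(0.01, 0.05, 0.1, 0.3)
)
winner <- list(max_depth = 8, eta = 0.05)

# Flag any hyperparameter whose winner lies on the boundary of its grid.
for (hp in names(grid)) {
  vals <- grid[[hp]]
  if (winner[[hp]] == min(vals) || winner[[hp]] == max(vals)) {
    message(sprintf("Winner for %s is at the grid edge; consider expanding.", hp))
  }
}
```

The plots suggesting no upward trend at the edges are exactly the evidence that this rule wouldn't trigger here.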

Note: I didn't set a seed before I generated the results (edit: I did run the line where the seed is set to 0 in training.R, but I was manually running various chunks, which probably messed up the RNG state). I think that explains why I got a winning F1 score of 0.766 whereas you got 0.773, and it makes me think our time-shift strategy made no difference whatsoever, since the gap between our scores appears to be seed noise rather than an effect of the time shift. Also, due to the different seed state, I didn't get stumps like you did.
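A quick sketch of the reproducibility point (the `set.seed(0)` line is the one mentioned above; the downstream step is illustrative):

```r
# Set the seed once at the top of training.R, before any code that consumes
# random numbers. Re-running chunks manually out of order advances the RNG
# state, so two runs of the "same" script can crown different winners.
set.seed(0)

# Everything downstream that draws random numbers depends on this state,
# e.g. fold assignment (illustrative):
# folds <- sample(rep(1:5, length.out = nrow(train)))
```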

HanzhangRen commented 3 weeks ago

I adjusted the grid a bit because, after I made some other changes, models with deep trees and small learning rates seemed more likely to win. It is not quite clear how much this adjustment helped in terms of the F1 score: the latest model version scores 0.7963717, and reverting to the old grid gives 0.7936770, a very small improvement of about 0.003. I'm not sure it's a real improvement.
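One way to gauge whether that 0.003 is signal or noise would be a paired comparison of the two grids across repeated draws. A sketch, where `f1_new` and `f1_old` are hypothetical vectors of per-draw F1 scores for the new and old grids, paired by draw:

```r
# Hypothetical per-draw F1 scores, one pair per draw.
# f1_new <- c(...); f1_old <- c(...)

diffs <- f1_new - f1_old
mean(diffs)  # observed improvement (~0.003 in our case)

# Paired t-test as a rough check of the improvement against draw-to-draw noise.
t.test(f1_new, f1_old, paired = TRUE)
```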