citp / fertility-prediction-challenge-2024

Fertility prediction challenge

Assess hyperparameter grid choices #24

Open · emilycantrell opened this issue 4 months ago

emilycantrell commented 4 months ago

I looked into whether certain hyperparameter values are consistently better than others, to understand whether the values that win do so reliably or whether it seems like just luck.

The attached PDF uses results that I generated with the code we submitted for the second round. For each hyperparameter, the PDF has two plots.

Here was my thought process:

I'm currently not convinced that the conclusions I drew from the first plots about which hyperparameter values to remove from the grid hold up, because I now realize that unstable performance across draws is not necessarily a problem; what we really care about is unstable performance across SAMPLES. However, I'll note that removing the grid values I suggested removing in the comments would reduce our winning F1 score from 0.766 to 0.758.
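For reference, here is a minimal sketch of the kind of per-value aggregation behind these plots. It assumes a hypothetical `results` data frame with one row per (draw, hyperparameter setting) and an `f1` column; the object and column names are illustrative, not our actual code:

```r
library(dplyr)
library(ggplot2)

# Hypothetical tuning output: one row per (draw, hyperparameter setting),
# with the F1 score that setting achieved on that draw.
# results <- data.frame(draw = ..., max_depth = ..., f1 = ...)

# Mean and spread of F1 for each value of one hyperparameter, across draws.
summary_by_value <- results |>
  group_by(max_depth) |>
  summarise(mean_f1 = mean(f1), sd_f1 = sd(f1), .groups = "drop")

ggplot(summary_by_value, aes(x = factor(max_depth), y = mean_f1)) +
  geom_pointrange(aes(ymin = mean_f1 - sd_f1, ymax = mean_f1 + sd_f1)) +
  labs(x = "max_depth", y = "Mean F1 across draws")
```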


examine_hyperparameter_values.pdf

@HanzhangRen Do you have any thoughts on this or suggestions for anything else we might want to investigate regarding the hyperparameter grid and/or the way we choose the winners?

Note: I don't think we should do a grid expansion using the rules you applied in Monkeys, given that in these plots there isn't an upward trend in mean F1 score at either edge of any plot.
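For concreteness, here's a sketch of one common boundary-expansion rule (not necessarily the exact rule used in Monkeys): expand a hyperparameter's grid when its winning value sits at the grid edge. The grid and winning values below are hypothetical:

```r
# Hypothetical grid and winning setting, for illustration only.
grid <- list(
  max_depth = c(2, 4, 6, 8),
  eta       = c(0.01, 0.05, 0.1, 0.3)
)
winner <- list(max_depth = 8, eta = 0.05)

# Flag any hyperparameter whose winner lies on the boundary of its grid.
for (hp in names(grid)) {
  vals <- grid[[hp]]
  if (winner[[hp]] == min(vals) || winner[[hp]] == max(vals)) {
    message(sprintf("Winner for %s is at the grid edge; consider expanding.", hp))
  }
}
```

The plots suggesting no upward trend at the edges are exactly the evidence that this rule wouldn't trigger here.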

Note: I didn't set a seed before I generated the results (edit: I did run the line where the seed is set to 0 in training.R, but I was manually running various chunks, which probably messed up the RNG state). I think that explains why I got a winning F1 score of 0.766 whereas you got 0.773, and it makes me think our time-shift strategy made no difference whatsoever, since the gap between our scores appears to be seed noise rather than an effect of the time shift. Also, due to the different seed state, I didn't get stumps like you did.
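A quick sketch of the reproducibility point (the `set.seed(0)` line is the one mentioned above; the downstream step is illustrative):

```r
# Set the seed once at the top of training.R, before any code that consumes
# random numbers. Re-running chunks manually out of order advances the RNG
# state, so two runs of the "same" script can crown different winners.
set.seed(0)

# Everything downstream that draws random numbers depends on this state,
# e.g. fold assignment (illustrative):
# folds <- sample(rep(1:5, length.out = nrow(train)))
```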

HanzhangRen commented 3 weeks ago

I adjusted the grid a bit because, after I made some other changes, models with deep trees and small learning rates seemed more likely to win. It is not quite clear how much this adjustment helped in terms of the F1 score: the latest model version scores 0.7963717, and reverting to the old grid gives 0.7936770, a very small improvement of about 0.003. I'm not sure it's a real improvement.
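One way to gauge whether that 0.003 is signal or noise would be a paired comparison of the two grids across repeated draws. A sketch, where `f1_new` and `f1_old` are hypothetical vectors of per-draw F1 scores for the new and old grids, paired by draw:

```r
# Hypothetical per-draw F1 scores, one pair per draw.
# f1_new <- c(...); f1_old <- c(...)

diffs <- f1_new - f1_old
mean(diffs)  # observed improvement (~0.003 in our case)

# Paired t-test as a rough check of the improvement against draw-to-draw noise.
t.test(f1_new, f1_old, paired = TRUE)
```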