citp / fertility-prediction-challenge-2024

Fertility prediction challenge

Setting up cross-validation #4

Open HanzhangRen opened 4 months ago

HanzhangRen commented 4 months ago

In pull requests #1 and #2, I made sure that the codebase is working by running an empty job, following the PreFer team's instructions here.

Then, in pull request #3, I made some additional edits, mainly in training.R. Instead of fitting the logistic model directly with glm(), I now fit it through the caret package, which seems to be the most popular Google search result for doing cross-validation in R. We can now get a rough sense of how well our models are performing by looking at the 5-fold cross-validation F1 scores in model$resample.
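For reference, the setup looks roughly like this (a minimal sketch, not the exact code in training.R; `df`, `new_child`, and `age` are placeholder names for the cleaned training data, its outcome factor, and the predictor):

```r
library(caret)

# Sketch: 5-fold CV for a logistic model, with per-fold F1 in model$resample.
# prSummary needs the MLmetrics package installed.
set.seed(1)
ctrl <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,           # outcome levels must be valid R names, e.g. "no"/"yes"
  summaryFunction = prSummary  # reports AUC, Precision, Recall, and F (F1)
)

model <- train(
  new_child ~ age,
  data = df,
  method = "glm",
  family = binomial,
  metric = "F",
  trControl = ctrl
)

model$resample  # per-fold F1 ("F"), precision, and recall
```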

For now, the changes are rather useless for two reasons:

  1. At this moment, we are fitting only one model (logistic regression) and not doing any hyperparameter tuning, so we only have a CV score for one pipeline and nothing to compare it to. However, we will soon have more pipelines: caret supports hyperparameter grids as well as a wide range of models, and we can also customize our models if we want to.

  2. Currently, the model predicts childbirth using only the age variable. It is not a good algorithm: it predicts that nobody has kids. As a result, the F1 score comes out as NaN, and I have to replace it (along with the recall score) with 0 for the code to work. Models that actually work likely won't have this problem.

My next step, however, will probably be to translate the code I wrote from caret to tidymodels. It appears that caret is a little outdated, and the author of caret now works on tidymodels (did Matt mention tidymodels, or did I imagine it?). There are also other packages that do machine learning in R, like mlr3, but right now I'm leaning somewhat towards tidymodels because it is so much in tune with the tidyverse and because its precursor caret already looks quite good. @emilycantrell do you have a favorite package in mind?

emilycantrell commented 4 months ago

@HanzhangRen Thank you! I agree with the tidymodels choice for the reasons you outlined. I have not personally used it much, but I like that it works well with tidyverse, and I think I will likely also use it in my other project.

Handling temporally-shifted data in CV

We need to think about how to handle the temporally-shifted data in cross-validation. By "temporally-shifted data", I mean the data from prior years that we are shifting forward. Last year, my team and I put the true data and the temporally-shifted data all in one dataset, then conducted cross-validation with it. However, I think this might have given us misleading CV results, because temporally-shifted data was in our CV test folds. To more accurately estimate performance on 2021-2023 outcomes, I think our test folds should only contain true 2021-2023 outcomes. So, here's what I propose: build the CV folds from the true 2021-2023 data only, and append the temporally-shifted data to the training folds within each CV iteration, as sketched below.

In other words, the temporally-shifted data will be used in our training folds in CV, but not in our test folds. Then, of course, we will also use the temporally-shifted data in the train set for the actual submissions. Does that sound good to you?
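One way to implement this with rsample (a rough sketch; `true_df` and `shifted_df` are hypothetical names for the real 2021-2023 cases and the temporally-shifted cases) is to build the folds on the true data only and then add all shifted rows to each analysis set:

```r
library(rsample)
library(dplyr)

# Sketch: test folds come only from the true data; every training fold also
# gets all temporally-shifted rows.
set.seed(1)
base_folds   <- vfold_cv(true_df, v = 5)
combined     <- bind_rows(true_df, shifted_df)
shifted_rows <- nrow(true_df) + seq_len(nrow(shifted_df))

splits <- lapply(base_folds$splits, function(s) {
  make_splits(
    list(
      analysis   = c(s$in_id, shifted_rows),                 # true training fold + shifted data
      assessment = setdiff(seq_len(nrow(true_df)), s$in_id)  # true data only
    ),
    data = combined
  )
})

folds <- manual_rset(splits, ids = base_folds$id)
```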

Next steps

My next step is to see if the temporally-shifted data I created last year can be easily merged with the current version of the data. If so, we can use that for the time being, because I don't think I will have time to create new temporally-shifted data this week. If the existing temporally-shifted data can't be merged in easily, then we can just focus on building a model with the regular data for now, and add in temporally-shifted data for a future submission.

HanzhangRen commented 4 months ago

My next step, however, will probably be to translate the code I wrote from caret to tidymodels.

This is completed!
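For anyone reading along, the tidymodels version looks roughly like this (a sketch with the same placeholder names as before, not the exact code in training.R):

```r
library(tidymodels)

# Sketch: the same 5-fold CV logistic pipeline, expressed in tidymodels.
set.seed(1)
folds <- vfold_cv(df, v = 5)

rec <- recipe(new_child ~ age, data = df) %>%
  step_impute_mean(all_numeric_predictors())

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

res <- fit_resamples(
  wf,
  resamples = folds,
  metrics   = metric_set(f_meas, precision, recall)
)

collect_metrics(res)  # cross-validated F1, precision, and recall
```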

emilycantrell commented 4 months ago

Leakage in CV

It occurred to me that we might have leakage in our cross-validation, if people from the same household are split across folds. Lisa and Gert handled this correctly when making the holdout set (see the paragraph starting with "An important consideration in creating training and holdout data is how to deal with participants from the same household" in the paper).

I believe we have some people from the same household in the LISS panel, as this page says "It consists of 5,000 households, comprising approximately 7,500 individuals".

We can get the household ID from PreFer_train_background_data.csv; it is called "nohouse_encr". We could split into folds based on household and then assign each person to the fold in which their household was placed. However, this might raise complications: folds would be slightly different sizes because households are different sizes, which seems minor but could be an issue given our small sample size.
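If we stay with tidymodels, rsample's group_vfold_cv() should handle this directly; a sketch, assuming the household ID has already been joined onto the modeling data frame as nohouse_encr:

```r
library(rsample)

# Sketch: households are assigned to folds as a unit, so two members of the
# same household can never end up on opposite sides of a split.
# Fold sizes will differ slightly because households differ in size.
set.seed(1)
folds <- group_vfold_cv(df, group = nohouse_encr, v = 5)
```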

I can work on this after I finish with the time-shifting of the data.

Consider stratified sampling when making CV folds?

Another question: we could also consider stratifying on the outcome variable when we make CV folds. The vfold_cv function you used has a "strata" option that would be easy to implement, though we might end up not using that function if we assign CV folds by household rather than by person. I think this would be a nice-to-have but not a must-have.
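For the per-person version, this would just be the following (a sketch, with the outcome column name assumed to be new_child):

```r
# Sketch: person-level folds with roughly equal outcome rates in each fold.
set.seed(1)
folds <- vfold_cv(df, v = 5, strata = new_child)
```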

emilycantrell commented 3 months ago

Another possible source of leakage

Maybe: When we time-shift data, we should stratify it so that the people who are in the test CV fold don't have an earlier version of themselves in the training CV folds.

We think this is probably not an issue, but better safe than sorry.

Emily's edit after thinking on it: I'm still not convinced that this is a problem, and a downside to applying this rule is that we have to give up some of our data when sample size is already limited.

emilycantrell commented 3 months ago

The commit above fixes the household leakage issue.

HanzhangRen commented 3 months ago

Another possible source of leakage

Maybe: When we time-shift data, we should stratify it so that the people who are in the test CV fold don't have an earlier version of themselves in the training CV folds.

We think this is probably not an issue, but better safe than sorry.

Emily's edit after thinking on it: I'm still not convinced that this is a problem, and a downside to applying this rule is that we have to give up some of our data when sample size is already limited.

I implemented a piece of code that does this. The time shift added about 1852 rows to our original full training set of 987 (about 198 of these original people in each fold). Removing earlier versions of test fold people AS WELL AS THEIR PARTNERS from the corresponding training fold removes fewer than 200 people from each training fold. This means that we still have a net addition of 1600+ people to each training fold as a result of the time shift. It gave me a little peace of mind to know that we have done our best to reduce leakage, but I'm happy to test out how adding those people back in affects our prediction.
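The logic is roughly the following (a sketch, not the actual code; `time_shifted`, `nomem_encr`, and `partner_id` are hypothetical column names):

```r
library(dplyr)

# Sketch: within one CV iteration, drop time-shifted rows from the training
# fold when that person (or their partner) appears in the test fold.
test_people <- assessment_fold$nomem_encr

training_fold_clean <- training_fold %>%
  filter(!(time_shifted &
             (nomem_encr %in% test_people | partner_id %in% test_people)))
```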

In this process, I needed to use individual IDs to link the time-shifted data to current household ID information from PreFer_train_background_data.csv, so I reversed the code that appends "20182020" to time-shifted individual IDs. I will send along an updated version of outcome_time_shift.Rmd reflecting this change; feature_time_shift.R is also changed accordingly.

Also, now that both the original and time-shifted data need to be matched with household ID information from PreFer_train_background_data.csv, I moved the matching process from training.R to clean_df() in submission.R. In the original version of clean_df() provided by the organizing team, there was already an unused argument left open for the background data, so I thought we might as well use it.
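In other words, clean_df() now receives the background data through that argument and does the household-ID merge itself. A sketch of the shape of the change (the exact argument name and merge logic in submission.R may differ):

```r
# Sketch: clean_df() now attaches household IDs from the background data.
clean_df <- function(df, background_df = NULL) {
  if (!is.null(background_df)) {
    household_ids <- background_df %>%
      dplyr::select(nomem_encr, nohouse_encr) %>%
      dplyr::distinct(nomem_encr, .keep_all = TRUE)  # one household ID per person
    df <- dplyr::left_join(df, household_ids, by = "nomem_encr")
  }
  # ... existing cleaning steps ...
  df
}
```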

Additional note: @emilycantrell As I was going through your edits to training.R, it seems there may be some misunderstanding about what recipe() does, largely as a result of my confusing comment in the code. Simply calling recipe() and the steps that follow does not actually preprocess the data; it only sets up a specification of what preprocessing to do. The preprocessing happens only when the recipe object is used during hyperparameter grid tuning: the recipe takes the mean of the training fold and imputes missing values in both the training and test folds with that training-fold mean.
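To spell it out with a toy example (a sketch, with the same placeholder names as before): the recipe does nothing until it is prepped on a training fold and baked on new data.

```r
library(tidymodels)

# Sketch: a recipe is only a specification; prep() estimates the training-fold
# mean, and bake() applies that same mean to both folds.
rec <- recipe(new_child ~ age, data = df) %>%
  step_impute_mean(age)                                  # nothing is imputed at this point

prepped     <- prep(rec, training = analysis_fold)       # learns the training-fold mean
baked_train <- bake(prepped, new_data = NULL)            # training fold, imputed
baked_test  <- bake(prepped, new_data = assessment_fold) # test fold, imputed with the
                                                         # training-fold mean
```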

emilycantrell commented 3 months ago

That sounds great, thank you! Given that it turned out to be a relatively small amount of data lost, I am on board with the change you made.

emilycantrell commented 3 months ago

Also, thank you for the clarification about recipe, that makes sense.

HanzhangRen commented 1 week ago

In the code we submitted to the leaderboard last time, there are 635 people with missing household IDs. For about 152 of them, this is likely unproblematic: these 152 people are time-shifted individuals who are out of the age range in 2020. There will be no leakage across the same person over time, because they do not appear in the non-time-shifted data. It is also unlikely that there is leakage between partners, because these people are unlikely to have partners in the PreFer training set; otherwise those partners would have shown up in the background data, and their household IDs would have been available.

For the remaining 483 people with missing household IDs, the IDs are missing because these people did not start filling in the background questionnaires until after 2017 (though most filled out the core survey in earlier years). Because this line of code in our last model restricted the background data to 2017 or before, we could not match them with a household ID using the background data. This reveals a leakage problem that applies not just to those with NA household IDs, but also to people whose household IDs have changed over time. Using 2017 background data for time-shifted individuals can lead to leakage across the same person over the years (and, by extension, leakage across two partners whose data were collected in different time periods): two rows referring to the same person may have different household IDs (or, if there was no 2017 data, one of the household IDs is NA, with the same consequences). Splitting our data on household IDs therefore cannot ensure that individuals (and households) do not straddle the training and test folds.

I have attempted to fix this issue by using 2020 household IDs for both the time-shifted and non-time-shifted data. This is not a perfect fix that replicates how Lisa split the registry data, but it is something I can do without significant code changes.
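Concretely, the household ID now comes from each person's latest 2020 wave of PreFer_train_background_data.csv. A sketch (assuming the background data's wave column is in YYYYMM format, which is an assumption about the file layout rather than something checked here):

```r
library(dplyr)

# Sketch: take each person's household ID from their latest 2020 background wave.
household_2020 <- background_df %>%
  filter(wave >= 202001, wave <= 202012) %>%
  arrange(nomem_encr, desc(wave)) %>%
  distinct(nomem_encr, .keep_all = TRUE) %>%
  select(nomem_encr, nohouse_encr)

df <- left_join(df, household_2020, by = "nomem_encr")
```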

In any case, the impact of this on the F1 score appears to be small. The current F1 score is 0.7963717; without the edits, it is 0.7909352. If the leakage truly mattered, I would have expected the second number to be larger than the first.