citp / fertility-prediction-challenge-2024

Fertility prediction challenge

Consider better imputation methods or alternatives to imputation #28

Open emilycantrell opened 2 months ago

HanzhangRen commented 3 weeks ago

As @emilycantrell pointed out for the CBS code, xgboost handles missing data on its own. I got rid of mean imputation, and performance seems to have improved. The F-1 score for the latest model version is 0.7963717; if I bring back mean imputation, the F-1 score drops to 0.7616652. That is an improvement of about 0.035, which seems quite substantial (though luck may play a role here).
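For reference, here is a minimal sketch of that kind of comparison, assuming a pandas feature matrix `X` with NaNs for missing values and a binary outcome `y`. The names and hyperparameters are illustrative, not the actual challenge pipeline:

```python
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier


def compare_imputation(X: pd.DataFrame, y: pd.Series) -> dict:
    """Compare xgboost's native missing-value handling to mean imputation.

    X may contain NaNs; y is a binary outcome. Hyperparameters are
    placeholders, not the settings used in the challenge code.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

    # Option A: leave NaNs in place; xgboost learns a default direction
    # for missing values at every split.
    native = XGBClassifier(n_estimators=500, learning_rate=0.05, random_state=0)
    native.fit(X_tr, y_tr)

    # Option B: mean-impute before fitting (means computed on training data only).
    means = X_tr.mean()
    imputed = XGBClassifier(n_estimators=500, learning_rate=0.05, random_state=0)
    imputed.fit(X_tr.fillna(means), y_tr)

    return {
        "f1_native": f1_score(y_te, native.predict(X_te)),
        "f1_mean_imputed": f1_score(y_te, imputed.predict(X_te.fillna(means))),
    }
```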

This also brings up another question. Recall that our last leaderboard model had an F-1 score of 0.782 (I was able to reproduce this score exactly, and then got a very slightly lower score of 0.7802637 after adjusting for the overrepresentation of partnered people and reducing missingness in household IDs). If getting rid of mean imputation helps so much, shouldn't removing it from that model also push its F-1 score above 0.8?

Not really. If I take mean imputation away from the model with F-1 = 0.7802637, the F-1 score becomes 0.7847916. That is an increase of only about 0.005, much smaller than the 0.035 increase we observed for the latest version of the code.

Why does there seem to be such a big difference in how much removing mean imputation helped? Much of this might just be random noise, but I have another theory about what might be happening in the background. Suppose a tree splits the data on a variable x, where low values of x correspond to a low probability of having kids and medium or high values correspond to a high probability. The problem with mean imputation is that people with missing data may actually have a low probability of having kids, yet they are imputed to a medium value of x and are therefore sorted by the tree into the high-probability group. The algorithm would have to make one or two more splits on x, or on a variable with a missingness pattern similar to x, to isolate the group of people with missing data.

It may be easier for these additional splits to happen in the old code than in the new code. Between the two versions, I got rid of quite a few variables that may have similar patterns of missingness. Since xgboost randomly selects a sample of features for every split, trees in the old code may have been more likely to have a variable similar to x available for splitting. This would mean that mean imputation had less of a negative impact in the old code than in the new code.
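To make the mechanism concrete, here is a toy simulation. Every variable name, probability, and hyperparameter is invented for illustration; this is not the challenge data or model:

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 20_000

# Toy world: low x means a low probability of having kids, medium/high x a high one,
# and people with missing x behave like the low-x group (all numbers are invented).
x = rng.normal(size=n)
missing = rng.random(n) < 0.3
p = np.where(x < -0.5, 0.1, 0.6)
p[missing] = 0.1
y = (rng.random(n) < p).astype(int)

X_native = pd.DataFrame({"x": np.where(missing, np.nan, x)})
X_imputed = X_native.fillna(X_native.mean())

# Shallow models so the effect of the first few splits is easy to see.
native = XGBClassifier(n_estimators=20, max_depth=2, random_state=0)
native.fit(X_native, y)
imputed = XGBClassifier(n_estimators=20, max_depth=2, random_state=0)
imputed.fit(X_imputed, y)

# How does each model score the people with missing x? Native handling can route
# them down the low-probability branch at the very first split; mean imputation
# places them at a medium x, so isolating them requires extra splits around the
# imputed value.
print("native,  mean P(kids) for missing group:",
      native.predict_proba(X_native[missing])[:, 1].mean())
print("imputed, mean P(kids) for missing group:",
      imputed.predict_proba(X_imputed[missing])[:, 1].mean())
print("true rate in missing group:", y[missing].mean())
```

How far apart the two printed scores end up depends on depth and the number of rounds: with enough splits the imputed model can eventually isolate the point mass at the imputed value, which is exactly the "one or two more splits" described above.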