emilycantrell / stork_oracle_cbs

Plan for handling missingness #2

Open emilycantrell opened 1 month ago

emilycantrell commented 1 month ago

Proposal to leave missing values as missing (at least for now)

I'm working on code for XGBoost and CatBoost models, and am deciding how to handle missingness. I am guessing CatBoost will be the primary model type we use, so that's what I'm thinking about most right now, although of course our model choice may change depending on how the models perform.

After reading about missingness in tree-based algorithms and thinking about the types of missingness in the CBS data, I'm inclined to leave the missing values as missing, with no imputation or other special handling. Here are notes about what led me to this plan:

*Caveat: This is my impression from other CBS files I've used, but I haven't seen the real data files for PreFer yet. If the missingness looks different than I expect, we can re-evaluate.
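
To make the idea of "leaving missing values as missing" concrete, here is a minimal pure-Python sketch of how a single tree split can route missing values without imputation. It is an illustration only, not the actual XGBoost or CatBoost implementation: XGBoost learns a default direction for missing values at each split, which is what this toy mimics, while CatBoost by default treats missing numeric values as smaller than all observed values (`nan_mode="Min"`). The function name `best_split_with_missing` and the toy data are hypothetical.

```python
import math


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_with_missing(x, y):
    """Return (threshold, missing_goes_left, loss) for one numeric feature.

    Missing values (NaN) are never imputed. Each candidate threshold is
    scored twice: once with the missing rows sent to the left child and
    once with them sent to the right child. The lower-loss option wins.
    """
    present = [(xi, yi) for xi, yi in zip(x, y) if not math.isnan(xi)]
    missing_y = [yi for xi, yi in zip(x, y) if math.isnan(xi)]
    best = None
    for t in sorted({xi for xi, _ in present}):
        left = [yi for xi, yi in present if xi <= t]
        right = [yi for xi, yi in present if xi > t]
        for missing_left in (True, False):
            l_side = left + missing_y if missing_left else left
            r_side = right if missing_left else right + missing_y
            loss = sse(l_side) + sse(r_side)
            if best is None or loss < best[2]:
                best = (t, missing_left, loss)
    return best


# Hypothetical toy data: the rows with a missing feature value have
# outcomes that resemble the high-x group.
x = [1.0, 2.0, math.nan, 8.0, 9.0, math.nan]
y = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]

threshold, missing_left, loss = best_split_with_missing(x, y)
print(threshold, missing_left, loss)
```

In this toy example the split sends the missing rows to whichever side fits the outcome best, so missingness itself can carry signal. That is the main reason imputation may be unnecessary for these model types, assuming the libraries behave on the real data as described above.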

Question for the team

Does anyone have feedback on this proposal?

Blog posts

Here are a couple of relevant blog posts that test the impact of imputation in tree-based algorithms. Of course, results could be different with a different tree-based algorithm, different data, or different imputation method, but these articles might still be useful to consider:

msalganik commented 1 month ago

Thanks for the careful write-up @emilycantrell. After reading more about it, I agree. It seems like XGBoost and CatBoost deal with missing data in a sensible way.

P.S. If we try a regularized regression, then we are going to need to reconsider.