Missing data - Githubissues

Hey everyone,

I am currently deciding how to deal with missing data. If a variable is sufficiently sparse (say, more than 15% null) and isn't correlated with the target variable, then I think it's best to delete that variable.
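A rough sketch of that rule in pandas, for illustration only. The function name, the toy columns, and the 0.1 correlation cutoff are all assumptions; only the 15% null threshold comes from the rule above.

```python
import numpy as np
import pandas as pd

def sparse_uncorrelated_columns(df, target, max_null_frac=0.15, min_abs_corr=0.1):
    """List numeric columns that are both sparse and weakly correlated with target.

    min_abs_corr is an assumed cutoff, not a recommended value.
    """
    null_frac = df.isna().mean()  # fraction of nulls per column
    # Pearson correlation with the target, using only rows where both are present
    corr = df.select_dtypes(include="number").corrwith(df[target]).abs()
    to_drop = []
    for col in corr.index:
        if col == target:
            continue
        # A NaN correlation (e.g. a constant column) is treated as "not correlated"
        if null_frac[col] > max_null_frac and not (corr[col] >= min_abs_corr):
            to_drop.append(col)
    return to_drop

# Hypothetical toy data
nan = np.nan
example = pd.DataFrame({
    "target":        [1.0, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "dense":         [2.0, 4, 6, 8, 10, 12, 14, 16, 18, 20],
    "sparse_noise":  [1.0, nan, -1.0, nan, -1.0, nan, 1.0, nan, nan, nan],
    "sparse_signal": [nan, nan, 3.0, 4, 5, 6, 7, 8, 9, 10],
})
candidates = sparse_uncorrelated_columns(example, "target")
print(candidates)
```

Here only `sparse_noise` is flagged: `sparse_signal` is also above the null threshold but tracks the target closely, so it is kept.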

For variables that are better populated than that, the options are:

1. Delete the variable
2. Delete the observation (row)
3. Fill in the missing data
4. Leave the data as missing
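For concreteness, the four options might look like this in pandas. This is a minimal sketch on a hypothetical toy frame; the column names `a`, `b`, `y`, the median fill, and the indicator column are assumptions, and each is just one of several ways to carry out the option.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "a" has one missing value
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
    "y": [0, 1, 0, 1],
})

# Delete the variable
drop_variable = df.drop(columns=["a"])

# Delete the observation (row) where "a" is missing
drop_rows = df.dropna(subset=["a"])

# Fill in the missing data (median is one assumed choice of fill value)
filled = df.fillna({"a": df["a"].median()})

# Leave the data as missing, but record where it was missing
with_flag = df.assign(a_missing=df["a"].isna())
```

Leaving values missing only works with models that accept NaN directly; the indicator column is an optional extra that lets a model use the fact that the value was absent.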

If the variable is highly correlated with another variable, and it makes intuitive sense that the two are related, then option 1 (deleting the variable) seems best, as with the garage variables. I'm struggling to decide when it's best to use option 2, 3 or 4 (deleting the row, filling in the data, or leaving it missing). What factors influence the decision?

Also, what happens if there is missing data in the test data (which there is)?
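A common convention, sketched here under assumptions rather than offered as the answer, is to compute fill values on the training data only and reuse them on the test data, since test rows usually cannot be deleted. The frames and column names below are hypothetical, and the median is an assumed choice.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test frames with gaps in both
train = pd.DataFrame({"a": [1.0, np.nan, 3.0, 5.0], "b": [2.0, 4.0, np.nan, 6.0]})
test = pd.DataFrame({"a": [np.nan, 2.0], "b": [np.nan, np.nan]})

# Learn the fill values from the training data only
fill_values = train.median()

# Apply the same values to both sets, matched by column name
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
```

Reusing the training medians keeps information from the test set out of preprocessing, and it still works when a test column is entirely null, as `b` is here.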

Cheers, Daniel

BenChehade / datasciences

Missing data #5