Hey everyone,
I am currently deciding how to deal with missing data. If a variable is sufficiently sparse (say, more than 15% null) and isn't correlated with the target variable, then I think it's best to delete that variable.
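To make that rule concrete, this is roughly what I have in mind, as a sketch in pandas. The function name, the 0.1 correlation cutoff and the column names are placeholders I made up for illustration; only the 15% figure comes from the rule above. It also assumes the target is numeric and only checks numeric columns, since a plain Pearson correlation is not defined for the others.

```python
import numpy as np
import pandas as pd


def sparse_uncorrelated_columns(df, target, max_null_frac=0.15, min_abs_corr=0.1):
    """List columns that are both sparse and weakly correlated with the target.

    max_null_frac is the 15% rule of thumb; min_abs_corr is an arbitrary
    placeholder cutoff, not a recommended value.
    """
    # Fraction of missing values per column.
    null_frac = df.isna().mean()
    # Absolute Pearson correlation with the target, numeric columns only.
    # pandas computes each pair on the rows where both values are present.
    abs_corr = df.select_dtypes(include="number").corr()[target].abs()

    to_drop = []
    for col in df.columns:
        if col == target or null_frac[col] <= max_null_frac:
            continue  # well populated, so not a candidate under this rule
        # Non-numeric columns are not in abs_corr and are left alone here.
        if col in abs_corr.index and abs_corr[col] < min_abs_corr:
            to_drop.append(col)
    return to_drop


# Tiny made-up frame: one sparse and useless column, one sparse but useful
# column, and one fully populated column.
nan = np.nan
example = pd.DataFrame({
    "target": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "sparse_noise": [5, 1, nan, nan, nan, nan, nan, nan, 1, 5],
    "sparse_useful": [1, 2, 3, 4, 5, 6, nan, nan, nan, nan],
    "dense_noise": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})
candidates = sparse_uncorrelated_columns(example, "target")
print(candidates)
```

Only the column that is both sparse and uncorrelated gets flagged; the sparse but useful one is kept, which is the case I care about.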
For variables that are better populated than that, the options are:
1. Delete the variable
2. Delete the observation (row)
3. Fill in the missing data
4. Leave the data as missing
If the variable is highly correlated with another variable, and it makes intuitive sense that the two are related, then option 1 seems best (as with the garage variables).
I'm struggling to decide when it's best to use option 2, 3 or 4. What factors influence the decision?
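For reference, this is how I understand the four options would look in pandas. It is only a sketch on a made-up frame: the column names and values are hypothetical, the median and mode fills are just one possible choice for option 3, and the extra indicator column under option 4 is my own assumption about how "leave as missing" might be used in practice.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one numeric and one categorical column that have gaps.
df = pd.DataFrame({
    "area": [50.0, np.nan, 70.0, 80.0],
    "quality": ["good", "bad", None, "good"],
    "price": [100, 150, 170, 210],
})

# Option 1: delete the variable.
dropped_variable = df.drop(columns=["area"])

# Option 2: delete the observations (rows) where the variable is missing.
dropped_rows = df.dropna(subset=["area"])

# Option 3: fill in the missing data (median for numeric, mode for categorical).
filled = df.copy()
filled["area"] = filled["area"].fillna(filled["area"].median())
filled["quality"] = filled["quality"].fillna(filled["quality"].mode()[0])

# Option 4: leave the values as missing, and record where they were missing.
flagged = df.copy()
flagged["area_missing"] = flagged["area"].isna()
```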
Also, what happens when there is missing data in the test data (which there is)?
Cheers Daniel