BenChehade / datasciences

attempt at data science competitions - mostly kaggle
MIT License

Missing data #5

Open DataMonsterBoy opened 7 years ago

DataMonsterBoy commented 7 years ago

Hey everyone,

I am currently deciding how to deal with missing data. If a variable is sufficiently sparse (say, more than 15% null) and isn't correlated with the target variable, then I think it's best to delete that variable.
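As a rough sketch of that rule of thumb in pandas: the function name, the toy columns, and the `min_abs_corr` cut-off of 0.1 are all assumptions made for illustration, and only the 15% figure comes from the post. It only looks at numeric columns and uses plain Pearson correlation, which may not be the right measure for every variable.

```python
import numpy as np
import pandas as pd


def sparse_uncorrelated_columns(df, target, max_null_frac=0.15, min_abs_corr=0.1):
    """List numeric columns that are sparse AND weakly correlated with the target.

    max_null_frac is the 15% threshold from the post; min_abs_corr is an
    assumed cut-off for what counts as "not correlated".
    """
    null_frac = df.isna().mean()  # fraction of nulls per column
    numeric = df.select_dtypes(include="number").columns.drop(target, errors="ignore")
    # Pearson correlation with the target; rows where either value is null are skipped
    abs_corr = df[numeric].corrwith(df[target]).abs()

    to_drop = []
    for col in numeric:
        is_sparse = null_frac[col] > max_null_frac
        # A NaN correlation (e.g. a constant column) is treated as "not correlated"
        is_correlated = bool(abs_corr[col] >= min_abs_corr)
        if is_sparse and not is_correlated:
            to_drop.append(col)
    return to_drop


# Hypothetical toy data, not taken from the competition
example = pd.DataFrame({
    "target": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "sparse_noise": [1, np.nan, -1, np.nan, -1, np.nan, 1, np.nan, np.nan, np.nan],
    "sparse_signal": [2, np.nan, 6, np.nan, 10, np.nan, 14, np.nan, 18, 20],
    "dense": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
})

dropped = sparse_uncorrelated_columns(example, "target")
print(dropped)
```

Here `sparse_signal` is just as sparse as `sparse_noise` but survives because it tracks the target, and `dense` is never considered because it has no nulls.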

For variables that are better populated than that, the options are:

  1. Delete the variable
  2. Delete the observation (row)
  3. Fill in the missing data
  4. Leave the data as missing
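For reference, the four options map roughly onto the following pandas calls. This is only a sketch: the column names and values are made up, the median is just one possible fill value, and the missing-value flag under option 4 is an optional extra rather than part of the list above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps
df = pd.DataFrame({
    "lot_area": [8450.0, 9600.0, np.nan, 9550.0],
    "garage_cars": [2.0, np.nan, 2.0, 3.0],
    "price": [208500, 181500, 223500, 140000],
})

# 1. Delete the variable (column)
dropped_column = df.drop(columns=["garage_cars"])

# 2. Delete the observations (rows) that have a null in the chosen columns
dropped_rows = df.dropna(subset=["lot_area", "garage_cars"])

# 3. Fill in the missing data, here with each column's median (one choice of many)
filled = df.fillna(df.median(numeric_only=True))

# 4. Leave the data as missing; optionally add a flag so a model can see the gap
flagged = df.assign(garage_cars_missing=df["garage_cars"].isna())

print(dropped_rows.shape, filled.isna().sum().sum(), flagged["garage_cars_missing"].sum())
```

Option 4 only works with models that accept NaN inputs directly, so the choice of model is one of the factors that affects which option is available.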

If the variable is highly correlated with another variable, and it makes intuitive sense that the two are related, then option 1 seems best (as with the garage variables). I'm struggling to decide when it's best to choose option 2, 3, or 4. What factors influence the decision?
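One way to spot candidates for option 1 is to list pairs of predictors whose absolute correlation exceeds some threshold. The sketch below assumes numeric columns only; the 0.8 threshold, the function name, and the toy garage columns are illustrative assumptions, and the "intuitive sense" check still has to be done by eye.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, |corr|) for numeric column pairs at or above threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, no self-pairs
            value = corr.iloc[i, j]
            if value >= threshold:  # a NaN correlation fails this check and is skipped
                pairs.append((cols[i], cols[j], float(value)))
    return pairs


# Hypothetical toy data: two related garage variables and one unrelated column
houses = pd.DataFrame({
    "garage_cars": [1, 2, 2, 3, np.nan],
    "garage_area": [250, 480, 500, 700, np.nan],
    "year_sold": [2008, 2006, 2010, 2007, 2009],
})

pairs = correlated_pairs(houses)
print(pairs)
```

If one column of a flagged pair has far more nulls than the other, dropping the sparser one loses little information.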

Also, what happens if there is missing data in the test data (which there is)?
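A common pattern for the test set, sketched here under the assumption that median filling (option 3) was chosen, is to compute the fill values from the training data only and reuse them on the test data. Deleting test rows is usually not possible when a prediction is required for every one of them. The column name and values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical train and test frames with gaps in the same column
train = pd.DataFrame({"lot_area": [8000.0, np.nan, 10000.0, 12000.0]})
test = pd.DataFrame({"lot_area": [np.nan, 9000.0]})

# Learn the fill values from the training data only
train_medians = train.median(numeric_only=True)

# Apply the same values to both sets so the model sees consistent inputs
train_filled = train.fillna(train_medians)
test_filled = test.fillna(train_medians)

print(test_filled)
```

Using the training medians for both sets keeps the test data from influencing the preprocessing.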

Cheers, Daniel