developmentseed / geospatial-ds-cholera-lab

A repo dedicated to developing a geospatial data science prototype (see issue: https://github.com/developmentseed/labs/issues/292)

Identify missing observations and imputation strategy #26

Closed kathrynberger closed 1 year ago

kathrynberger commented 1 year ago

SMOTE treatment for imbalanced datasets does not accept missing values (NaNs) in feature columns. So we'll first have to identify the extent of missing values for each of the 3 key feature variables (LST, precip, sm). Once we know the extent (how many values are missing and how often), we can implement an appropriate imputation strategy for the missing data. The same strategy will be applied to all trailing month variables (e.g., lst_1, lst_2, etc.).
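A minimal sketch of the "identify the extent" step, written in plain Python so it runs without dependencies. The records and their values are made up for illustration, and `is_missing` and `missing_summary` are hypothetical helper names; only the column names (lst, lst_1, precip, sm) come from this issue. In a pandas workflow the equivalent count would typically be `df[cols].isna().sum()`.

```python
import math

# Hypothetical records: column names follow the issue, values are invented.
records = [
    {"lst": 301.2, "lst_1": 300.8, "precip": 12.0, "sm": 0.21},
    {"lst": math.nan, "lst_1": 301.2, "precip": 3.5, "sm": math.nan},
    {"lst": 299.4, "lst_1": math.nan, "precip": math.nan, "sm": math.nan},
    {"lst": 298.9, "lst_1": 299.4, "precip": 7.1, "sm": 0.18},
]


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_summary(rows, columns):
    """Return {column: (missing_count, missing_fraction)} for each column."""
    total = len(rows)
    summary = {}
    for col in columns:
        count = sum(1 for row in rows if is_missing(row.get(col)))
        summary[col] = (count, count / total if total else 0.0)
    return summary


summary = missing_summary(records, ["lst", "lst_1", "precip", "sm"])
for col, (count, fraction) in summary.items():
    print(f"{col}: {count} missing ({fraction:.0%})")
```

Running the same summary over the trailing month columns (lst_1, lst_2, and so on) shows whether they share the gaps of the base variable.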

kathrynberger commented 1 year ago

While imputing observations could make sense in some cases, there were some larger time chunks with missing data (e.g., soil moisture) for which standard forward-fill or backfill imputation did not make sense.
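One way to make "larger time chunks" concrete is to measure the longest run of consecutive missing values in a time-ordered series. The sketch below is an assumption about how that check could look, not the code used in this repo; `longest_missing_run` is a hypothetical helper and the soil moisture values are invented.

```python
import math


def longest_missing_run(values):
    """Length of the longest consecutive run of None/NaN in an ordered series."""
    longest = 0
    current = 0
    for value in values:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# Hypothetical monthly soil moisture series for one location, oldest first.
sm_series = [0.21, math.nan, math.nan, math.nan, math.nan, 0.19, math.nan, 0.20]
gap = longest_missing_run(sm_series)
print(f"longest gap: {gap} consecutive months")
```

A forward fill across a long gap would repeat one stale value for every month in the run, which is why long gaps argue against that approach.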

Furthermore, the literature (e.g., Campbell et al., 2020, which inspired this work) supported keeping only those records for which all environmental variables were available. This is the methodology that was used in our scenario as well, so imputing missing data was no longer required.
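The complete-case approach described above can be sketched as follows. This is an illustration under assumptions, not the repo's actual code: `complete_cases` is a hypothetical helper, the records are invented, and only the column names come from this issue. With pandas, the usual equivalent is `df.dropna(subset=cols)`.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def complete_cases(rows, columns):
    """Keep only the rows where every listed column has a value."""
    return [
        row for row in rows
        if not any(is_missing(row.get(col)) for col in columns)
    ]


# Hypothetical records: column names follow the issue, values are invented.
records = [
    {"lst": 301.2, "precip": 12.0, "sm": 0.21},
    {"lst": math.nan, "precip": 3.5, "sm": 0.20},
    {"lst": 299.4, "precip": 4.2, "sm": math.nan},
    {"lst": 298.9, "precip": 7.1, "sm": 0.18},
]

kept = complete_cases(records, ["lst", "precip", "sm"])
print(f"kept {len(kept)} of {len(records)} records")
```

The filtered records contain no NaNs in the listed columns, so they can be passed to SMOTE without an imputation step.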