ECMWFCode4Earth / ml_drought

Machine learning to better predict and understand drought. Moving github.com/ml-clim
https://ml-clim.github.io/drought-prediction/

Review - Feature importance #94

Open jwagemann opened 5 years ago

jwagemann commented 5 years ago
tommylees112 commented 5 years ago

Could you please comment on feature importance? What did you learn from them? Were they all needed? How do you establish feature importance? How is the subsequent work affected by this?

We are using SHAP values to calculate feature importance. SHAP values can be used to understand what drove a model to make certain predictions. These values operate at the local level, i.e. they tell us why the model predicted, say, a VHI score of 99 for a specific pixel. Global feature importances can then be derived by aggregating these local explanations.
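To illustrate the local-to-global aggregation step, a common convention (the one the `shap` library's summary plots use) is to average the absolute SHAP value of each feature across predictions. The feature names and numbers below are purely hypothetical, not taken from our models:

```python
# Hypothetical local SHAP values for three predictions over three
# features -- illustrative numbers only, not real model output.
local_shap = [
    {"precip": -0.8, "soil_moisture": 0.3, "temp": 0.1},
    {"precip": 0.5, "soil_moisture": -0.6, "temp": 0.05},
    {"precip": -0.9, "soil_moisture": 0.2, "temp": -0.15},
]

def global_importance(local_values):
    """Aggregate local explanations into a global ranking
    via the mean absolute SHAP value per feature."""
    features = local_values[0].keys()
    return {
        f: sum(abs(row[f]) for row in local_values) / len(local_values)
        for f in features
    }

ranking = global_importance(local_shap)
# "precip" ranks highest here: mean(|{-0.8, 0.5, -0.9}|) = 2.2 / 3
```

Note that taking the absolute value matters: a feature that strongly pushes some predictions up and others down would average to near zero under a signed mean, hiding its importance.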

The subsequent work has not yet been affected by this. We are currently using these models to interpret relationships and to understand which variables are predictive of agricultural drought (VHI). However, the learned relationships could also drive feature selection: to speed up model training and development, future models might include only the most important features.

Have you explored spatial correlations among variables?

In order to account for spatial correlations we have the option to append the values of surrounding pixels to the X data (the covariates). This should capture some of the spatial co-variability, and we can increase the number of surrounding pixels that are incorporated into the model. However, capturing more of the spatial relationship comes at a cost: the number of input variables grows as we increase the `surrounding_pixels: int` argument.

We therefore currently limit `surrounding_pixels` to 1, meaning that for each target pixel we include every variable from the surrounding 3 × 3 neighbourhood (9 pixels, including the target itself). This is a necessary compromise given computational constraints.
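The growth in feature count can be sketched as follows. This is a hypothetical helper for illustration, not the repository's API: with `surrounding_pixels = k`, the neighbourhood is (2k + 1)² pixels, and every variable is duplicated for each pixel in it.

```python
def n_features(n_variables: int, surrounding_pixels: int) -> int:
    """Number of input features when each variable is duplicated for
    every pixel in the (2k+1) x (2k+1) neighbourhood, k = surrounding_pixels."""
    neighbourhood_size = (2 * surrounding_pixels + 1) ** 2
    return n_variables * neighbourhood_size

n_features(5, 0)  # 5 variables, target pixel only -> 5 features
n_features(5, 1)  # 3x3 neighbourhood (9 pixels)   -> 45 features
n_features(5, 2)  # 5x5 neighbourhood (25 pixels)  -> 125 features
```

The quadratic growth in k is why we stop at `surrounding_pixels=1`: each extra ring of context multiplies the feature count, and with it training time and memory.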

However, it is worth noting that the spatial relationships among these 9 input points are not explicitly communicated to the model; this is something we are considering for future iterations, perhaps using CNNs.

We are also looking to explore how feature importance varies over space; we have already explored how it varies over time using SHAP values.
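One way the spatial analysis could work, sketched under the assumption that we hold per-pixel local SHAP values: average the absolute SHAP value per (pixel, feature) pair, giving a map of where each feature matters. The pixel ids, feature names, and values below are hypothetical:

```python
# Hypothetical records of (pixel_id, feature, local SHAP value).
records = [
    ("px_0", "precip", -0.8), ("px_0", "precip", -0.6),
    ("px_0", "temp", 0.1),
    ("px_1", "precip", 0.4), ("px_1", "temp", -0.3),
]

def importance_by_pixel(recs):
    """Mean absolute SHAP value per (pixel, feature) pair --
    i.e. a spatial map of feature importance."""
    sums, counts = {}, {}
    for pixel, feature, value in recs:
        key = (pixel, feature)
        sums[key] = sums.get(key, 0.0) + abs(value)
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}

spatial = importance_by_pixel(records)
# e.g. spatial[("px_0", "precip")] = (0.8 + 0.6) / 2 = 0.7
```

Reshaped onto the lat/lon grid, each feature's per-pixel means could then be plotted as an importance map, complementing the temporal analysis we have already done.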