Open emmamendelsohn opened 3 weeks ago
So by specifying area in the interaction constraints, we are forcing xgboost to either split on area alone or to split on a mix of the other explanatory variables. That then means that the influence of all the other variables is independent of area, right? That seems kind of cool.
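A minimal sketch of why that independence follows, using two hand-written trees in Python as stand-ins for the fitted ensemble (the project code is in R; the trees, their split values, and the variables `rainfall` and `ndvi` are hypothetical). It assumes a logistic objective, in which case the tree outputs add up on the log-odds scale, so the effect of `area` is constant on that scale but not on the probability scale. It also shows that the `area` trees are free to be non-monotone.

```python
import math

# Hypothetical trees: each one splits on `area` alone or on the other
# variables alone, which is what the interaction constraint allows.
def area_tree(row):
    # Step function of area; nothing forces it to be monotone.
    if row["area"] < 10.0:
        return -0.4
    if row["area"] < 50.0:
        return 0.3
    return 0.1

def other_tree(row):
    # Splits on rainfall, then ndvi; these two may interact with each other.
    if row["rainfall"] < 100.0:
        return -1.0 if row["ndvi"] < 0.5 else -0.2
    return 0.6

def margin(row):
    """Sum of tree outputs (log-odds scale under a logistic objective)."""
    return area_tree(row) + other_tree(row)

def probability(row):
    """Sigmoid of the margin."""
    return 1.0 / (1.0 + math.exp(-margin(row)))

def area_effect(row, area_from, area_to, predict):
    """Change in prediction when only `area` changes."""
    before = predict({**row, "area": area_from})
    after = predict({**row, "area": area_to})
    return after - before

dry = {"area": 0.0, "rainfall": 50.0, "ndvi": 0.3}
wet = {"area": 0.0, "rainfall": 200.0, "ndvi": 0.7}

# Same area change, two different settings of the other variables.
margin_effects = [area_effect(r, 5.0, 20.0, margin) for r in (dry, wet)]
prob_effects = [area_effect(r, 5.0, 20.0, probability) for r in (dry, wet)]

print("effect on log-odds:", margin_effects)
print("effect on probability:", prob_effects)
```

The two log-odds effects agree (up to floating-point rounding) while the two probability effects differ, so any check for "independence from area" is best done on the log-odds scale.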
Current status (2024-06-28): we have a workflow for model splitting and fitting using `tidymodels`. There is some commented-out code to create Ceteris Paribus profiles (https://github.com/ecohealthalliance/open-rvfcast/blob/feature/outbreak-layer/_targets.R#L575-L623). I think this code is working.

We still need to set up a target that selects the best parameters after cross-validation. This should be doable through `tidymodels`. Then we need to fit the final version of the model.

Something to look into: it's unclear whether `tidymodels` feeds the interaction constraints into the `xgboost` call (https://github.com/ecohealthalliance/open-rvfcast/blob/feature/outbreak-layer/R/model_specs.R#L25). You can potentially check this by extracting the model object from `tidymodels` and inspecting it. Otherwise you can look at the Ceteris Paribus plots: the lines should be fully parallel for the variable `area`, which is the variable that has the constraint on it. If the constraint is not working as expected, you may need to lift the workflow out of `tidymodels`.

As a conceptual note, we're including the interaction constraint to prevent `area` from interacting with other variables, as a way to normalize results to polygon area size. TBH, I'm struggling with the logic behind this. To me, it seems like splitting on `area` still enforces the relationship that greater area -> greater outbreak probability? Or perhaps the idea is that, because the `area` splits are independent of the other variables, the model basically generates predictions for every "level" (as defined by the splits) of `area`?
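The parallel-lines check on the Ceteris Paribus profiles can also be done numerically rather than by eye. Below is a rough sketch in Python, for illustration only since the project code is in R. `profiles_parallel` is a hypothetical helper, and `constrained` and `unconstrained` are toy prediction functions standing in for the fitted model. With the real model, the predictions would come from the extracted `xgboost` object, ideally on the log-odds scale if the objective is logistic, because profiles are only expected to be exactly parallel on that scale.

```python
def cp_profile(predict, row, feature, grid):
    """Predictions for one observation as `feature` moves over `grid`."""
    return [predict({**row, feature: value}) for value in grid]

def profiles_parallel(predict, rows, feature, grid, tol=1e-9):
    """True if every profile differs from the first by a constant offset."""
    reference = cp_profile(predict, rows[0], feature, grid)
    for row in rows[1:]:
        profile = cp_profile(predict, row, feature, grid)
        offsets = [p - r for p, r in zip(profile, reference)]
        # Parallel lines have the same vertical offset at every grid point.
        if max(offsets) - min(offsets) > tol:
            return False
    return True

def constrained(row):
    # Toy model: area contributes on its own, so profiles are parallel.
    area_part = 0.3 if row["area"] >= 10.0 else -0.4
    other_part = 0.6 if row["rainfall"] >= 100.0 else -1.0
    return area_part + other_part

def unconstrained(row):
    # Toy model: the area effect depends on rainfall (an interaction).
    if row["rainfall"] >= 100.0:
        return 0.9 if row["area"] >= 10.0 else 0.2
    return -1.0

rows = [
    {"area": 5.0, "rainfall": 50.0},
    {"area": 5.0, "rainfall": 200.0},
]
grid = [1.0, 5.0, 10.0, 50.0, 100.0]

constrained_ok = profiles_parallel(constrained, rows, "area", grid)
unconstrained_ok = profiles_parallel(unconstrained, rows, "area", grid)
print(constrained_ok, unconstrained_ok)
```

If the check fails on log-odds predictions from the fitted model, that would suggest the constraint is not reaching the `xgboost` call.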
Below are some notes on addressing the rarity of first outbreaks. WAHIS includes the first outbreak point and subsequent outbreaks that are part of the same event. Below we have discussed ways to handle this, but I don't think it's an immediate priority.
Relevant papers on spatial models.