ecohealthalliance / open-rvfcast

Wellcome Open RVFCast project repository

Create a first version of the model #95

Open emmamendelsohn opened 3 weeks ago

emmamendelsohn commented 3 weeks ago

Current status (2024-06-28): we have a workflow for model splitting and fitting using tidymodels. There is some commented-out code to create Ceteris Paribus profiles (https://github.com/ecohealthalliance/open-rvfcast/blob/feature/outbreak-layer/_targets.R#L575-L623). I think this code is working.

We still need to set up a target that selects the best parameters after cross-validation. This should be doable through tidymodels. Then we need to fit the final version of the model.
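As a rough illustration of the selection step (not the project's code), the sketch below averages a metric across folds for each candidate parameter set and keeps the best one. In tidymodels this is the job of `tune::select_best()` followed by `tune::finalize_workflow()`; the function name, the `cv_results` layout, the parameter names, and the metric values here are all made up for the example.

```python
from collections import defaultdict
from statistics import mean


def select_best(cv_results, metric="roc_auc", maximize=True):
    """Return (params, mean_score) for the candidate with the best mean metric.

    `cv_results` is a hypothetical list of per-fold rows, one per
    (parameter set, fold, metric) combination.
    """
    scores = defaultdict(list)
    for row in cv_results:
        if row["metric"] == metric:
            # Use a sorted tuple of items so the parameter set is hashable.
            key = tuple(sorted(row["params"].items()))
            scores[key].append(row["value"])
    if not scores:
        raise ValueError(f"no rows found for metric {metric!r}")
    pick = max if maximize else min
    best_key = pick(scores, key=lambda k: mean(scores[k]))
    return dict(best_key), mean(scores[best_key])


# Made-up cross-validation results: three candidates, two folds each.
cv_results = [
    {"params": {"tree_depth": 2, "learn_rate": 0.1}, "fold": 1, "metric": "roc_auc", "value": 0.71},
    {"params": {"tree_depth": 2, "learn_rate": 0.1}, "fold": 2, "metric": "roc_auc", "value": 0.69},
    {"params": {"tree_depth": 4, "learn_rate": 0.1}, "fold": 1, "metric": "roc_auc", "value": 0.78},
    {"params": {"tree_depth": 4, "learn_rate": 0.1}, "fold": 2, "metric": "roc_auc", "value": 0.74},
    {"params": {"tree_depth": 6, "learn_rate": 0.1}, "fold": 1, "metric": "roc_auc", "value": 0.80},
    {"params": {"tree_depth": 6, "learn_rate": 0.1}, "fold": 2, "metric": "roc_auc", "value": 0.66},
]

best_params, best_score = select_best(cv_results)
print(best_params, round(best_score, 3))
```

The final model would then be refit on the full training set with `best_params`. Note that the deepest candidate has the best single fold but not the best mean, which is why the selection averages over folds.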

Something to look into: it's unclear whether tidymodels feeds the interaction constraints into the xgboost call (https://github.com/ecohealthalliance/open-rvfcast/blob/feature/outbreak-layer/R/model_specs.R#L25). You can potentially check this by extracting the model object from tidymodels and inspecting it. Otherwise you can look at the ceteris paribus plots: the profile lines for the variable area, which is the variable with the constraint on it, should be fully parallel. If the constraint is not working as expected, you may need to lift the workflow out of tidymodels.
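The parallel-lines check can also be done numerically rather than by eye. The sketch below is a minimal, library-free illustration: `constrained_model` and `unconstrained_model` are hypothetical stand-ins for a fitted model's predict function, and the variable names and coefficients are invented. One caveat worth stating: boosted trees add up on the log-odds (margin) scale, so with a logistic objective the profiles should be parallel on the log-odds scale, and will generally not look exactly parallel when plotted as probabilities.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


def ceteris_paribus(predict, row, variable, grid):
    """Predictions for one observation as `variable` sweeps `grid`, all else fixed."""
    return [predict({**row, variable: v}) for v in grid]


def profiles_parallel(profiles, tol=1e-8):
    """True if every profile differs from the first by a constant vertical shift."""
    ref = profiles[0]
    for prof in profiles[1:]:
        gaps = [a - b for a, b in zip(prof, ref)]
        if max(gaps) - min(gaps) > tol:
            return False
    return True


# Hypothetical stand-in for a model where area does not interact with anything:
# margin = g(area) + h(rainfall).
def constrained_model(row):
    return sigmoid(0.8 * math.log(row["area"]) + 1.5 * row["rainfall"] - 3.0)


# Hypothetical stand-in for a model where area and rainfall do interact.
def unconstrained_model(row):
    return sigmoid(0.8 * math.log(row["area"]) * (1.0 + row["rainfall"]) - 3.0)


rows = [{"area": 10.0, "rainfall": r} for r in (0.1, 0.5, 0.9)]
grid = [1.0, 5.0, 25.0, 125.0]


def area_profiles(model, transform):
    return [[transform(p) for p in ceteris_paribus(model, row, "area", grid)]
            for row in rows]


constrained_ok = profiles_parallel(area_profiles(constrained_model, logit))
unconstrained_ok = profiles_parallel(area_profiles(unconstrained_model, logit))
# Same constrained model, but compared on the probability scale.
constrained_prob_ok = profiles_parallel(area_profiles(constrained_model, lambda p: p))

print(constrained_ok, unconstrained_ok, constrained_prob_ok)
```

If the real model's area profiles fail this check on the log-odds scale, that would suggest the constraint is not reaching xgboost.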

As a conceptual note, we're including the interaction constraint to prevent area from interacting with other variables, as a way to normalize results to polygon area size. TBH, I'm struggling with the logic behind this. To me, it seems like splitting on area still enforces the relationship that greater area -> greater outbreak probability? Or perhaps the idea is that, because the area splits are independent of the other variables, the model basically generates predictions for every "level" (as defined by the splits) of area?

Below are some notes on addressing the rarity of first outbreaks. WAHIS includes the first outbreak point and subsequent outbreaks that are part of the same event. The notes discuss ways to handle this, but I don't think it's an immediate priority.

Relevant papers on spatial models.

n8layman commented 7 hours ago

So by specifying area in the interaction constraints, we are forcing xgboost to either split on area alone or to split on a mix of the other explanatory variables. That then means that the influence of all the other variables is independent of area, right? That seems kind of cool.
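To make that grouping concrete, here is a small sketch of how such a constraint could be written down. xgboost's Python interface takes `interaction_constraints` as a nested list of feature indices, where only features in the same group may appear together along a tree branch. The helper name and the feature names below are hypothetical, and how the equivalent list should be passed through the tidymodels engine arguments is an assumption that would need checking against the fitted object.

```python
def isolate_feature(feature_names, isolated):
    """Build constraint groups: `isolated` alone, all other features together.

    Returns a nested list of 0-based column indices, e.g. [[2], [0, 1, 3]].
    """
    if isolated not in feature_names:
        raise ValueError(f"{isolated!r} is not in feature_names")
    alone = [feature_names.index(isolated)]
    others = [i for i, name in enumerate(feature_names) if name != isolated]
    return [alone, others]


# Hypothetical predictor columns, in the order they appear in the model matrix.
features = ["rainfall", "ndvi", "area", "temperature"]
groups = isolate_feature(features, "area")
print(groups)
```

With these groups, a tree that splits on area can only split on area again further down that branch, while a tree that starts on any other variable can mix freely among the others. Summed over trees, that gives a prediction on the log-odds scale of the form g(area) + h(other variables), which is the independence described above.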