andybega / Blair-Sambanis-replication

Replication of 'Forecasting Civil Wars: Theory and Structure in an Age of “Big Data” and Machine Learning' by Blair and Sambanis 2020

Debug what data do the PITF Split Population models actually use #5

Closed andybega closed 4 years ago

andybega commented 4 years ago

I'm a bit confused about what data the PITF split pop models actually use.

Conceptually, B&S split the data into two pieces, low- and high-risk cases, using the PITF model predictions, then train two separate RF models that each use the escalation feature specification, and combine the two sub-population predictions again at the end.

The models in 1mo_run_escalation_split_PITF.R use data in escalation_highriskPITF_train_frame_civil_ns and escalation_lowriskPITF_train_frame_civil_ns.

So far so good, but when I look in 1mo_define_models.R, I don't see these data frames being created any differently from the ones for all the other models, i.e. they do not use the train_highrisk and train_lowrisk data frames that are set up in +master.R (lines 71-90).
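
For reference, this is roughly what I expected the split-population setup to look like. This is just my sketch using the objects from +master.R and the escalation formula, not code from the repo; names like rf_high and pred are mine, and ntree is kept small for illustration.

library(randomForest)

# fit the escalation specification separately on the two risk groups ...
rf_high <- randomForest(escalation_train_formula_civil_ns, data = train_highrisk, ntree = 1000)
rf_low  <- randomForest(escalation_train_formula_civil_ns, data = train_lowrisk,  ntree = 1000)

# ... and score each test sub-population with its own model, stacking the predictions
pred <- c(predict(rf_high, newdata = test_highrisk),
          predict(rf_low,  newdata = test_lowrisk))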

Searching for the string "train_lowrisk" in all files shows it only appears in +master.R. If the high- and low-risk models are really just two versions of the same escalation RF (i.e. the same specification on the same data), then the only difference in their predictions is due to RNG variation. Which, looking at the AUC-ROC values in Table 2, might well be the case: both differ by only 0.01, which could be due to RNG variation.
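
Something like this reproduces that search from R, assuming the working directory is the replication folder (the variable names here are mine):

files <- list.files(".", pattern = "[.]R$", recursive = TRUE, full.names = TRUE)
hits  <- sapply(files, function(f) any(grepl("train_lowrisk", readLines(f, warn = FALSE), fixed = TRUE)))
files[hits]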

andybega commented 4 years ago

I ran the replication code in +master.R through to the model definitions.

> escalation_highriskPITF_train_formula_civil_ns
train_DV_civil_ns ~ gov_opp_low_level + gov_reb_low_level + opp_gov_low_level + 
    reb_gov_low_level + gov_opp_nonviol_repression + gov_reb_nonviol_repression + 
    gov_opp_accommodations + gov_reb_accommodations + reb_gov_demands + 
    opp_gov_demands
> escalation_train_formula_civil_ns
train_DV_civil_ns ~ gov_opp_low_level + gov_reb_low_level + opp_gov_low_level + 
    reb_gov_low_level + gov_opp_nonviol_repression + gov_reb_nonviol_repression + 
    gov_opp_accommodations + gov_reb_accommodations + reb_gov_demands + 
    opp_gov_demands

The formulas for the split and non-split escalation models are identical, as expected.

The data frames used also appear to be identical:

> dim(escalation_highriskPITF_train_frame_civil_ns)
[1] 13776    11
> dim(escalation_lowriskPITF_train_frame_civil_ns)
[1] 13776    11
> dim(escalation_train_frame_civil_ns)
[1] 13776    11
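
Matching dimensions alone don't prove it, but a direct check (with those objects still in the workspace) would settle whether they are literally the same data; this is my check, not part of the replication code:

identical(escalation_highriskPITF_train_frame_civil_ns, escalation_train_frame_civil_ns)
identical(escalation_lowriskPITF_train_frame_civil_ns,  escalation_train_frame_civil_ns)
# TRUE for both would mean all three models are fit on exactly the same rows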

Ok, so I think these are identical RF models being run, not the intended models that run on sub-portions of the full training data. Checking against the data objects that are created in the master script, but apparently never used:

> nrow(train)
[1] 13776
> nrow(train_highrisk)
[1] 9912
> nrow(train_lowrisk)
[1] 3360
> nrow(test)
[1] 15744
> nrow(test_highrisk)
[1] 11328
> nrow(test_lowrisk)
[1] 3840

Hmm, yeah, ok. So in the B&S paper Table 2, the "PITF Split Population" model results are actually based on predictions from two RF models running with the same formula on the same data. The only variation between the PITF Split Population model and the Escalation Only model is due to RNG variation in the trained RFs.

TBC: the PITF Split Population implementation reported in the B&S paper is actually wrong.

andybega commented 4 years ago

A bit more detail:

That weird combine() function in 1mo_run_escalation_split_PITF.R is from the randomForest package itself.

# Run random forests model for high-risk split  
train_model_highriskPITF <- randomForest(escalation_highriskPITF_train_formula_civil_ns,
                                         type = regression, importance = TRUE, 
                                         na.action = na.omit, ntree = 10000, maxnodes = 5, 
                                         sampsize = 100, 
                                         replace = FALSE, do.trace = FALSE , 
                                         data = escalation_highriskPITF_train_frame_civil_ns, forest = TRUE)                     

# Run random forests model for low-risk split
train_model_lowriskPITF <- randomForest(escalation_lowriskPITF_train_formula_civil_ns,
                                        type = regression, importance = TRUE, 
                                        na.action = na.omit, ntree = 10000, maxnodes = 5, 
                                        sampsize = 100, 
                                        replace = FALSE, do.trace = FALSE , 
                                        data = escalation_lowriskPITF_train_frame_civil_ns, forest = TRUE)

# Combine random forests models for high-risk and low-risk splits
train_model <- randomForest::combine(train_model_highriskPITF, train_model_lowriskPITF)
> train_model_highriskPITF

Call:
 randomForest(formula = escalation_highriskPITF_train_formula_civil_ns,      data = escalation_highriskPITF_train_frame_civil_ns, type = regression,      importance = TRUE, ntree = 10000, maxnodes = 5, sampsize = 100,      replace = FALSE, do.trace = FALSE, forest = TRUE, na.action = na.omit) 
               Type of random forest: regression
                     Number of trees: 10000
No. of variables tried at each split: 3

          Mean of squared residuals: 0.0007546523
                    % Var explained: 0.4
> train_model_lowriskPITF

Call:
 randomForest(formula = escalation_lowriskPITF_train_formula_civil_ns,      data = escalation_lowriskPITF_train_frame_civil_ns, type = regression,      importance = TRUE, ntree = 10000, maxnodes = 5, sampsize = 100,      replace = FALSE, do.trace = FALSE, forest = TRUE, na.action = na.omit) 
               Type of random forest: regression
                     Number of trees: 10000
No. of variables tried at each split: 3

          Mean of squared residuals: 0.0007542565
                    % Var explained: 0.45
> train_model 

Call:
 randomForest(formula = escalation_highriskPITF_train_formula_civil_ns,      data = escalation_highriskPITF_train_frame_civil_ns, type = regression,      importance = TRUE, ntree = 10000, maxnodes = 5, sampsize = 100,      replace = FALSE, do.trace = FALSE, forest = TRUE, na.action = na.omit) 
               Type of random forest: regression
                     Number of trees: 20000
No. of variables tried at each split: 3

Yep, combine() just pools the trees into a new RF model. Nothing split-population is happening here. The only difference in outcomes compared to "Escalation Only" is that we have 200,000 instead of 100,000 trees in the model and the RNG seed was not the same.
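
A toy example (built-in mtcars data, nothing from the replication; rf_a etc. are my names) shows that randomForest::combine() only pools trees:

library(randomForest)
set.seed(1)
rf_a  <- randomForest(mpg ~ ., data = mtcars, ntree = 50)
rf_b  <- randomForest(mpg ~ ., data = mtcars, ntree = 50)
rf_ab <- randomForest::combine(rf_a, rf_b)
rf_ab$ntree  # 100: the trees are simply concatenated into a single forest
# predict(rf_ab, newdata = mtcars) then averages over all 100 trees; no gating by subgroup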

andybega commented 4 years ago

Ok, reading their description of the model again, I misunderstood what they intended to do.

The final approach is a random forest analog to split-population modeling. We first compute the average PITF predicted probability for each country across all years in our training set. We define those that fall in the bottom quartile as “low risk” and the rest as “high risk.” We then run our escalation model on the high-risk and low-risk subsets separately, combining the results into a single random forest (column 4). For purposes of comparison, Table 2 also reproduces the escalation model results from Table 1 (column 1) and includes results from a purely structural model composed of the four PITF predictors alone (column 5).

I thought they were separately predicting the low- and high-risk test cases, based on the PITF model predictions. That's not what they are doing. They are training two separate random forests, but intentionally combining them into one and then using that single 200,000-tree forest to predict for the test data. This is, I guess, kind of like having a weird additional form of randomization in the model.
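
In pseudo-R, the intended procedure is something like this (my sketch with a smaller ntree, not the repo code, which as noted feeds both models the full training frame):

rf_high <- randomForest(escalation_train_formula_civil_ns, data = train_highrisk, ntree = 1000)
rf_low  <- randomForest(escalation_train_formula_civil_ns, data = train_lowrisk,  ntree = 1000)
rf_both <- randomForest::combine(rf_high, rf_low)

# every test case, high or low risk, is scored by the same pooled forest
pred <- predict(rf_both, newdata = test)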

I don't think this is going to work well. The PITF model predictions aren't actually used at the prediction step at all, so really the only additional thing is the weird randomization.

andybega commented 4 years ago

I get a warning when combining two forests trained on different data:

>     fitted_model <- randomForest::combine(highrisk_model, lowrisk_model)
Warning message:
In rf$predicted + rflist[[i]]$predicted * rflist[[i]]$ntree :
  longer object length is not a multiple of shorter object length

But it works, i.e. it returns a new, larger forest. The warning just comes from combine() trying to combine the out-of-bag $predicted vectors, which have different lengths when the forests were trained on different-sized data; the trees themselves are still simply pooled. Meh.

andybega commented 4 years ago

Fixed; incorporated into the paper.