emhwebb / reed-thesis

A template repository for a Reed senior thesis

Review Simulation Results #2

Open emhwebb opened 6 years ago

emhwebb commented 6 years ago

@andrewpbray

andrewpbray commented 6 years ago

I've added some comments and questions to the commit. In retrospect, it probably would have been easier just to put them here, so please feel free to move your responses over to this thread.

emhwebb commented 6 years ago

I'll answer the comments and questions from the commit over here:

(i) The first dataset is just a toy dataset I made to test the functions as I wrote them.

(ii) I think I ironed out the bugs in the code. I've gotten the hang of using map() and map_dfc() from the purrr package.
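To give a sense of the pattern (this is a toy stand-in, not the actual thesis code; `one_run` and the little linear model are just hypothetical placeholders for the per-run importance computation):

```r
library(purrr)
library(tibble)

# Toy stand-in for one simulation run: returns a one-column tibble
# holding a single estimated quantity, named by the run id.
one_run <- function(run_id) {
  set.seed(run_id)
  x <- rnorm(100)
  y <- 2 * x + rnorm(100)
  out <- tibble(slope = coef(lm(y ~ x))[["x"]])
  names(out) <- paste0("run_", run_id)
  out
}

sim_list <- map(1:10, one_run)      # a list of one-column tibbles
sim_df   <- map_dfc(1:10, one_run)  # the same results, column-bound
```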

(iii) The simulated data that appears in the third code chunk is the same simulated dataset used in Strobl and Aurora's papers. The first simulation uses independent normal predictors and the second uses correlated normal predictors. In terms of interpretation, the first table (where the row entries read diff.wo.var etc.) gives the variable importances computed from random forests run on the added variable plots, while the second table is the variable importance table from the full random forest model. I'm not sure why the added variable plots aren't showing up; I'll try to fix that tomorrow.
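For context, the two setups look roughly like this (the coefficients and correlation values here are illustrative, not necessarily the exact ones from the papers):

```r
library(MASS)  # for mvrnorm()

n <- 500
p <- 6

# Simulation 1: independent standard normal predictors.
X_indep <- matrix(rnorm(n * p), nrow = n)

# Simulation 2: correlated normal predictors; here the first three
# predictors get pairwise correlation 0.9, the rest stay independent.
Sigma <- diag(p)
Sigma[1:3, 1:3] <- 0.9
diag(Sigma) <- 1
X_corr <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)

# A linear response in which only the first few predictors matter.
beta <- c(5, 3, 1, 0, 0, 0)
y <- drop(X_corr %*% beta) + rnorm(n)
```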

(iv) I'll look into comparing this method to the conditional permutation scheme Strobl proposes. I built in an added variable plot function, but for some reason it isn't showing up in my GitHub commit. I should have the interaction and non-linearity simulations finished tomorrow.

(v) Part of what I've noticed is that this scheme is a bit noisy. When we fit a forest to the residuals for an irrelevant variable, the random forest will still try to find signal in what is essentially noise. On the other hand, for variables that carry true signal with respect to the response, the method seems capable of finding it. What you saw in the first chunk of simulations is that when the response is a sum of independent normal random variables, the full random forest model that includes all predictors performs better than the added variable importance scheme. However, I think this method handles correlated variables reasonably well compared to the original MDA variable importance. I've run some simulations with a non-linear response term, and compared to the original method, the added variable importance seems able to identify which variables are most important. The persistent issue is noisiness: if we could figure out how to handle irrelevant variables, I think this method has real merit.
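To make the scheme concrete, here is a rough sketch of the added variable importance idea as I've been describing it (`av_importance` is a hypothetical stand-in; the actual thesis code differs in its details):

```r
library(randomForest)

# Sketch of the added-variable importance scheme: for each predictor,
#  1. fit a forest of y on the other predictors, take OOB residuals,
#  2. fit a forest of x_j on the other predictors, take OOB residuals,
#  3. fit a forest of the y-residuals on the x_j-residuals and record
#     its permutation (MDA-style) importance.
av_importance <- function(X, y, ntree = 500) {
  p <- ncol(X)
  imp <- numeric(p)
  for (j in seq_len(p)) {
    fit_y <- randomForest(X[, -j, drop = FALSE], y, ntree = ntree)
    r_y <- y - predict(fit_y)  # predict() with no newdata gives OOB predictions
    fit_x <- randomForest(X[, -j, drop = FALSE], X[, j], ntree = ntree)
    r_x <- X[, j] - predict(fit_x)
    fit_av <- randomForest(data.frame(r_x = r_x), r_y,
                           ntree = ntree, importance = TRUE)
    imp[j] <- importance(fit_av, type = 1)[1]  # %IncMSE
  }
  setNames(imp, colnames(X))
}
```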

andrewpbray commented 6 years ago

To your last point, is it fair to say that the model is overfitting? If so, we should be looking in the realm of regularization. I think we have several tools to look to there, including penalizing the loss function and adding a more stringent stopping rule for each forest. I wouldn't be surprised if there is some literature on constraining RF models.

emhwebb commented 6 years ago

I think it's possible that the model is overfitting; the variable importance results on the added variable plots point in that direction. I'll look into regularizing random forests more closely. A quick Google search suggests the easiest way to regularize a random forest is to increase the minimum node size, i.e., to grow an ensemble of shallow trees.
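Something along these lines, using the `nodesize` (and optionally `maxnodes`) arguments of `randomForest` (toy data, illustrative values):

```r
library(randomForest)

set.seed(1)
X <- matrix(rnorm(500 * 6), nrow = 500)
y <- X[, 1] + rnorm(500)

# Default regression nodesize is 5, so trees are grown deep.
fit_deep <- randomForest(X, y, ntree = 500)

# A larger nodesize (and/or a cap on terminal nodes via maxnodes)
# forces an ensemble of shallow trees, i.e. a more regularized forest.
fit_shallow <- randomForest(X, y, ntree = 500,
                            nodesize = 50, maxnodes = 25)
```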

emhwebb commented 6 years ago

Ah, also, I'll be updating the simulation results for the non-linear response and interactions this afternoon. The added variable plot scheme seems quite successful at handling correlated non-linear and correlated interaction terms. I also tried implementing the conditional inference forest from the partykit package, but the conditional permutation scheme is too slow to run, even with only 100 trees. I looked at the documentation, and it turns out the conditional inference forest and the conditional variable importance scheme are implemented in R rather than in a faster language like C or Fortran.
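The calls I was timing were along these lines (toy data; `cforest()` and `varimp(..., conditional = TRUE)` from partykit):

```r
library(partykit)

set.seed(1)
dat <- data.frame(matrix(rnorm(200 * 5), nrow = 200))
dat$y <- dat$X1 + rnorm(200)

# Even at a modest ntree this is slow, since partykit's cforest()
# and its conditional permutation importance are implemented in R.
cf <- cforest(y ~ ., data = dat, ntree = 100)
vi_conditional <- varimp(cf, conditional = TRUE)
```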