Added the RF_Utilities.R script used in my random forest tutorial.

nearinj commented 5 years ago

Would be great if someone looked over this real quick and also took a peek at the tutorial that I wrote, which can be found here:

https://github.com/LangilleLab/microbiome_helper/wiki/Random-Forest-in-R-with-Large-Sample-Sizes

gavinmdouglas commented 5 years ago

I can take a closer look on Monday!

gavinmdouglas commented 5 years ago

I looked through the tutorial and overall it looks good, although I think it could use some more tweaks. I made a few (very minor) edits already and you can see all of my comments below. The key thing is that I ran into an error when running the main RF command, which needs to be troubleshot.

Major

I would recommend italicizing / bolding key points of the text since many readers will likely be skimming the first sections. Also I find wrapping the names of functions and objects in "`" makes them easier to read.
Should clarify in background which type of variable importance the scrambling method is - Gini importance is the default in certain implementations so this could be confusing.
When describing k-fold CV I think you mean "equal-sized" rather than "even numbered", and it would only be repeated k times and not "n times", correct? I think it would be referred to as repeated k-fold CV if the entire procedure is repeated n times.
mtry is the most important hyper-parameter, but not the only one for RF. See https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html (albeit not all of these parameters are available in the R implementation).
Specify that the commands should be run in an R environment and not on command-line.
Add remove_rare function to RF_Utilities.R
Need a few sentences describing what "get_rf_results" does overall - i.e. what are the main steps that this function performs? I know that you mentioned this above, but I don't think it's clear that the function performs all of the key steps currently.
"nvalues: set to TRUE to print out the expected results from this" in get_rf_results commands - I think this is a typo?
Also in that command "functioncrossrepeats" should be "ncrossrepeats" I think.
I received this error when running the "get_rf_results" command:
    Error in randomForest.default(x, y, mtry = param$mtry, ...) : NA not permitted in predictors
    In addition: Warning message:
    In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
AUCS -> AUCs
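
For the NA error above, here is a minimal sketch of how the predictors could be checked for missing values before calling the RF command. The table, sample names, and genus names are made up for illustration and are not from the tutorial data, and this only locates NAs rather than explaining where they came from:

```r
# Hypothetical feature table (samples in rows, genera in columns);
# the names and values are invented for illustration only.
features <- data.frame(
  genus_a = c(0.10, 0.25, NA, 0.40),
  genus_b = c(0.90, 0.75, 0.60, 0.60),
  row.names = c("s1", "s2", "s3", "s4")
)

# Count missing values per predictor column.
na_per_column <- colSums(is.na(features))

# Columns and samples that contain at least one NA.
na_columns <- names(na_per_column)[na_per_column > 0]
na_rows <- rownames(features)[!complete.cases(features)]

print(na_columns)
print(na_rows)
```

If either vector is non-empty, randomForest will refuse the input, so it is worth running a check like this on the table right before it is passed to the function.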

Minor

Remove "Last Edited" date since this is shown automatically at the top of the page (and this often causes confusion if you forget to update this date).
Under "Requirements" clarify that exact versions of packages not required, but that those were the versions used when you ran the commands (also report your versions of the last 3 R packages)
Cite the study (a meta-analysis) and the github repo that produced the processed tutorial datafiles: https://github.com/cduvallet/microbiomeHD
I'm not sure it makes sense to contrast RF with "decision trees" - I think what you're thinking of is general "bagging" methods. As far as I know decision trees don't imply a single algorithm (i.e. I think it's correct to call the trees output by RF decision trees).
Plural of index is indices not indexs
Mention the purpose of this line: rownames(clean_metadata) <- gsub("-","\\.",rownames(clean_metadata))
Maybe use a few lines to describe why removing rare genera is helpful?
Maybe plot raw vs CLR-transformed values to show the difference?
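
On the gsub line above, a small sketch of what it does. The purpose is my guess and is not stated in the tutorial: I am assuming the sample IDs in the abundance table had "-" converted to "." when they were read into R as column names with default settings, so the metadata row names need the same change to match. The sample names and metadata column below are made up:

```r
# Hypothetical metadata with sample IDs that contain "-".
clean_metadata <- data.frame(
  disease = c("case", "control"),
  row.names = c("sample-1", "sample-2")
)

# Replace every "-" in the row names with ".".
rownames(clean_metadata) <- gsub("-", "\\.", rownames(clean_metadata))

# make.names() is what read.table() applies to column names when
# check.names = TRUE, so this mimics IDs read in as column headers.
syntactic_names <- make.names(c("sample-1", "sample-2"))

# TRUE when the edited metadata IDs line up with the converted IDs.
names_match <- identical(rownames(clean_metadata), syntactic_names)

print(rownames(clean_metadata))
print(names_match)
```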

gavinmdouglas commented 5 years ago

It's hard to evaluate the R code without running actual tests, but one minor thing in the Rscript is that rather than the title "Script to run the main RF pipeline" I think saying that it contains R functions for running RF pipelines is more accurate.

gavinmdouglas commented 5 years ago

Last thoughts:

A better name for this monolithic RF function might be "RF_pipeline" or something along those lines rather than "get_rf_results", but that's up to you.
It might be worthwhile making a really basic R package for these functions, which could be part of a different github repo (and installed with devtools). This would make the documentation and usage a lot clearer. It seems like this is pretty straightforward based on this tutorial: https://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/

gavinmdouglas commented 5 years ago

@nearinj - friendly reminder about this PR so it isn't lost to the ages.

nearinj commented 5 years ago

Just saw this, I will look into this and fix it up ASAP.

nearinj commented 5 years ago

I went ahead and uploaded my own package to a github repo that can be installed using devtools. I will link to this repo within the wiki tutorial and therefore not merge the changes in this pull request.

https://github.com/nearinj/RandomForestUtils

LangilleLab / microbiome_helper

Added the RF_Utilities.R script used in my random forest tutorial. #40
