Sage-Bionetworks / NF_LandscapePaper_2019

This repository hosts all the code used to generate analyses and figures for the landscape paper

Come up with strategy of selecting LVs from Random Forest #68

Closed: sgosline closed this issue 4 years ago

sgosline commented 4 years ago

Seeking comments from @jaybee84, but I would also like some broader feedback. Since we're now using the random forest to select latent variables for further analysis, I'd like to determine what our threshold should be. Currently we've selected the 'top 10' by importance for each tumor type, and I've done some sort of hack to take the mean importance scores and select the top 10 of those.

Is there some way to get the minimum set of variables needed to predict? It can be more than 10, but I'd like a concrete metric on which to base our downstream analysis, which includes:

1. correlation with immune (20)
2. correlation with MetaViper
3. search for gene variants that predict LV differences
4. interrogation of recount2 data

Is there some standard 'importance' threshold that we can use? Any other ideas?
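
For reference, the current hack looks roughly like this (a minimal sketch with made-up data; `imp_df` and its column names are placeholders, not the actual pipeline objects):

```r
library(dplyr)

# Hypothetical importance table: one row per LV per tumor-type model
set.seed(1)
imp_df <- expand.grid(LV = paste0("LV", 1:50),
                      tumorType = c("cNF", "pNF", "MPNST", "NF"))
imp_df$importance <- runif(nrow(imp_df))

# Current hack: average importance across tumor types, take the top 10
top_by_mean <- imp_df %>%
  group_by(LV) %>%
  summarise(mean_importance = mean(importance), .groups = "drop") %>%
  arrange(desc(mean_importance)) %>%
  slice_head(n = 10)

# The per-tumor-type version: top 10 by importance within each class
top_by_type <- imp_df %>%
  group_by(tumorType) %>%
  slice_max(importance, n = 10)
```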

jaybee84 commented 4 years ago

We may be able to use the "Mean decrease in Gini index" or "Mean decrease in accuracy" measure to set a threshold. The (admittedly subjective) way to do it seems to be to plot these indices and choose the variables that show the greatest decrease.

Currently caret lets me plot the indices for single forests. I am trying to see if we can extract the values of these indices from our ensemble forest data...
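
For a single forest, both measures can be pulled directly from the fitted model; a minimal sketch (using `iris` as stand-in data):

```r
library(randomForest)

# Toy example with iris; in practice x would be the LV matrix and y the tumor type
rf <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 500)

# Per-feature MeanDecreaseAccuracy and MeanDecreaseGini
imp <- importance(rf)
head(imp)

# Quick visual check of where the importance curve flattens out
varImpPlot(rf)

# If the model was fit through caret::train(..., method = "rf"), the underlying
# randomForest object is in fit$finalModel, so importance(fit$finalModel) works too
```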

sgosline commented 4 years ago

OK. Maybe also check in with the literature to see what others have done.

jaybee84 commented 4 years ago

So after a quick look through the literature, it seems that there are multiple answers to this question:

  1. Some used two or more different algorithms and then kept the common features that scored highly in both (https://insights.ovid.com/crossref?an=00005792-201910250-00069, https://plantmethods.biomedcentral.com/articles/10.1186/s13007-019-0508-7)
  2. Some used Gini index or MDA from the highest performing forest (https://www.ncbi.nlm.nih.gov/pubmed/31746764)
  3. Some used the top X features (or the union of such features) across different runs of the forest (https://www.ncbi.nlm.nih.gov/pubmed/31723409.2, https://onlinelibrary.wiley.com/doi/full/10.1111/eva.12524)
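
Options 1 and 3 boil down to simple set operations on the selected feature lists; a sketch with made-up LV names (the real lists would come from the importance rankings):

```r
# Hypothetical feature lists standing in for the real importance rankings
top_rf    <- c("LV1", "LV5", "LV12", "LV30")   # e.g. top features from a random forest
top_other <- c("LV5", "LV12", "LV44")          # e.g. top features from a second algorithm
runs      <- list(run1 = c("LV1", "LV5"),
                  run2 = c("LV5", "LV12"),
                  run3 = c("LV12", "LV30"))

# Option 1: keep features that score highly under both algorithms
intersect(top_rf, top_other)

# Option 3: union of top-X features across repeated runs of the forest
Reduce(union, runs)
```
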
jaybee84 commented 4 years ago

One possible way to select features that I tried (a hybrid of the approaches mentioned above; a rough sketch follows the list):

  1. Select the top 100 most important features according to median importance scores from the initial 500 iterations of RFs.
  2. Use only those 100 features (LVs) to run RFs (500 iterations) and compare the median F1 scores to the initial distribution of F1 scores.
  3. If the median F1 scores are similar to or better than the earlier F1 scores, we can select those features as the most important LVs.
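
In code, the procedure could look something like this (a sketch only: toy data, `run_rf_once` is a hypothetical helper, and the 40/100 feature cutoff is the parameter under discussion):

```r
library(randomForest)
library(caret)

set.seed(42)
expr_mat   <- matrix(rnorm(200 * 50), nrow = 200,
                     dimnames = list(NULL, paste0("LV", 1:50)))
tumor_type <- factor(sample(c("cNF", "pNF", "MPNST", "NF"), 200, replace = TRUE))

# One iteration: fresh 75/25 split, fit a forest, record importance and per-class F1
run_rf_once <- function(x, y) {
  idx  <- createDataPartition(y, p = 0.75, list = FALSE)
  rf   <- randomForest(x[idx, , drop = FALSE], y[idx], importance = TRUE)
  pred <- predict(rf, x[-idx, , drop = FALSE])
  list(importance = importance(rf)[, "MeanDecreaseGini"],
       f1         = confusionMatrix(pred, y[-idx])$byClass[, "F1"])
}

# Step 1: 500 iterations on all LVs, rank by median importance
runs_all   <- replicate(500, run_rf_once(expr_mat, tumor_type), simplify = FALSE)
imp_median <- apply(sapply(runs_all, `[[`, "importance"), 1, median)
top_lvs    <- names(sort(imp_median, decreasing = TRUE))[1:40]   # or 1:100

# Step 2: 500 iterations restricted to the selected LVs
runs_top <- replicate(500, run_rf_once(expr_mat[, top_lvs], tumor_type),
                      simplify = FALSE)

# Step 3: compare the per-class median F1 scores of the two ensembles
f1_all <- sapply(runs_all, `[[`, "f1")
f1_top <- sapply(runs_top, `[[`, "f1")
apply(f1_top, 1, median, na.rm = TRUE) - apply(f1_all, 1, median, na.rm = TRUE)
```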

Using this strategy, the median scores for most classes seemed to improve compared to the initial ensemble model (which used all LVs). So maybe we can use the top 100 important LVs for the subsequent correlation analyses.

Thoughts?

cgreene commented 4 years ago

It's not inherently clear to me from your description: is it possible you're overfitting in the second stage? Are you doing some sort of holdout that you're evaluating on at the end?

sgosline commented 4 years ago

Yeah, I think any of the previously published methods seems suitable for selecting LVs to focus on. If we select the top 100 for each tumor type, how many overlap? How many are present for only a single tumor type?

jaybee84 commented 4 years ago

@cgreene: In both cases of running 500 iterations of RFs, I am using a training set (75% of the data, with 5-fold cross-validation) and a hold-out test set (25% of the data), both generated independently by random sampling at each iteration.
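
Concretely, the resampling at each iteration looks roughly like this (a sketch with toy data standing in for the LV matrix and tumor-type labels):

```r
library(caret)

set.seed(7)
x <- matrix(rnorm(200 * 20), nrow = 200,
            dimnames = list(NULL, paste0("LV", 1:20)))
y <- factor(sample(c("cNF", "pNF", "MPNST", "NF"), 200, replace = TRUE))

# 75/25 split, redrawn at each of the 500 iterations
train_idx <- createDataPartition(y, p = 0.75, list = FALSE)

# 5-fold cross-validation within the training portion
fit <- train(x[train_idx, ], y[train_idx],
             method    = "rf",
             trControl = trainControl(method = "cv", number = 5))

# Evaluate on the 25% hold-out for this iteration
confusionMatrix(predict(fit, x[-train_idx, ]), y[-train_idx])
```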

jaybee84 commented 4 years ago

Edit to earlier post: I selected the top 40 features (not the top 100 features) from each class, since the mean decrease in Gini from the initial forest plateaued after the top 40 features.
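
The plateau was judged by eye from the sorted importance curve; a minimal sketch of that inspection (toy values in place of the real per-LV Gini scores):

```r
# Toy values in place of the real per-LV median MeanDecreaseGini scores
set.seed(3)
gini <- sort(rexp(200, rate = 2), decreasing = TRUE)

plot(gini, type = "l",
     xlab = "Feature rank", ylab = "Mean decrease in Gini")
abline(v = 40, lty = 2)   # the rank-40 cutoff chosen by eye, as described above
```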

Apologies for the confusion.

jaybee84 commented 4 years ago

@sgosline: If we select the top 40 features for each tumor type, the total number of unique features is 103: 3 features are common to all four classes, 59 are unique to a single tumor type, and 41 are found in 2 or more.
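
The bookkeeping is just a tally over the four top-40 lists; a sketch with placeholder LV names (it will not reproduce the exact 103/3/59/41 numbers):

```r
# Placeholder top-40 lists per tumor type (the real lists come from the forests)
top40 <- list(cNF   = paste0("LV", 1:40),
              pNF   = paste0("LV", 21:60),
              MPNST = paste0("LV", 41:80),
              NF    = paste0("LV", c(1:3, 61:97)))

counts <- table(unlist(top40))
length(counts)       # total number of unique features
sum(counts == 4)     # common to all four classes
sum(counts == 1)     # unique to a single tumor type
sum(counts >= 2)     # found in 2 or more
```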

sgosline commented 4 years ago

Perfect, this sounds well-reasoned! Perhaps we can prune syn21222255 or create a new Synapse table with these 103 LVs, and use those for all downstream analyses.
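
One way that update could be done from R with `synapser` (a sketch only: the column name `latent_var`, the `selected_lvs` vector, and the parent project ID are placeholders):

```r
library(synapser)
synLogin()

# Pull the existing LV table and store a pruned copy as a new table
selected_lvs <- c("LV1", "LV5")   # placeholder for the 103 selected LVs
lv_tbl <- as.data.frame(synTableQuery("SELECT * FROM syn21222255"))
pruned <- lv_tbl[lv_tbl$latent_var %in% selected_lvs, ]   # column name assumed

new_tbl <- synBuildTable("Selected latent variables", "syn00000000", pruned)
synStore(new_tbl)
```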

jaybee84 commented 4 years ago

Updated list of top Latent Vars: https://www.synapse.org/#!Synapse:syn21315356/tables/

cgreene commented 4 years ago

A common mistake that people make when using these approaches is that they filter to more and more "relevant" features by examining performance without holding out a set for the entire evaluation. From the description that you've given it seems like this might be happening here. Can you run this with a set that is entirely held out for evaluation until the final model is constructed?
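
For what it's worth, the pattern being suggested is something like the following (a sketch with toy data): reserve an outer test set before any feature selection, and only score it once at the very end.

```r
library(caret)

set.seed(11)
x <- matrix(rnorm(200 * 20), nrow = 200,
            dimnames = list(NULL, paste0("LV", 1:20)))
y <- factor(sample(c("cNF", "pNF", "MPNST", "NF"), 200, replace = TRUE))

# Carve off an outer test set first; it stays untouched through all
# feature selection and the 500-forest runs
outer_idx <- createDataPartition(y, p = 0.8, list = FALSE)
x_dev <- x[outer_idx, ];  y_dev <- y[outer_idx]
x_out <- x[-outer_idx, ]; y_out <- y[-outer_idx]

# ... run the entire selection pipeline on (x_dev, y_dev) only ...

# Score the outer set exactly once, with the final model
final_fit <- train(x_dev, y_dev, method = "rf",
                   trControl = trainControl(method = "cv", number = 5))
confusionMatrix(predict(final_fit, x_out), y_out)
```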

sgosline commented 4 years ago

@cgreene I believe @jaybee84 is holding out test data for evaluation (though our dataset is admittedly small).

jaybee84 commented 4 years ago

@cgreene: Apologies for any ambiguity earlier. @sgosline is correct in noting that I have been using holdout test data for evaluation of the ensemble of random forests.

What I have done here is that instead of building one final model, I have built 500 final models, where each model was tested with a holdout test dataset (generated by random sampling at each iteration). This gave us a distribution of F1 scores (from good models and bad). Since our focus was on finding important features rather than getting the best model, we wanted a distribution of importance scores so that we could build confidence intervals of importance for each feature.

Having said that, your point about holding out a test set (completely unseen by any of the earlier models) is also well taken. So I took the time to rerun the analyses (hence my delayed response): I again generated the 500 final models but then tested them with a completely naive test set that was held out from the very beginning. The analysis is here. With this approach, restricting the feature set to the top 40 features seemed to improve the median F1 scores even further than in my earlier analyses.
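
For the confidence-interval step, the summary over the 500 models is straightforward; a sketch (with a toy importance matrix standing in for the real per-iteration scores):

```r
# Toy stand-in for the real per-iteration importance scores
# (rows = features, columns = the 500 models)
set.seed(5)
imp_mat <- matrix(rexp(100 * 500), nrow = 100,
                  dimnames = list(paste0("LV", 1:100), NULL))

# Median and 95% interval of importance for each feature
imp_ci <- t(apply(imp_mat, 1, quantile, probs = c(0.025, 0.5, 0.975)))
colnames(imp_ci) <- c("lower95", "median", "upper95")

# Features ranked by median importance, with their intervals
head(imp_ci[order(imp_ci[, "median"], decreasing = TRUE), ])
```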

sgosline commented 4 years ago

Awesome, does this change the top-40 lists at all? Also, can you please update syn21315356 to indicate, for each LV, which disease it was important for predicting - something like this:

| LV  | All | cNF | pNF | MPNST | NF |
|-----|-----|-----|-----|-------|----|
| lv1 | x   | x   |     |       |    |

I need it for my metaviper analysis.

jaybee84 commented 4 years ago

@sgosline: Updated table here.

This has the LVs that were selected in the most recent RF ensemble (#70)