AlineTalhouk / splendid

Supervised Learning Ensemble for Diagnostic Identification
https://alinetalhouk.github.io/splendid/

Feature selection #6

Closed. AlineTalhouk closed this issue 6 years ago.

AlineTalhouk commented 7 years ago

Hi Derek,

How are you doing on the feature selection implementation? Shall we chat at some point tomorrow or Monday? I am testing splendid right now.

dchiu911 commented 7 years ago

Sure, what time are you available to chat?

AlineTalhouk commented 7 years ago

I will come see you when I come in

AlineTalhouk commented 7 years ago

Meet me at 2pm in the 14th-floor meeting room?

AlineTalhouk commented 7 years ago

A reference that may be helpful: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3287491/

AlineTalhouk commented 7 years ago

And another, copied over from the other issue: https://academic.oup.com/bioinformatics/article/23/19/2507/185254/A-review-of-feature-selection-techniques-in

AlineTalhouk commented 7 years ago

@dchiu911, can you please implement the model selection portion? I will continue with the ensemble creation, etc. I am building a pipeline right now. We can go back and fix up the code after May 11.

dchiu911 commented 7 years ago

Which specific methods do you want me to implement?

AlineTalhouk commented 7 years ago

We can maybe start with the typical stuff, e.g. the regularisation methods of glmnet? What do you think?

dchiu911 commented 7 years ago

So you want a filter step (e.g. min variance > 1) before the bootstrap data generation, and then, within each bootstrap replicate, either add some FS methods before classification OR use classification algorithms that have FS embedded?

AlineTalhouk commented 7 years ago

Hmm, we were planning on filtering outside this data set. However, I guess having an optional filtering step could be good overall. I would not worry about it right now; it could be an improvement later.

dchiu911 commented 7 years ago

I am thinking of having a filtering function select_features outside of the training, with several options (min variance, min MAD, etc.), and then figuring out how to incorporate embedded methods within the training.
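
For illustration, a minimal sketch of what such a select_features helper might look like; the argument names and defaults here are assumptions, not a final interface:

select_features <- function(data, min_var = NULL, min_mad = NULL) {
  keep <- rep(TRUE, ncol(data))
  if (!is.null(min_var)) keep <- keep & apply(data, 2, stats::var) > min_var  # variance filter
  if (!is.null(min_mad)) keep <- keep & apply(data, 2, stats::mad) > min_mad  # MAD filter
  data[, keep, drop = FALSE]  # drop features failing any requested filter
}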

AlineTalhouk commented 7 years ago

Yes, that would be reasonable; we are just going to use prepare_data for now. I would focus on implementing the embedded methods for now; I want to put these on the cluster over the weekend.

dchiu911 commented 7 years ago

It is going to take me some time to figure out how to incorporate embedded methods into the current framework.

AlineTalhouk commented 7 years ago

Here is what I was thinking, with lasso as an example: for each bootstrap sample (i.e. the training data resampled with replacement), you optimize a model using cross-validation. The "optimal" model is then tested on the OOB samples and reported.

AlineTalhouk commented 7 years ago

You can think of a stepwise procedure in a similar fashion as well. What is important is that the OOB is never used to decide on the features.

dchiu911 commented 7 years ago

> Here is what I was thinking, with lasso as an example: for each bootstrap sample (i.e. the training data resampled with replacement), you optimize a model using cross-validation. The "optimal" model is then tested on the OOB samples and reported.

The glmnet::cv.glmnet function currently implemented already embeds cross-validation into the LASSO path.
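
For example, the features the lasso keeps at the CV-chosen lambda are the embedded selection (a sketch on stand-in data):

library(glmnet)
x <- as.matrix(iris[, 1:4]); y <- iris$Species
cvfit <- cv.glmnet(x, y, family = "multinomial", alpha = 1)  # CV picks lambda along the lasso path
coefs <- coef(cvfit, s = "lambda.min")                       # one coefficient matrix per class
selected <- setdiff(
  unique(unlist(lapply(coefs, function(m) {
    m <- as.matrix(m)
    rownames(m)[m[, 1] != 0]   # nonzero coefficients = retained features
  }))),
  "(Intercept)"
)
selected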

dchiu911 commented 7 years ago

I've made some headway on a backwards selection algorithm; implementation pending.

AlineTalhouk commented 7 years ago

Thank you, let's discuss tomorrow. I had meetings all afternoon; I came by around 5 but you were already gone.

AlineTalhouk commented 7 years ago

@dchiu911 I will come see you when I get in, are you available? What's the progress status on model validation? I will run what we have on Monday, so today is the cutoff. Let me know if I can help.

dchiu911 commented 7 years ago

I am currently using caret::rfe to run a backwards selection algorithm for rf and lda. The only issue is that we have to specify a priori the candidate numbers of features to keep, so the function can test each subset size. It is currently set to seq_len(30), so we can think about what to do here.
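
For reference, the rf case is roughly the following (a sketch on stand-in data, not the splendid internals):

library(caret)
x <- iris[, 1:4]; y <- iris$Species
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)  # rf-based wrapper with 5-fold CV
mod <- rfe(x, y, sizes = seq_len(3), rfeControl = ctrl)  # sizes = candidate subset sizes, fixed a priori
predictors(mod)  # features retained at the best-performing subset size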

AlineTalhouk commented 7 years ago

Hi @dchiu911, I am profiling the code for the cluster. What is the status of the feature selection?

dchiu911 commented 7 years ago

Implemented for lda, qda, and rf; embedded in lasso and ridge.

AlineTalhouk commented 7 years ago

How do you call them? Or do they happen by default?

AlineTalhouk commented 7 years ago

And have you pushed tuning of svm out of the bootstrap pipeline?

dchiu911 commented 7 years ago

You call rfe = TRUE in splendid to turn on the feature selection; the default is FALSE. No, svm tuning currently operates on the bootstrap training sets.

AlineTalhouk commented 7 years ago

Profiling results FYI:

> res %>%
+ data.frame %>%
+ t() %>%
+ set_colnames(c("memory","time"))
         memory  time
lda        13.6  0.34
rf         23.7  2.72
multinom   39.2  0.96
nnet       40.0  1.02
knn        62.4  0.40
svm       934.4 12.98
pam       493.1  1.90
adaboost  675.9  2.54
xgboost     0.1  0.12
nb        585.8  1.28
lasso     382.6  8.02
ridge     676.1 14.26

AlineTalhouk commented 7 years ago

Profile with feature selection:

> res2 %>% 
+   data.frame %>% 
+   t() %>% 
+   set_colnames(c("memory","time"))
      memory  time
lda   2393.3  9.32
qda   2374.3  9.84
rf    2486.9  9.88
lasso 2494.8 10.22
ridge 2510.6 10.00

AlineTalhouk commented 7 years ago

@dchiu911 what is the prospect of doing feature selection with svm?

dchiu911 commented 7 years ago

Infeasible when tuning is incorporated.

AlineTalhouk commented 7 years ago

How about doing feature selection and then tuning in a separate step?

AlineTalhouk commented 7 years ago

Once the best subset is selected, that is.

dchiu911 commented 7 years ago

Yes, we can try that; it's similar to what we talked about last week.
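
Presumably something along these lines, i.e. select first, tune after (a sketch of the decoupled workflow on stand-in data, not the committed code):

library(caret)
x <- iris[, 1:4]; y <- iris$Species
# Step 1: choose the feature subset with RFE (cheap rf wrapper)
sel <- rfe(x, y, sizes = seq_len(3),
           rfeControl = rfeControl(functions = rfFuncs, method = "cv", number = 5))
# Step 2: tune the svm on the selected subset only
tuned <- train(x[, predictors(sel), drop = FALSE], y,
               method = "svmRadial", tuneLength = 3,
               trControl = trainControl(method = "cv", number = 5))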

AlineTalhouk commented 7 years ago

Sure, can you please let me know when it's done, and I will add it to the pipeline.

dchiu911 commented 7 years ago

Support for using RFE on SVM has been added in ead650b5261d59ecb53de1ba3f691ec53c3e2d1f, but it has certain limitations. Relevant code shown below:

mod <- suppressPackageStartupMessages(suppressWarnings(
  caret::rfe(data, class, sizes = sizes[sizes %% 5 == 0],  # only test subset sizes divisible by 5
             method = "svmRadial",
             rfeControl = caret::rfeControl(
               functions = caret::caretFuncs, method = "cv",
               number = 2))))  # 2-fold CV to keep run time down

It takes about 2.78 minutes to run the following:

data(hgsc)
class <- stringr::str_split_fixed(rownames(hgsc), "_", n = 2)[, 2]
mod <- classification(hgsc, class, algs = "svm", rfe = TRUE)

dchiu911 commented 7 years ago

In fact, when I run

data(hgsc)
class <- stringr::str_split_fixed(rownames(hgsc), "_", n = 2)[, 2]
sl_result1 <- splendid(hgsc, class, n = 2,
                       algorithms = c("lda", "knn", "svm"))

# With RFE feature selection
sl_result2 <- splendid(hgsc, class, n = 2,
                       algorithms = c("lda", "knn", "svm"), rfe = TRUE)

The sl_result1$evals output shows better performance for svm than sl_result2$evals does.

AlineTalhouk commented 7 years ago

Ugh, feature selection is too slow! Stuck on the cluster, taking forever.

AlineTalhouk commented 7 years ago

Hi @dchiu911, did you get around to increasing the step size of svmRfe? I would like to start running these this weekend.

dchiu911 commented 7 years ago

Oh right, just pushed a commit. It's better to open issues in the future; too many ideas flow in and out of in-person conversations 😀
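
For illustration only, "increasing the step size" amounts to testing a coarser grid of candidate subset sizes; the actual committed change may differ:

sizes <- seq_len(100)
sizes <- sizes[sizes %% 10 == 0]  # assumed example: test every 10th subset size instead of every 5th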