Sure, what time are you available to chat?
I will come see you when I come in
Meet me at 2pm in the 14th floor meeting room?
A reference that may be helpful: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3287491/
And another, copied over from the other issue: https://academic.oup.com/bioinformatics/article/23/19/2507/185254/A-review-of-feature-selection-techniques-in
@dchiu911, can you please implement the model selection portion? I will continue with the ensemble creation, etc. I am building a pipeline right now; we can go back and fix up the code after May 11.
Which specific methods do you want me to implement?
We can maybe start by doing the typical stuff, e.g. the regularisation methods of glmnet? What do you think?
So you want a filter step (e.g. min var > 1) before the bootstrap data generation, and then within each bootstrap replicate, add some feature selection (FS) methods before classification, or use classification algorithms that have FS embedded?
Hmm, we were planning on filtering outside this data set. However, I guess having a filtering step option could be good overall. I would not worry about it right now; it could be an improvement later.
I am thinking of having a filtering function select_features outside of the training, with several options (min var, min MAD, etc.), and then figuring out how to incorporate embedded methods within the training.
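A minimal sketch of what such a filter could look like (select_features is only proposed here, so the signature and defaults are assumptions):

# Sketch only: pre-training filter that keeps features passing
# variance and/or MAD cutoffs; arguments are illustrative, not final
select_features <- function(data, min_var = NULL, min_mad = NULL) {
  keep <- rep(TRUE, ncol(data))
  if (!is.null(min_var)) keep <- keep & apply(data, 2, var) > min_var
  if (!is.null(min_mad)) keep <- keep & apply(data, 2, mad) > min_mad
  data[, keep, drop = FALSE]
}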
Yes, that would be reasonable; we are just going to use prepare_data for now. I would focus on implementing the embedded methods first; I want to put these on the cluster over the weekend.
It is going to take me some time to figure out how to incorporate embedded methods into the current framework.
Here is what I was thinking, with the lasso as an example: for each bootstrap sample (i.e. training data resampled with replacement), you optimize a model using cross-validation. The "optimal" model is then tested on the out-of-bag (OOB) samples and reported. You can set up a stepwise procedure in a similar fashion. What is important is that the OOB is never used to decide on the features.
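A minimal sketch of that scheme for the lasso case (assuming glmnet and a multiclass outcome; x is the predictor matrix and y the class labels):

# Sketch: one bootstrap replicate of the scheme described above.
# Lambda (and hence the features) is chosen by CV on the bootstrap
# sample only; the out-of-bag rows are used purely for evaluation.
library(glmnet)
boot  <- sample(nrow(x), replace = TRUE)  # training: resampled with replacement
oob   <- setdiff(seq_len(nrow(x)), boot)  # OOB: never used for feature selection
cvfit <- cv.glmnet(x[boot, ], y[boot], family = "multinomial")
pred  <- predict(cvfit, x[oob, ], s = "lambda.min", type = "class")
mean(pred == y[oob])                      # OOB accuracy, reported per replicate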
The glmnet::cv.glmnet function currently implemented already embeds cross-validation into the LASSO path.
I've made some headway on a backwards selection algorithm; implementation pending.
Thank you, let's discuss tomorrow. I had meetings all afternoon; I came by around 5 but you were already gone.
@dchiu911 I will come see you when I get in; are you available? What's the progress status on model validation? I will run what we have on Monday, so today is the cutoff. Let me know if I can help.
I am using caret::rfe to run a backwards selection algorithm for rf and lda currently. The only issue is that we have to specify the final number of features we want to keep a priori, so the function will test each subset. It is currently set to seq_len(30), so we can think about what to do here.
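For reference, such a call might look like this (a sketch with assumed x/y objects, not the exact splendid internals):

# Sketch: backwards selection with caret::rfe for a random forest,
# testing the subset sizes 1 through 30 mentioned above
library(caret)
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
sel  <- rfe(x, y, sizes = seq_len(30), rfeControl = ctrl)
predictors(sel)  # the chosen feature subset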
Hi @dchiu911, I am profiling the code for the cluster; what is the status of the feature selection?
Implemented for lda, qda, and rf; embedded in lasso and ridge.
How do you call them, or do they happen by default? And have you pushed tuning of svm out of the bootstrap pipeline?
You call rfe = TRUE in splendid to turn on the feature selection; the default is FALSE. No, svm tuning currently operates on bootstrap training sets.
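For example (matching the call signature shown later in this thread):

# Feature selection is off by default; rfe = TRUE turns it on
sl <- splendid(hgsc, class, n = 2, algorithms = "lda", rfe = TRUE)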
Profiling results FYI:
> res %>%
+ data.frame %>%
+ t() %>%
+ set_colnames(c("memory","time"))
memory time
lda 13.6 0.34
rf 23.7 2.72
multinom 39.2 0.96
nnet 40.0 1.02
knn 62.4 0.40
svm 934.4 12.98
pam 493.1 1.90
adaboost 675.9 2.54
xgboost 0.1 0.12
nb 585.8 1.28
lasso 382.6 8.02
ridge 676.1 14.26
Profile with feature selection:
> res2 %>%
+ data.frame %>%
+ t() %>%
+ set_colnames(c("memory","time"))
memory time
lda 2393.3 9.32
qda 2374.3 9.84
rf 2486.9 9.88
lasso 2494.8 10.22
ridge 2510.6 10.00
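The harness that produced these tables is not shown; one rough way to collect comparable memory/time figures (an assumption, using the bench package) would be:

# Assumed harness, not the actual one: a single run per algorithm,
# recording allocated memory (MB) and elapsed time (s)
library(bench)
res <- sapply(c("lda", "rf", "svm"), function(alg) {
  b <- bench::mark(classification(hgsc, class, algs = alg),
                   iterations = 1, check = FALSE)
  c(memory = as.numeric(b$mem_alloc) / 2^20,
    time   = as.numeric(b$total_time))
})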
@dchiu911 what is the prospect of doing feature selection with svm?
It is infeasible when tuning is incorporated.
How about doing feature selection first, then tuning in a separate step once the best subset is selected?
Yes, we can try that; it's similar to what we talked about last week.
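A sketch of that two-step idea (assumed object names; x/y are the predictors and class labels, and the sizes/grids are illustrative):

# Step 1: RFE with an untuned base learner picks the feature subset;
# Step 2: svm hyperparameters are tuned only on the selected features
library(caret)
library(e1071)
ctrl  <- rfeControl(functions = rfFuncs, method = "cv", number = 2)
sel   <- rfe(x, y, sizes = c(5, 10, 20), rfeControl = ctrl)
keep  <- predictors(sel)
tuned <- tune.svm(x[, keep, drop = FALSE], y,
                  gamma = 10^(-3:-1), cost = 10^(0:2))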
Sure. Can you please let me know when it's done, and I will add it to the pipeline.
Support for using RFE on SVM has been added in ead650b5261d59ecb53de1ba3f691ec53c3e2d1f, but it has certain limitations. The relevant code is shown below:
# RFE on a radial-kernel SVM: subset sizes restricted to multiples of 5
# and 2-fold CV to keep the run time manageable
mod <- suppressPackageStartupMessages(suppressWarnings(
  caret::rfe(data, class, sizes = sizes[sizes %% 5 == 0],
             method = "svmRadial",
             rfeControl = caret::rfeControl(
               functions = caret::caretFuncs, method = "cv",
               number = 2))))
It takes about 2.78 minutes to run the following:
data(hgsc)
class <- stringr::str_split_fixed(rownames(hgsc), "_", n = 2)[, 2]
mod <- classification(hgsc, class, algs = "svm", rfe = TRUE)
The limitations are:
- Only subset sizes that are multiples of 5 are kept from sizes. For example, if originally sizes is 54, then I only choose 5, 10, ..., 50.
- The number of cv folds is 2 (instead of the default of 10).
- We can experiment with the method parameter to see if any other ones have run time gains.
- Using the full range of sizes (e.g. 1 to 54) might not improve the model. The RFE algorithm might actually choose all variables as the optimal model (e.g. 321), and hence we end up getting the same predictions as not using feature selection.
In fact, when I run
data(hgsc)
class <- stringr::str_split_fixed(rownames(hgsc), "_", n = 2)[, 2]
# Without RFE feature selection
sl_result1 <- splendid(hgsc, class, n = 2,
                       algorithms = c("lda", "knn", "svm"))
# With RFE feature selection
sl_result2 <- splendid(hgsc, class, n = 2,
                       algorithms = c("lda", "knn", "svm"), rfe = TRUE)
The sl_result1$evals shows better performance for svm than sl_result2$evals.
Ugh, feature selection is too slow! Stuck on the cluster, taking forever.
Hi @dchiu911, did you get around to increasing the step size of svmRfe? I would like to start running those this weekend.
Oh right, just pushed a commit. It's better to open issues in the future; there are too many ideas that flow in and out of in-person conversations 😀
Hi Derek,
How are you doing on the feature selection implementation? Shall we chat at some point tomorrow or Monday? I am testing splendid right now.