enriquea / feseR

feseR: Combining feature selection methods for analyzing omics data
https://github.com/enriquea/feseR
GNU General Public License v2.0

CV in FeseR combineFS #6

Closed: ravichas closed this issue 4 years ago

ravichas commented 5 years ago

Hello FeseR team:

In the combineFS pipeline, why are the mincorr and maxcorr filter methods placed outside of the CV loop? I think they should also be inside the CV loop.

Doesn't using the entire dataset (features + outcome) for screening (with univariate and multivariate methods) to identify good(?) predictors, and then using the resulting smaller dataset as input for the final FS with a wrapper method/CV, lead to overfitting (or bias)?

In the R code (https://github.com/enriquea/feseR/blob/master/R/fs_functions.R) there is a comment that says, "In addition, it is possible to set up an external loop which operates over randomized and class-balanced test data. The process returns information from both training and testing phases." Can you explain this and show some pseudo-code for this setup?

Thanks

Ravi

enriquea commented 5 years ago

Hi @ravichas,

Thanks for your comments. Below are some thoughts.

In the combineFS pipeline, why are the mincorr and maxcorr filter methods placed outside of the CV loop? I think they should also be inside the CV loop.

The idea here is to compute/apply these filters before splitting the data into the test and training sets. That way, you keep the same feature structure in the last phase of the algorithm. We tried to keep the framework as flexible as possible, and what you propose becomes even more complex when combining PCA+RFE-RF.
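Roughly, the current flow looks like this (a sketch only; apart from filter.corr, the steps are paraphrased rather than the actual code):

# univariate and multivariate filters are applied once, on the full matrix
features <- filter.corr(features, class, mincorr = mincorr)   # univariate correlation filter
# ... multivariate step (max pairwise correlation or PCA) on the same full matrix ...
# only afterwards does the CV loop split `features` into training/test and run the RFE-RF wrapper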

Doesn't using the entire dataset (features + outcome) for screening (with univariate and multivariate methods) to identify good(?) predictors, and then using the resulting smaller dataset as input for the final FS with a wrapper method/CV, lead to overfitting (or bias)?

I disagree here. I find it valid to remove redundant features first (prefiltering) and then do the full screening on the reduced subset. In my opinion, this approach boosts performance (it improves accuracy and makes the workflow computationally efficient). For example, gene expression datasets harbour many zero (missing?) expression values across all samples. In this context, it can be useful (and necessary) to apply this filtering scheme. (Example here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4340279/)

In the R code (https://github.com/enriquea/feseR/blob/master/R/fs_functions.R) there is a comment that says, "In addition, it is possible to set up an external loop which operates over randomized and class-balanced test data. The process returns information from both training and testing phases." Can you explain this and show some pseudo-code for this setup?

This means that the function allows you to perform both nested CV (outer loop + inner loop) and flat CV (inner loop only). Nested CV has been discussed before for model selection (see for example http://www.jmlr.org/papers/volume11/cawley10a/cawley10a.pdf), but for most applications it shows performance similar to the "typical" CV approach (here, the inner loop alone; https://arxiv.org/abs/1809.09446). You can disable nested CV by setting extfolds = 1.

Figure 2 in this supplementary information shows such an implementation: https://www.researchgate.net/publication/286919032_Supp_Information_S1.
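In pseudo-code, the setup looks roughly like this (a sketch only, not the exact implementation in fs_functions.R; it assumes the same objects/arguments used by combineFS, i.e. features, class, extfolds, number.cv and group.sizes, and calls caret directly):

for (i in seq_len(extfolds)) {                  # outer loop; extfolds = 1 gives flat CV
    # randomized, class-balanced split into training and test partitions
    idx <- caret::createDataPartition(class, p = 0.75, list = FALSE)
    x.train <- features[idx, ];  y.train <- class[idx]
    x.test  <- features[-idx, ]; y.test  <- class[-idx]

    # inner loop: RFE-RF with k-fold CV on the training partition only
    ctrl <- caret::rfeControl(functions = caret::rfFuncs, method = "cv", number = number.cv)
    fit  <- caret::rfe(x.train, y.train, sizes = group.sizes, rfeControl = ctrl)

    # training metrics come from `fit`; the selected subset is then evaluated
    # on the held-out test partition (compare test.pred against y.test)
    test.pred <- predict(fit, x.test)
}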


Best wishes,

Enrique

ravichas commented 5 years ago

Enrique:

I respectfully disagree with your comments on pre-filtering and CV. Shown below is a relevant section from the classic and widely used book The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Drs. Trevor Hastie, Robert Tibshirani and Jerome Friedman, leaders in this area.

(The sections shown below appear on page 264 of the book; 2nd edition, corrected 12th printing, Jan 13, 2017.)

7.10.2 The Wrong and Right Way to Do Cross-validation

Consider a classification problem with a large number of predictors, as may arise, for example, in genomic or proteomic applications. A typical strategy for analysis might be as follows:

  1. Screen the predictors: find a subset of “good” predictors that show fairly strong (univariate) correlation with the class labels
  2. Using just this subset of predictors, build a multivariate classifier.
  3. Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model.

Is this a correct application of cross-validation? Consider a scenario with N = 50 samples in two equal-sized classes, and p = 5000 quantitative predictors (standard Gaussian) that are independent of the class labels. The true (test) error rate of any classifier is 50%. We carried out the above recipe, choosing in step (1) the 100 predictors having highest correlation with the class labels, and then using a 1-nearest neighbor classifier, based on just these 100 predictors, in step (2). Over 50 simulations from this setting, the average CV error rate was 3%. This is far lower than the true error rate of 50%.

What has happened? The problem is that the predictors have an unfair advantage, as they were chosen in step (1) on the basis of all of the samples. Leaving samples out after the variables have been selected does not correctly mimic the application of the classifier to a completely independent test set, since these predictors “have already seen” the left out samples.
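For concreteness, here is a small simulation in the spirit of the book's example (my own sketch, not the book's code; p is reduced from 5000 to 1000 to keep it fast):

set.seed(1)
N <- 50; p <- 1000
x <- matrix(rnorm(N * p), N, p)               # predictors independent of the class labels
y <- factor(rep(c("A", "B"), each = N / 2))   # two balanced classes; true error rate = 50%

# Wrong way: screen on ALL samples, then cross-validate only the classifier
scores <- abs(cor(x, as.numeric(y)))
keep   <- order(scores, decreasing = TRUE)[1:100]
mean(class::knn.cv(x[, keep], y, k = 1) != y)    # optimistically low error

# Right way: redo the screening inside every CV fold
folds <- sample(rep(1:5, length.out = N))
err <- sapply(1:5, function(f) {
    tr <- folds != f
    sc <- abs(cor(x[tr, ], as.numeric(y[tr])))
    kp <- order(sc, decreasing = TRUE)[1:100]
    mean(class::knn(x[tr, kp], x[!tr, kp], y[tr], k = 1) != y[!tr])
})
mean(err)                                        # close to the true 50%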

There is a video (https://www.youtube.com/watch?v=S06JpVoNaA0) of the book authors (Drs. Trevor Hastie and Robert Tibshirani) explaining why this leads to bias.

You have been prompt in responding to my questions about the feseR package. Thank you for your time.

I hope you will take my comments in a positive way.

Cheers, Ravi


enriquea commented 5 years ago

Hi @ravichas ,

I see your point and you are right. I've read the references, and it could indeed be problematic to apply pre-filtering steps that involve the response (e.g. univariate correlation, information gain). However, I don't see why it would be problematic for multivariate filters (those which involve only the predictors themselves), but we could integrate that as well.

It would be great if you could contribute a pull request with these changes, so we could test the performance on the example datasets included in the package.

Again, thanks for your valuable comments!

Enrique

ravichas commented 5 years ago

Enrique

Thank you for your prompt response. I agree that multivariate filtering doesn't involve the outcome variable, but we still don't want to apply any filtering to the whole dataset. After splitting and setting aside the test set, we can do anything with the remaining training set. The way we split can impact the filtering; that is why both filtering steps should be inside the resampling/CV loop. We cannot pretend, in my humble opinion, that the data after filtering is our data. What do you think?

In the last few weeks, I have been experimenting with your feseR code by modifying (just commenting out a few lines of code) your combineFS function to skip both the univariate and multivariate filters. The results (before and after filtering) for a proprietary dataset are different. The difference should not be a surprise, and I want to emphasize that this is not a comprehensive test. As Max Kuhn and Kjell Johnson describe in their recent book, Feature Engineering and Selection, identifying the global solution to a feature selection (FS)/reduction problem is difficult. I believe FS is an important topic for omics data modeling, and setting up an FS pipeline is not easy. You and your team have done a good job with feseR. I would love to contribute towards the testing.

Thank you for your time

Ravi

ravichas commented 5 years ago

Enrique

One quick thing we can try is to let users pass a "none" flag for both filters. Starting from your original code (part of it shown below), you can add a condition for (univariate == "none") that simply skips the filter and lets the pipeline continue rather than stopping with an error; a sketch of the modified chain follows the original snippet below.

If you prefer, you can do the same thing for the multivariate condition.

if (univariate == "corr") {
    features <- filter.corr(features, class, mincorr = mincorr)
} else if (univariate == "gain") {
    features <- filter.gain.inf(features, class, zero.gain.out = zero.gain.out)
} else {
    stop("Undefined univariate filter...")
}
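With the suggested change, the chain would look something like this (same filter calls as in the snippet above, plus the new branch):

if (univariate == "corr") {
    features <- filter.corr(features, class, mincorr = mincorr)
} else if (univariate == "gain") {
    features <- filter.gain.inf(features, class, zero.gain.out = zero.gain.out)
} else if (univariate == "none") {
    # skip the univariate filter and keep the full feature matrix
} else {
    stop("Undefined univariate filter...")
}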

I have tried this with my dataset with no apparent issues. I am using the parallel options shown below.

suppressMessages({
    library(feseR)
    library(foreach)
    library(parallel)
    library(doParallel)
})

ncpus <- future::availableCores()
cl <- makeCluster(ncpus)
registerDoParallel(cl)
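And once the run finishes, the workers can be released:

stopCluster(cl)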

I will post the CPU time and memory that my job uses soon.

Thanks, Ravi

ravichas commented 5 years ago

Enrique:

I used a custom dataset with dimensions 414 x 28357.

I ran the following simulation with no filtering, using the parameters shown in the call below.

I ran it on an NIH HPC system with the following hardware: Intel E5-2680v4, hyperthreading enabled, 256 GB memory. I asked for one node and ran the job with 20 threads. The job was not memory intensive; the elapsed time was 59.40 min.

resultsCORR <- mcombineFS(features = features,
                          class = class,
                          univariate = 'none',
                          mincorr = 0.20,
                          multivariate = 'none',
                          maxcorr = 0.80,
                          wrapper = 'rfe.rf',
                          number.cv = 10,
                          group.sizes = seq(1, 150, 1),
                          extfolds = 20,
                          verbose = TRUE)

length(resultsCORR$opt.variables)
[1] 69

If you can push a new version with the "none" filter option, I can test it on the datasets you used in the paper. Let me know.

Ravi

enriquea commented 5 years ago

Thanks for the update. I will commit the changes soon.

This number looks great, I think. What about the accuracy? Are the selected optimal predictors the ones you expected?

Best,

Enrique

ravichas commented 5 years ago

Enrique

Personally, I would use ROC as the metric to measure performance. ROC is not perfect, but it is better than accuracy. What do you think?
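For example, with the held-out predictions one could compute it along these lines (a sketch using the pROC package; y.test and prob.test are placeholders for the test labels and the predicted probability of the positive class):

library(pROC)
roc.obj <- roc(response = y.test, predictor = prob.test)   # prob.test: predicted class probabilities
auc(roc.obj)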

Ravi

enriquea commented 5 years ago

Hi @ravichas ,

I've included the changes we discussed. It would be great if you could test them.

Thanks!

PS: PCA will require extra work.

ravichas commented 5 years ago

Not a problem.