some improvements - Githubissues

enriquea commented 5 years ago

[x] Add function to remove features with high missingness rate.
[x] Add some basic imputation method (e.g. k-means).
[ ] Make it easier to extract the top features contributing to the principal components that explain most of the variance in the dataset.
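For the last item, a minimal sketch of one possible approach, using only base R's prcomp(). This is not an existing feseR function: the helper name top_pca_features, its arguments, and the scoring rule (squared loadings weighted by the variance each component explains) are all assumptions made for illustration.

```r
# Sketch only: base R, not a feseR API. `top_pca_features` is a hypothetical helper.
top_pca_features <- function(x, var_explained = 0.9, n_top = 5) {
  pca <- prcomp(x, center = TRUE, scale. = TRUE)

  # Proportion of variance explained by each principal component
  prop_var <- pca$sdev^2 / sum(pca$sdev^2)

  # Smallest number of leading components whose cumulative variance reaches
  # the requested threshold (falls back to all components if none does)
  n_pc <- min(which(cumsum(prop_var) >= var_explained), length(prop_var))

  # Score each feature: squared loadings on the retained components,
  # weighted by the variance each of those components explains
  loadings <- pca$rotation[, seq_len(n_pc), drop = FALSE]
  score <- as.vector(loadings^2 %*% prop_var[seq_len(n_pc)])
  names(score) <- rownames(loadings)

  # Return the n_top highest-scoring features, best first
  head(sort(score, decreasing = TRUE), n_top)
}

# Example on a built-in dataset (four numeric features)
top <- top_pca_features(iris[, 1:4], var_explained = 0.9, n_top = 2)
print(top)
```

The weighting is a design choice: it keeps a feature that loads heavily on a minor component from outranking one that drives the first component. Other rankings (for example, the maximum absolute loading per feature) would be equally defensible.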

ravichas commented 5 years ago

Enriquea and team:

(These are questions, not issues. I hope it is OK to post them here; if not, please let me know and I will send an email instead. Thanks.)

Great software, thanks for sharing. Enriquea, thanks for answering my earlier email questions. I am using the latest feseR and related software (sessionInfo shown below), and I have a couple of queries.

1. In the vignette, https://github.com/enriquea/feseR/blob/master/vignettes/feser.pdf, Table 2 reports the classification metrics for 20 class-balanced and randomized runs. Can you please comment on the creation of balanced (up/down/mixed/ROSE?) datasets?
2. For parallel runs, I am not sure how to pass the "allowParallel = TRUE" or equivalent options through your
3. A procedure for extracting the top-n features (I see this as the last item in your extra improvements list, thanks)

Cheers Ravi

sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/local/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin/libmkl_rt.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] feseR_0.2.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1         pillar_1.4.0       compiler_3.5.2     gower_0.2.0
 [5] plyr_1.8.4         tools_3.5.2        iterators_1.0.10   class_7.3-15
 [9] rpart_4.1-15       ipred_0.9-9        lubridate_1.7.4    tibble_2.1.1
[13] nlme_3.1-139       gtable_0.3.0       lattice_0.20-38    pkgconfig_2.0.2
[17] rlang_0.3.4        Matrix_1.2-17      foreach_1.4.4      prodlim_2018.04.18
[21] withr_2.1.2        stringr_1.4.0      dplyr_0.8.0.1      generics_0.0.2
[25] recipes_0.1.5      stats4_3.5.2       grid_3.5.2         caret_6.0-84
[29] nnet_7.3-12        tidyselect_0.2.5   data.table_1.12.2  glue_1.3.1
[33] R6_2.4.0           survival_2.44-1.1  lava_1.6.5         reshape2_1.4.3
[37] ggplot2_3.1.1      purrr_0.3.2        magrittr_1.5       ModelMetrics_1.2.2
[41] scales_1.0.0       codetools_0.2-16   MASS_7.3-51.4      splines_3.5.2
[45] assertthat_0.2.1   timeDate_3043.102  colorspace_1.4-1   stringi_1.4.3
[49] lazyeval_0.2.2     munsell_0.5.0      crayon_1.3.4

Can you explain how the class balancing was carried out (ROSE, up/down/mixed sampling)? Do any of the feseR protocols include this option?

drychkov commented 5 years ago

@ravichas I can answer the first two questions.

1. As I understand it, "class-balanced" in the vignette means that the class ratios are preserved in each run. The training/testing split was made with createDataPartition() from the caret package. As for balancing the classes themselves, it is usually not advisable to create balance artificially for feature selection procedures. It is better to benchmark with metrics suited to imbalanced data, such as Cohen's kappa or AUC.
2. The combineFS() function contains foreach() %dopar% {} with allowParallel = TRUE passed to the caret::rfe() function, so the usual

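library(doSNOW)
cl <- parallel::makeCluster(coreNums)  # coreNums: number of worker processes to start
registerDoSNOW(cl)                     # register the cluster as the foreach backend

combineFS( ... )

parallel::stopCluster(cl)              # release the workers when done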

(or similar) will work here.

Surely, BiocParallel's implementation would be better here, but it's not done yet.
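To make the first point concrete, here is a minimal base-R sketch of the two ideas: a split that keeps class ratios, and Cohen's kappa as a metric that does not reward majority-class guessing. Neither helper comes from feseR or caret; stratified_index and cohen_kappa are hypothetical names, the stratified split only illustrates what a ratio-preserving partition does, and the labels are made-up toy data.

```r
# Sketch only: base R stand-ins, not feseR or caret code.
set.seed(1)

# Stratified split: sample the same fraction within each class,
# so the training set keeps the original class ratios.
stratified_index <- function(y, p = 0.75) {
  idx <- lapply(split(seq_along(y), y), function(i) {
    i[sample.int(length(i), size = ceiling(p * length(i)))]
  })
  sort(unlist(idx, use.names = FALSE))
}

# Cohen's kappa: observed agreement corrected for the agreement
# expected from the class frequencies alone.
cohen_kappa <- function(truth, pred) {
  lv <- union(levels(factor(truth)), levels(factor(pred)))
  tab <- table(factor(truth, lv), factor(pred, lv))
  n <- sum(tab)
  p_obs <- sum(diag(tab)) / n
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2
  (p_obs - p_exp) / (1 - p_exp)
}

# Imbalanced toy labels: 90 controls, 10 cases (made-up data)
y <- factor(rep(c("control", "case"), times = c(90, 10)))
train <- stratified_index(y, p = 0.8)
print(table(y[train]))

# A "classifier" that always predicts the majority class
pred <- factor(rep("control", length(y)), levels = levels(y))
accuracy <- mean(pred == y)
kap <- cohen_kappa(y, pred)
print(c(accuracy = accuracy, kappa = kap))
```

On these toy labels the always-"control" predictor reaches an accuracy of 0.9 but a kappa of 0, which is why kappa (or AUC) is the safer benchmark when the classes are left imbalanced.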

ravichas commented 5 years ago

@drychkov

Thanks very much for the detailed explanations.

Ravi

ravichas commented 5 years ago

@enriquea

Make easier to extract the top features contributing to the principal components explaining most of the variance in the dataset.

I am sure you are busy. I am just curious about this enhancement. Do we expect this enhancement soon? :)

Cheers Ravi

enriquea / feseR

some improvements #3