Bioconductor / Contributions

Contribute Packages to Bioconductor
134 stars 33 forks source link

BioMM: Biological-informed Multi-stage Machine learning framework for phenotype prediction using omics data #963

Closed Junfang closed 5 years ago

Junfang commented 5 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 5 years ago

Hi @Junfang

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: BioMM
Type: Package
Title: BioMM: Biological-informed Multi-stage Machine learning framework for phenotype prediction using omics data
Version: 0.99.0
Date: 2019-01-04 
Authors@R: c(person("Junfang", "Chen", role=c("aut", "cre", "cph"), email="junfang.che33@gmail.com"),
    person("Emanuel", "Schwarz", role=c("aut")))
Author: Junfang Chen and Emanuel Schwarz
Maintainer: Junfang Chen <junfang.chen33@gmail.com>
Description: The identification of reproducible biological patterns from high-dimensional omics data is a key factor in understanding the biology of complex disease or traits. Incorporating prior biological knowledge into machine learning is an important step in advancing such research. We have proposed a biologically informed multi-stage machine learing framework termed BioMM specifically for phenotype prediction based on omics-scale data where we can evaluate different machine learning models with various prior biological meta information. 
Imports:
    stats, utils, grDevices, lattice, parallel, doParallel, glmnet, rms, nsprcomp, ranger, e1071, variancePartition, ggplot2
Depends: R (>= 3.5.0)
Suggests: BiocStyle, knitr, RUnit, BiocGenerics
VignetteBuilder: knitr
biocViews: Genetics, Classification, Regression, Pathways, GO, Software
Encoding: UTF-8
LazyData: true  
License: GPL-3
RoxygenNote: 6.1.1

Add SSH keys to your GitHub account. SSH keys will are used to control access to accepted Bioconductor packages. See these instructions to add SSH keys to your GitHub account.

bioc-issue-bot commented 5 years ago

A reviewer has been assigned to your package. Learn what to expect during the review process.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

Junfang commented 5 years ago

Thanks for the report, "BioMM.Rproj" is removed now.

mtmorgan commented 5 years ago

Remember to increment the version number (to 0.99.1, 0.99.2, ...) to trigger automatic builds.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

27491bc incremented the version number to 0.99.1 to trig... cade68e Update NEWS

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

5a40ba1 updated R version dependency Depends: R (>= 3.5.2...

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

d55f619 Delete LICENSE 84e0d48 updated R version dependency to 3.6 e06338c Update NEWS

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

cc6139f Updated version

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

007fade updated example in BioMMreconData() d75be84 Update NEWS d430ff5 Update DESCRIPTION new version

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

nturaga commented 5 years ago

BioMM review

R CMD build

R CMD INSTALL

DESCRIPTION

NAMESPACE

ok

R

 omics2genelist <- function(data, featureAnno, restrictUp=500, restrictDown=5){
-
-    if ( colnames(data)[1] != "label" )
-    stop("The first column of the 'data' must be the 'label'!")
+    
+    if ( colnames(data)[1] != "label" ) {
+        stop("The first column of the 'data' must be the 'label'!")
+    }
     dataX <- data[,-1] 

http://bioconductor.org/developers/how-to/efficient-code/

-    for (i in seq_len(length(genes))) { 
+    for (i in seq_along(genes)) { 
-    for (i in seq_len(length(genes))) { 
-        # print(paste0('Gene: ', i))   
-        annoSub <- probeAnno[which(probeAnno[,"entrezID"] == genes[i]),]  
+    for (i in seq_along(genes)) { 
+        annoSub <- probeAnno[probeAnno[,"entrezID"] == genes[i],]  

or here, (which isn't needed), because it creates a logical vector which acts as your index.

    geneFilteredIndex <- which(geneSize >= restrictDown &
                                geneSize <= restrictUp) 

Evaluate through out your code where you use functions which are not needed.

http://bioconductor.org/developers/how-to/coding-style/

End-User messages

message() communicates diagnostic messages (e.g., progress during lengthy computations) during code evaluation.
warning() communicates unusual situations handled by your code.
stop() indicates an error condition.
cat() or print() are used only when displaying an object to the user, e.g., in a show method.

vignettes

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

6a5c66a Update NEWS 705a105 updated version

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

8dd0a8f increased number of cores for parallel computing ... e4f3c4c Update NEWS 025b2da Update DESCRIPTION

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

65e7829 updated version

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

c0d9819 Update NEWS 42951b1 Update DESCRIPTION updated version

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

Junfang commented 5 years ago

Dear @nturaga Thank you very much for your review! Please see the response below:

Reply to review

BioMM review

R CMD build

  • ok

    R CMD INSTALL

  • ok

    DESCRIPTION

  • If you have Authors@R section you don't need the Author and Maintainer section. This is redundant.

Authors@R section is removed.

  • Indentation matters even in the DESCRIPTION file.

Fixed.

NAMESPACE

ok

R

  • Please name your files appropriately. It is not convention to call them "1.dataProcess.R". Just call it "dataProcess.R" if needed, there is no need for numerical encoding. Do this for all your files.

Fixed.

  • Indent your code appropriately. Use formatR, and { where ever you have if statments. This makes your code easier to read for the reviewer.
 omics2genelist <- function(data, featureAnno, restrictUp=500, restrictDown=5){
-
-    if ( colnames(data)[1] != "label" )
-    stop("The first column of the 'data' must be the 'label'!")
+    
+    if ( colnames(data)[1] != "label" ) {
+        stop("The first column of the 'data' must be the 'label'!")
+    }
     dataX <- data[,-1] 
  • If you have a vector on a certain length, use seq_along

seq_len(length()) was changed to seq_along())

http://bioconductor.org/developers/how-to/efficient-code/

-    for (i in seq_len(length(genes))) { 
+    for (i in seq_along(genes)) { 
  • Remove commented out lines of code, unless it's part of documentation.

Done.

  • You don't need to use which here,
-    for (i in seq_len(length(genes))) { 
-        # print(paste0('Gene: ', i))   
-        annoSub <- probeAnno[which(probeAnno[,"entrezID"] == genes[i]),]  
+    for (i in seq_along(genes)) { 
+        annoSub <- probeAnno[probeAnno[,"entrezID"] == genes[i],]  

or here, (which isn't needed), because it creates a logical vector which acts as your index.

    geneFilteredIndex <- which(geneSize >= restrictDown &
                                geneSize <= restrictUp) 

Evaluate through out your code where you use functions which are not needed.

Removed ‘which’ wherever not needed.

  • To show end-user messages, please follow this (i.e don't use print())

All end-user messages were now shown by message() instead of print().

http://bioconductor.org/developers/how-to/coding-style/

End-User messages

message() communicates diagnostic messages (e.g., progress during lengthy computations) during code evaluation.
warning() communicates unusual situations handled by your code.
stop() indicates an error condition.
cat() or print() are used only when displaying an object to the user, e.g., in a show method.
  • Use BiocParallel::bplapply() instead of mclapply in your code base.

Done.

vignettes

  • Change installation method from install_github to BiocManager.

Done.

  • Please explain why there are so many unevaluated R chunks? (i.e eval=FALSE in the code?)

The only reason was to reduce the running time (a couple of hours required depending on the number of cores used for parallel computing).

nturaga commented 5 years ago

Update

R

You may also include the examples you have in a section, which say that these examples require significant compute resources.

The function should be ideally implemented with this interface in mind (from a user stand point).

param <- MulticoreParam(workers = 2)
pred_test <- predByCV(data, repeats, ....., BPPARAM = param)

The function could look like,

predByCV <- function(data, repeats, nfolds, FSmethod, cutP, fdr, FScore,
                     classifier, predMode, paramlist,
                     BPPPARAM = MulticoreParam())
{
    cvMat <- c()
    replist <- seq_len(repeats)
    for (reps in replist) {
        foldlists <- split(sample(nrow(data)), rep(seq_len(nfolds),
            length = nrow(data)))
        cvY <- rep(NA, nrow(data))
        cvEstimate <- bplapply(foldlists, function(fold) {
            trainData <- data[-fold, ]
            testData <- data[fold, ]
            predTest <- predByFS(trainData, testData, FSmethod, cutP, fdr,
                FScore, classifier, predMode, paramlist)
        }, BPPARAM = BPPARAM)
        cvY[unlist(foldlists)] <- unlist(cvEstimate)
        cvMat <- cbind(cvMat, cvY)
    }

Please make similar changes wherever applicable in other places you've used BiocParallel.

foo <- function(x = 2) {
    x* 2
}
Junfang commented 5 years ago

Dear @nturaga Thanks a lot for your review. I just have a question regarding the usage of 'MulticoreParam' on different platforms. Do I have to indicate the platform type in the examples of the documentation (.Rd files), also in the vignette? so that my package could pass all three Bioconductor platforms. As far as I have tested on windows, 'MulticoreParam' could not dispatch to 'Serial'.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

97092e6 replaced the usage of MulticoreParam() the choice... f86ca36 added library(BiocParallel) to the example 74c7c31 adapted the usage of MulticoreParam() e584e0c adapted figures f44334c Update NEWS 4e5bfe0 updated version

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

Junfang commented 5 years ago

Dear @nturaga Many thanks again, please see my reply below.

Update

R

  • Not all print statements are removed. Please check resultReport.R. There are still a few print statements left behind. Another piece of code which is unusual, what is the functionality of print(plot(...)). This is unnecessary.

Now 'print' statements are removed, except that print(plot(...)) is still used. The reason is that I have to call print() explicitly to draw a plot inside a function when using ggplot2 plot. See also https://www.rdocumentation.org/packages/ggplot2/versions/3.1.0/topics/print.ggplot

  • There are still some which statements you do not need. which creates a numeric index, which is not needed when subsetting a vector.

Unnecessary 'which' statements were now removed. However, 'which' statements in exploreStat.R are not removed since its derived numeric index will be used for computing the overlapping numeric index from another vector. e.g. See line 126, line 142 in exploreStat.R.

  • Please use a smaller example in the vignette using some mock data or example data, to showcase the functionality. The vignette should comprise of test use-cases which the user can successfully run and evaluate.

You may also include the examples you have in a section, which say that these examples require significant compute resources.

Thank you for your suggestion. The vignette is now updated using smaller datasets and models. In addition, a subsection was included to describe the computational consideration.

  • As far as bplapply goes, you can consider also giving the user the choice of BPPARAM to use, if the function can be parallelized. BIocParallel also does the platform type checking for you already, you do not need to check if(.Platform$OS.type to choose between windows and Linux based platforms.

The function should be ideally implemented with this interface in mind (from a user stand point).

param <- MulticoreParam(workers = 2)
pred_test <- predByCV(data, repeats, ....., BPPARAM = param)

The function could look like,

predByCV <- function(data, repeats, nfolds, FSmethod, cutP, fdr, FScore,
                     classifier, predMode, paramlist,
                     BPPPARAM = MulticoreParam())
{
    cvMat <- c()
    replist <- seq_len(repeats)
    for (reps in replist) {
        foldlists <- split(sample(nrow(data)), rep(seq_len(nfolds),
            length = nrow(data)))
        cvY <- rep(NA, nrow(data))
        cvEstimate <- bplapply(foldlists, function(fold) {
            trainData <- data[-fold, ]
            testData <- data[fold, ]
            predTest <- predByFS(trainData, testData, FSmethod, cutP, fdr,
                FScore, classifier, predMode, paramlist)
        }, BPPARAM = BPPARAM)
        cvY[unlist(foldlists)] <- unlist(cvEstimate)
        cvMat <- cbind(cvMat, cvY)
    }

Please make similar changes wherever applicable in other places you have used BiocParallel.

I have now adapted the usage of MulticoreParam() in all related functions.

  • If you are worried about efficiency of the package, then you can try to eliminate return calls at the end of your functions. R automatically returns as long as you declare the variable at the end of the function.
foo <- function(x = 2) {
    x* 2
}

Many thanks for the hint. I updated some functions accordingly.

nturaga commented 5 years ago

Update

> result <- BioMM(trainData=studyDataSub, testData=NULL,
+                 stratify="chromosome", pathlistDB, featureAnno,
+                 restrictUp=10, restrictDown=200, minPathSize=10,
+                 supervisedStage1, typePCA="regular",
+                 resample1="BS", resample2="CV", dataMode="allTrain",
+                 repeatA1=20, repeatA2=1, repeatB1=20, repeatB2=1,
+                 nfolds=10, FSmethod1=NULL, FSmethod2=NULL,
+                 cutP1=0.1, cutP2=0.1, fdr1=NULL, fdr2=NULL, FScore=param1,
+                 classifier1, classifier2, predMode1, predMode2,
+                 paramlist1, paramlist2, innerCore=param2,
+                 outFileA2=NULL, outFileB2=NULL)
[1] "Chromosomes: "
 [1] "1"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2"  "20" "21" "22" "3"  "4"  "5"  "6"
[20] "7"  "8"  "9"
[1] "Summary stat of # probes in each Chr: "
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  27.00   64.00   86.00   90.86  118.00  216.00
Error in as.integer(mc.cores) :
  cannot coerce type 'S4' to vector of type 'integer'

Another way of looking at this,


Quitting from lines 163-184 (BioMMtutorial.Rmd)
Error in as.integer(mc.cores) :
  cannot coerce type 'S4' to vector of type 'integer'

Update the version number so that the build runs after you fix the error.

Junfang commented 5 years ago

Dear @nturaga Thanks for your quick update. I tried to replicate this issue on several different machines (windows, linux and mac) but I did not get any error. I was wondering if you would have any problem when running the following example?

## Load data  
methylfile <- system.file('extdata', 'methylData.rds', package='BioMM')  
methylData <- readRDS(methylfile)  
dataY <- methylData[,1]
## select a subset of genome-wide methylation data at random
methylSub <- data.frame(label=dataY, methylData[,c(2:2001)])  
trainIndex <- sample(nrow(methylSub), 30)
trainData = methylSub[trainIndex,]
testData = methylSub[-trainIndex,]
library(ranger) 
library(BiocParallel)
param1 <- MulticoreParam(workers = 1)
param2 <- MulticoreParam(workers = 20)
predY <- predByBS(trainData, testData, 
                  dataMode='allTrain', repeats=50,
                  FSmethod=NULL, cutP=0.1, 
                  fdr=NULL, FScore=param1, 
                  classifier='randForest',
                  predMode='classification', 
                  paramlist=list(ntree=300, nthreads=10),
                  innerCore=param2)  
testY <- testData[,1]
accuracy <- classifiACC(dataY=testY, predY=predY)
print(accuracy)  
bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

c097a39 Updated version check if any error in the vignett...

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Your package has been accepted. It will be added to the Bioconductor Git repository and nightly builds. Additional information will be posed to this issue in the next several days.

Thank you for contributing to Bioconductor!

Junfang commented 5 years ago

Great, thank you so much for the time and effort taken to review our package!

mtmorgan commented 5 years ago

The master branch of your GitHub repository has been added to Bioconductor's git repository.

To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (https://github.com/Junfang.keys is not empty), then no further steps are required. Otherwise, do the following:

  1. Add an SSH key to your github account
  2. Submit your SSH key to Bioconductor

See further instructions at

https://bioconductor.org/developers/how-to/git/

for working with this repository. See especially

https://bioconductor.org/developers/how-to/git/new-package-workflow/ https://bioconductor.org/developers/how-to/git/sync-existing-repositories/

to keep your GitHub and Bioconductor repositories in sync.

Your package will be included in the next nigthly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at

https://bioconductor.org/checkResults/

(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, you package will be available for download in the 'Devel' version of Bioconductor using BiocManager::install("BioMM"). The package 'landing page' will be created at

https://bioconductor.org/packages/BioMM

If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.