Closed Junfang closed 5 years ago
Hi @Junfang
Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.
The DESCRIPTION file for this package is:
Package: BioMM
Type: Package
Title: BioMM: Biological-informed Multi-stage Machine learning framework for phenotype prediction using omics data
Version: 0.99.0
Date: 2019-01-04
Authors@R: c(person("Junfang", "Chen", role=c("aut", "cre", "cph"), email="junfang.che33@gmail.com"),
person("Emanuel", "Schwarz", role=c("aut")))
Author: Junfang Chen and Emanuel Schwarz
Maintainer: Junfang Chen <junfang.chen33@gmail.com>
Description: The identification of reproducible biological patterns from high-dimensional omics data is a key factor in understanding the biology of complex disease or traits. Incorporating prior biological knowledge into machine learning is an important step in advancing such research. We have proposed a biologically informed multi-stage machine learing framework termed BioMM specifically for phenotype prediction based on omics-scale data where we can evaluate different machine learning models with various prior biological meta information.
Imports:
stats, utils, grDevices, lattice, parallel, doParallel, glmnet, rms, nsprcomp, ranger, e1071, variancePartition, ggplot2
Depends: R (>= 3.5.0)
Suggests: BiocStyle, knitr, RUnit, BiocGenerics
VignetteBuilder: knitr
biocViews: Genetics, Classification, Regression, Pathways, GO, Software
Encoding: UTF-8
LazyData: true
License: GPL-3
RoxygenNote: 6.1.1
Add SSH keys to your GitHub account. SSH keys will are used to control access to accepted Bioconductor packages. See these instructions to add SSH keys to your GitHub account.
A reviewer has been assigned to your package. Learn what to expect during the review process.
IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Thanks for the report, "BioMM.Rproj" is removed now.
Remember to increment the version number (to 0.99.1, 0.99.2, ...) to trigger automatic builds.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
5a40ba1 updated R version dependency Depends: R (>= 3.5.2...
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
cc6139f Updated version
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
If you have Authors@R
section you don't need the Author
and
Maintainer
section. This is redundant.
Indentation matters even in the DESCRIPTION file.
ok
Please name your files appropriately. It is not convention to call them "1.dataProcess.R". Just call it "dataProcess.R" if needed, there is no need for numerical encoding. Do this for all your files.
Indent your code appropriately. Use formatR, and {
where ever you
have if
statments. This makes your code easier to read for the
reviewer.
omics2genelist <- function(data, featureAnno, restrictUp=500, restrictDown=5){
-
- if ( colnames(data)[1] != "label" )
- stop("The first column of the 'data' must be the 'label'!")
+
+ if ( colnames(data)[1] != "label" ) {
+ stop("The first column of the 'data' must be the 'label'!")
+ }
dataX <- data[,-1]
seq_along
http://bioconductor.org/developers/how-to/efficient-code/
- for (i in seq_len(length(genes))) {
+ for (i in seq_along(genes)) {
Remove commented out lines of code, unless it's part of documentation.
You don't need to use which
here,
- for (i in seq_len(length(genes))) {
- # print(paste0('Gene: ', i))
- annoSub <- probeAnno[which(probeAnno[,"entrezID"] == genes[i]),]
+ for (i in seq_along(genes)) {
+ annoSub <- probeAnno[probeAnno[,"entrezID"] == genes[i],]
or here, (which isn't needed), because it creates a logical vector which acts as your index.
geneFilteredIndex <- which(geneSize >= restrictDown &
geneSize <= restrictUp)
Evaluate through out your code where you use functions which are not needed.
print()
)http://bioconductor.org/developers/how-to/coding-style/
End-User messages
message() communicates diagnostic messages (e.g., progress during lengthy computations) during code evaluation.
warning() communicates unusual situations handled by your code.
stop() indicates an error condition.
cat() or print() are used only when displaying an object to the user, e.g., in a show method.
BiocParallel::bplapply()
instead of mclapply
in your code base. Change installation method from install_github
to BiocManager
.
Please explain why there are so many unevaluated R chunks? (i.e eval=FALSE
in the code?)
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
65e7829 updated version
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
Dear @nturaga Thank you very much for your review! Please see the response below:
BioMM review
R CMD build
- ok
R CMD INSTALL
- ok
DESCRIPTION
- If you have
Authors@R
section you don't need theAuthor
andMaintainer
section. This is redundant.
Authors@R section is removed.
- Indentation matters even in the DESCRIPTION file.
Fixed.
NAMESPACE
ok
R
- Please name your files appropriately. It is not convention to call them "1.dataProcess.R". Just call it "dataProcess.R" if needed, there is no need for numerical encoding. Do this for all your files.
Fixed.
- Indent your code appropriately. Use formatR, and
{
where ever you haveif
statments. This makes your code easier to read for the reviewer.
omics2genelist <- function(data, featureAnno, restrictUp=500, restrictDown=5){ - - if ( colnames(data)[1] != "label" ) - stop("The first column of the 'data' must be the 'label'!") + + if ( colnames(data)[1] != "label" ) { + stop("The first column of the 'data' must be the 'label'!") + } dataX <- data[,-1]
- If you have a vector on a certain length, use
seq_along
seq_len(length()) was changed to seq_along())
http://bioconductor.org/developers/how-to/efficient-code/
- for (i in seq_len(length(genes))) { + for (i in seq_along(genes)) {
- Remove commented out lines of code, unless it's part of documentation.
Done.
- You don't need to use
which
here,- for (i in seq_len(length(genes))) { - # print(paste0('Gene: ', i)) - annoSub <- probeAnno[which(probeAnno[,"entrezID"] == genes[i]),] + for (i in seq_along(genes)) { + annoSub <- probeAnno[probeAnno[,"entrezID"] == genes[i],]
or here, (which isn't needed), because it creates a logical vector which acts as your index.
geneFilteredIndex <- which(geneSize >= restrictDown & geneSize <= restrictUp)
Evaluate through out your code where you use functions which are not needed.
Removed ‘which’ wherever not needed.
- To show end-user messages, please follow this (i.e don't use
print()
)
All end-user messages were now shown by message() instead of print().
http://bioconductor.org/developers/how-to/coding-style/
End-User messages message() communicates diagnostic messages (e.g., progress during lengthy computations) during code evaluation. warning() communicates unusual situations handled by your code. stop() indicates an error condition. cat() or print() are used only when displaying an object to the user, e.g., in a show method.
- Use
BiocParallel::bplapply()
instead ofmclapply
in your code base.
Done.
vignettes
- Change installation method from
install_github
toBiocManager
.
Done.
- Please explain why there are so many unevaluated R chunks? (i.e
eval=FALSE
in the code?)
The only reason was to reduce the running time (a couple of hours required depending on the number of cores used for parallel computing).
Not all print
statements are removed. Please check resultReport.R
. There are still a few print
statements left behind.
Another piece of code which is unusual, what is the functionality of print(plot(...))
. This is unnecessary.
There are still some which
statements you don't need. which
creates a numeric index, which is not needed when subsetting a vector.
Please use a smaller example in the vignette using some mock data or example data, to showcase the functionality. The vignette should comprise of test use-cases which the user can successfully run and evaluate.
You may also include the examples you have in a section, which say that these examples require significant compute resources.
bplapply
goes, you can consider also giving the user the choice of BPPARAM to use, if the function can be parallelized. BIocParallel
also does the platform type checking for you already, you do not need to check if(.Platform$OS.type
to choose between windows
and Linux based platforms. The function should be ideally implemented with this interface in mind (from a user stand point).
param <- MulticoreParam(workers = 2)
pred_test <- predByCV(data, repeats, ....., BPPARAM = param)
The function could look like,
predByCV <- function(data, repeats, nfolds, FSmethod, cutP, fdr, FScore,
classifier, predMode, paramlist,
BPPPARAM = MulticoreParam())
{
cvMat <- c()
replist <- seq_len(repeats)
for (reps in replist) {
foldlists <- split(sample(nrow(data)), rep(seq_len(nfolds),
length = nrow(data)))
cvY <- rep(NA, nrow(data))
cvEstimate <- bplapply(foldlists, function(fold) {
trainData <- data[-fold, ]
testData <- data[fold, ]
predTest <- predByFS(trainData, testData, FSmethod, cutP, fdr,
FScore, classifier, predMode, paramlist)
}, BPPARAM = BPPARAM)
cvY[unlist(foldlists)] <- unlist(cvEstimate)
cvMat <- cbind(cvMat, cvY)
}
Please make similar changes wherever applicable in other places you've used BiocParallel.
return
calls at the end of your functions. R automatically returns as long as you declare the variable at the end of the function.foo <- function(x = 2) {
x* 2
}
Dear @nturaga Thanks a lot for your review. I just have a question regarding the usage of 'MulticoreParam' on different platforms. Do I have to indicate the platform type in the examples of the documentation (.Rd files), also in the vignette? so that my package could pass all three Bioconductor platforms. As far as I have tested on windows, 'MulticoreParam' could not dispatch to 'Serial'.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
Dear @nturaga Many thanks again, please see my reply below.
Update
R
- Not all
resultReport.R
. There are still a fewprint(plot(...))
. This is unnecessary.
Now 'print' statements are removed, except that print(plot(...)) is still used. The reason is that I have to call print() explicitly to draw a plot inside a function when using ggplot2 plot. See also https://www.rdocumentation.org/packages/ggplot2/versions/3.1.0/topics/print.ggplot
- There are still some
which
statements you do not need.which
creates a numeric index, which is not needed when subsetting a vector.
Unnecessary 'which' statements were now removed. However, 'which' statements in exploreStat.R
are not removed since its derived numeric index will be used for computing the overlapping numeric index from another vector. e.g. See line 126, line 142 in exploreStat.R
.
- Please use a smaller example in the vignette using some mock data or example data, to showcase the functionality. The vignette should comprise of test use-cases which the user can successfully run and evaluate.
You may also include the examples you have in a section, which say that these examples require significant compute resources.
Thank you for your suggestion. The vignette is now updated using smaller datasets and models. In addition, a subsection was included to describe the computational consideration.
- As far as
bplapply
goes, you can consider also giving the user the choice of BPPARAM to use, if the function can be parallelized.BIocParallel
also does the platform type checking for you already, you do not need to checkif(.Platform$OS.type
to choose betweenwindows
and Linux based platforms.The function should be ideally implemented with this interface in mind (from a user stand point).
param <- MulticoreParam(workers = 2) pred_test <- predByCV(data, repeats, ....., BPPARAM = param)
The function could look like,
predByCV <- function(data, repeats, nfolds, FSmethod, cutP, fdr, FScore, classifier, predMode, paramlist, BPPPARAM = MulticoreParam()) { cvMat <- c() replist <- seq_len(repeats) for (reps in replist) { foldlists <- split(sample(nrow(data)), rep(seq_len(nfolds), length = nrow(data))) cvY <- rep(NA, nrow(data)) cvEstimate <- bplapply(foldlists, function(fold) { trainData <- data[-fold, ] testData <- data[fold, ] predTest <- predByFS(trainData, testData, FSmethod, cutP, fdr, FScore, classifier, predMode, paramlist) }, BPPARAM = BPPARAM) cvY[unlist(foldlists)] <- unlist(cvEstimate) cvMat <- cbind(cvMat, cvY) }
Please make similar changes wherever applicable in other places you have used BiocParallel.
I have now adapted the usage of MulticoreParam() in all related functions.
- If you are worried about efficiency of the package, then you can try to eliminate
return
calls at the end of your functions. R automatically returns as long as you declare the variable at the end of the function.foo <- function(x = 2) { x* 2 }
Many thanks for the hint. I updated some functions accordingly.
> result <- BioMM(trainData=studyDataSub, testData=NULL,
+ stratify="chromosome", pathlistDB, featureAnno,
+ restrictUp=10, restrictDown=200, minPathSize=10,
+ supervisedStage1, typePCA="regular",
+ resample1="BS", resample2="CV", dataMode="allTrain",
+ repeatA1=20, repeatA2=1, repeatB1=20, repeatB2=1,
+ nfolds=10, FSmethod1=NULL, FSmethod2=NULL,
+ cutP1=0.1, cutP2=0.1, fdr1=NULL, fdr2=NULL, FScore=param1,
+ classifier1, classifier2, predMode1, predMode2,
+ paramlist1, paramlist2, innerCore=param2,
+ outFileA2=NULL, outFileB2=NULL)
[1] "Chromosomes: "
[1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "20" "21" "22" "3" "4" "5" "6"
[20] "7" "8" "9"
[1] "Summary stat of # probes in each Chr: "
Min. 1st Qu. Median Mean 3rd Qu. Max.
27.00 64.00 86.00 90.86 118.00 216.00
Error in as.integer(mc.cores) :
cannot coerce type 'S4' to vector of type 'integer'
Another way of looking at this,
Quitting from lines 163-184 (BioMMtutorial.Rmd)
Error in as.integer(mc.cores) :
cannot coerce type 'S4' to vector of type 'integer'
Update the version number so that the build runs after you fix the error.
Dear @nturaga Thanks for your quick update. I tried to replicate this issue on several different machines (windows, linux and mac) but I did not get any error. I was wondering if you would have any problem when running the following example?
## Load data
methylfile <- system.file('extdata', 'methylData.rds', package='BioMM')
methylData <- readRDS(methylfile)
dataY <- methylData[,1]
## select a subset of genome-wide methylation data at random
methylSub <- data.frame(label=dataY, methylData[,c(2:2001)])
trainIndex <- sample(nrow(methylSub), 30)
trainData = methylSub[trainIndex,]
testData = methylSub[-trainIndex,]
library(ranger)
library(BiocParallel)
param1 <- MulticoreParam(workers = 1)
param2 <- MulticoreParam(workers = 20)
predY <- predByBS(trainData, testData,
dataMode='allTrain', repeats=50,
FSmethod=NULL, cutP=0.1,
fdr=NULL, FScore=param1,
classifier='randForest',
predMode='classification',
paramlist=list(ntree=300, nthreads=10),
innerCore=param2)
testY <- testData[,1]
accuracy <- classifiACC(dataY=testY, predY=predY)
print(accuracy)
Received a valid push; starting a build. Commits are:
c097a39 Updated version check if any error in the vignett...
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
Your package has been accepted. It will be added to the Bioconductor Git repository and nightly builds. Additional information will be posed to this issue in the next several days.
Thank you for contributing to Bioconductor!
Great, thank you so much for the time and effort taken to review our package!
The master branch of your GitHub repository has been added to Bioconductor's git repository.
To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (https://github.com/Junfang.keys is not empty), then no further steps are required. Otherwise, do the following:
See further instructions at
https://bioconductor.org/developers/how-to/git/
for working with this repository. See especially
https://bioconductor.org/developers/how-to/git/new-package-workflow/ https://bioconductor.org/developers/how-to/git/sync-existing-repositories/
to keep your GitHub and Bioconductor repositories in sync.
Your package will be included in the next nigthly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at
https://bioconductor.org/checkResults/
(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, you package will be available for download in the 'Devel' version of Bioconductor using BiocManager::install("BioMM")
. The package 'landing page' will be created at
https://bioconductor.org/packages/BioMM
If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.
Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor
Confirm the following by editing each check box to '[x]'
[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.
I am familiar with the essential aspects of Bioconductor software management, including:
For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.