Closed fedeago closed 4 years ago
Hi @fedeago
Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.
The DESCRIPTION file for this package is:
Package: NewWave
Type: Package
Title: Negative binomial model for scRNA-seq
Version: 0.2.0
Authors@R: c(person("Federico", "Agostinis", role = c("aut", "cre"),
email = "federico.agostinis@outlook.com"),
person("Chiara", "Romualdi", role = "aut"),
person("Gabriele", "Sales", role = "aut"),
person("Davide", "Risso", role = "aut"))
Description: A model designed for dimensionality reduction and batch effect
removal. It is designed to be massively parallelizable and it can be used
with different mini-batch approaches in order to reduce time consumption.
Depends:
R (>= 4.0),
methods,
SummarizedExperiment,
parallel,
SingleCellExperiment,
SharedObject(>= 1.3.15)
Imports:
irlba,
Matrix,
DelayedArray,
BiocSingular,
stats
Suggests:
testthat,
rmarkdown,
splatter,
mclust,
Rtsne,
ggplot2,
Rcpp,
BiocStyle,
knitr
License: GPL-3
VignetteBuilder: knitr
biocViews: Software,
GeneExpression,
Transcriptomics,
SingleCell,
BatchEffect
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.1.1
BugReports:https://github.com/fedeago/NewWave/issues
A reviewer has been assigned to your package. Learn what to expect during the review process.
IMPORTANT: Please read this documentation for setting up remotes to push to git.bioconductor.org. It is required to push a version bump to git.bioconductor.org to trigger a new build.
Bioconductor utilized your github ssh-keys for git.bioconductor.org access. To manage keys and future access you may want to active your Bioconductor Git Credentials Account
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details. This link will be active for 21 days.
Remember: if you submitted your package after July 7th, 2020,
when making changes to your repository push to
git@git.bioconductor.org:packages/NewWave
to trigger a new build.
A quick tutorial for setting up remotes and pushing to upstream can be found here.
Received a valid push on git.bioconductor.org; starting a build for commit id: 9586c19ed2c58edf256b7500d61b53bdf2812dda
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details. This link will be active for 21 days.
Remember: if you submitted your package after July 7th, 2020,
when making changes to your repository push to
git@git.bioconductor.org:packages/NewWave
to trigger a new build.
A quick tutorial for setting up remotes and pushing to upstream can be found here.
Received a valid push on git.bioconductor.org; starting a build for commit id: 5da8b4affb57735a1cf0784b81e54b374ad42066
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details. This link will be active for 21 days.
Remember: if you submitted your package after July 7th, 2020,
when making changes to your repository push to
git@git.bioconductor.org:packages/NewWave
to trigger a new build.
A quick tutorial for setting up remotes and pushing to upstream can be found here.
Received a valid push on git.bioconductor.org; starting a build for commit id: 94d6097abb4260d22d11c977c5654e249b7ddf27
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details. This link will be active for 21 days.
Remember: if you submitted your package after July 7th, 2020,
when making changes to your repository push to
git@git.bioconductor.org:packages/NewWave
to trigger a new build.
A quick tutorial for setting up remotes and pushing to upstream can be found here.
Received a valid push on git.bioconductor.org; starting a build for commit id: eda274059347c0ef2a2581344142117714950a4d
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details. This link will be active for 21 days.
Remember: if you submitted your package after July 7th, 2020,
when making changes to your repository push to
git@git.bioconductor.org:packages/NewWave
to trigger a new build.
A quick tutorial for setting up remotes and pushing to upstream can be found here.
Received a valid push on git.bioconductor.org; starting a build for commit id: 3b7265ef45e718a6b75b9e0ba274ac78e8c4c669
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details. This link will be active for 21 days.
Remember: if you submitted your package after July 7th, 2020,
when making changes to your repository push to
git@git.bioconductor.org:packages/NewWave
to trigger a new build.
A quick tutorial for setting up remotes and pushing to upstream can be found here.
Hello @fedeago -
Thank you for submitting to Bioconductor. Please see the initial review of the package below. The required changes must be made while the suggested changes do not have to be (though we strongly encourage them). Comment back here with updates that have been made and when the package is ready for a re-review.
[ ] REQUIRED: The 'Description:' field is made up by less than 3 sentences. Expand this field and structure it as a full paragraph.
[ ] REQUIRED: The 'Depends:' field should be edited. This field is for packages that provide essential functionality. It is unusual for more than 3 packages to be listed here. If they don't provide essential functionality, they should be moved to 'Imports:'.
[ ] SUGGESTION: Consider adding these automatically suggested biocViews:
[ ] REQUIRED: The vignette 'title:' field should be more informative.
[ ] REQUIRED: I would remove lines 17-24 since you repeat the code immediately after and echo it to the user.
[ ] REQUIRED: There should be an 'Introduction' section that serves as an abstract to introduce the objective, models, unique fuctions, key points, etc. that distinguish the package from other packages of similar type.
[ ] REQUIRED: There should be an 'Installation' section that shows the user
how to download the package from Bioconductor. The code should look like the
following and include eval = FALSE
.
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("NewWave")
[ ] SUGGESTION: We strongly encourage a table of contents.
[ ] REQUIRED: The final section should be 'Session Information' and should
include sessionInfo()
.
@keywords internal
tag to the documentation. This will remove the function
from the documentation index.[ ] SUGGESTION: For formating reasons, consider shorter lines. There are 27 lines that are > 80 characters long.
[ ] SUGGESTION: For formating reasons, consider multiples of 4 spaces for line indents. There are 363 lines that are not.
Best, Kayla
Received a valid push on git.bioconductor.org; starting a build for commit id: 19b1d2b74ddfc9d2e770c6f9c3391c3cbc2497e5
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details. This link will be active for 21 days.
Remember: if you submitted your package after July 7th, 2020,
when making changes to your repository push to
git@git.bioconductor.org:packages/NewWave
to trigger a new build.
A quick tutorial for setting up remotes and pushing to upstream can be found here.
Received a valid push on git.bioconductor.org; starting a build for commit id: ed3c231bbc0d3854d89be44b0d45db7de6c988ec
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details. This link will be active for 21 days.
Remember: if you submitted your package after July 7th, 2020,
when making changes to your repository push to
git@git.bioconductor.org:packages/NewWave
to trigger a new build.
A quick tutorial for setting up remotes and pushing to upstream can be found here.
Hello @Kayla-Morrell
Thank you for your precious observation. In inst/function.R there are all the function needed by the cluster for the estimation. This is the only way we found to be sure that no other useless variable and library present in the global environment were duplicated across all cluster's nodes causing an increasing in memory consumption, this increase is especially visible foe small and medium data set.
* [x] REQUIRED: What is the purpose of the 'inst/function.R' file?
* [x] REQUIRED: The 'Description:' field is made up by less than 3 sentences.
Expand this field and structure it as a full paragraph.
* [x] REQUIRED: The 'Depends:' field should be edited. This field is for
packages that provide essential functionality. It is unusual for more than 3
packages to be listed here. If they don't provide essential functionality, they
should be moved to 'Imports:'.
* [x] SUGGESTION: Consider adding these automatically suggested biocViews:
* Sequencing, Coverage
* [x] REQUIRED: The formating of the NEWS file is not correct. Please see the
following link for help with this:
https://bioconductor.org/developers/package-guidelines/#news.
* [x] REQUIRED: The vignette 'title:' field should be more informative.
* [x] REQUIRED: I would remove lines 17-24 since you repeat the code immediately
after and echo it to the user.
* [x] REQUIRED: There should be an 'Introduction' section that serves as an
abstract to introduce the objective, models, unique functions, key points, etc.
that distinguish the package from other packages of similar type.
* [x] REQUIRED: There should be an 'Installation' section that shows the user
how to download the package from Bioconductor. The code should look like the
following and include `eval = FALSE`.
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("NewWave")
* [x] SUGGESTION: We strongly encourage a table of contents.
* [x] REQUIRED: The final section should be 'Session Information' and should
include `sessionInfo()`.
* [x] REQUIRED: Man pages are only needed for functions that are exported. For
any function that is documented but not being exported you can add the
`@keywords internal` tag to the documentation. This will remove the function
from the documentation index.
* [ ] SUGGESTION: For formating reasons, consider shorter lines. There are 27
lines that are > 80 characters long.
* [ ] SUGGESTION: For formating reasons, consider multiples of 4 spaces for
line indents. There are 363 lines that are not.
Best Federico
I don't think your analysis is correct with respect to serialization of objects in the global environment -- these are in general not serialized; instead you are seeing the cost of loading your package & dependencies on the workers. Have you investigated ways to reduce the dependency burden of your package? When you start the cluster, you could pay the one-time cost of loading your package, and then the overhead on subsequent calls would be minimal. Also, have you thought about choosing a more 'granular' parallelization strategy, where more work is done in each parallel call? This will be more efficient in general.
The basic problem with inst/script/*R files is that they are not checked at all, so that 'bit rot' can (and will!!) quickly set in.
Hi @mtmorgan ,
in my repository I created another branch called "BioconductorIssue" in which function.R is inside R folder and the functions are called by clusterApply using NewWave:::
. Running this script:
library(splatter)
library(NewWave)
params <- newSplatParams()
N=1000
set.seed(1234)
data <- splatSimulateGroups(params,batchCells=c(N/2,N/2),
group.prob = rep(0.1,10),
de.prob = 0.2,
verbose = FALSE)
set.seed(12359)
hvg <- rowVars(counts(data))
names(hvg) <- rownames(counts(data))
data <- data[names(sort(hvg,decreasing=TRUE))[1:1000],]
res2 <- newWave(data,X = "~Batch", K=10, children=10)
with both version in my computer the new version without inst/function.R
use near 500MB of memory for each job while the version with inst/function.R
uses only 150MB for each job. I think that this is a big difference and this reduction of memory consumption make that file worth to be in that folder. Another option could be create a two package, one for the main job and one for the children jobs. In this case the child package depends only on base package while the main package depends on all actual packages plus this child package. If you have any suggestion on other option I would be pleased to investigate them.
R and Bioconductor checks inst/function.R
indirectly becouse the vignette cannot run if one of these functions is broken.
Best, Federico
@mtmorgan @Kayla-Morrell,
we're experimenting with two solutions:
We can confirm that 1. works (see branch https://github.com/fedeago/NewWave/tree/TwoPack, new package at https://github.com/fedeago/deepwave), while we still need to implement and test 2. However, solution 2 has the advantage of not creating an extra package.
We will get back to you once we have explored solution 2.
I mean to get to this today, but I was thinking that you could do
funs <- local({
var1 <- ...
list(
fun1 = function(...) {},
fun2 = function() {}
)
}, env = .GlobalEnv)
although this is untested. I think a solution like this is better than a separate, minimal package. I think you would not need to export any environment; parallel evaluation of funs$fun1
would send it's environment (including var1
) to the worker.
@mtmorgan
we try to use your solution but still uses too much RAM.
In order to investigate the behavior of that implementation I created a R package that can be found in the repository https://github.com/fedeago/RAMinvestigation
. It is a minimal implementation with a single function(main(number_of_workers)
) that imports the package SummarizedExperiments, creates a cluster and than apply to that cluster a function defined as you suggested. You only need to choose the number of workers. Suddenly each workers use 400 MB of RAM, the weight of SummarizedExperiment.
Do you have any other suggestion to overcome this problem?
Hi @mtmorgan ,
the repo that @fedeago refers to in his post above is a minimal implementation of your suggestion. It still forces all the workers to load the dependency packages. Any chance you can have a look at the code and see if we're doing something obviously wrong? Otherwise, any suggestions on how to get past this?
Thanks! Davide
The local()
function needs to specify an environment outside the package namespace, e.g.,
funs <- local({
list(
example = function(k) {
Sys.sleep(10)
}
)
}, envir = baseenv())
Of course the functions used in funs$example
need to be defined in the base environment or fully qualified...
funs <- local({
list(
example = function(k) {
Sys.sleep(10)
}, lm = function(...) {
utils::data(mtcars, envir = environment())
stats::lm(mpg ~ cyl, mtcars)
}
)
}, envir = baseenv())
Also, it's not necessary to separately export the function, so main()
might look like
> RAMinvestigation::main
function (children)
{
cl <- makePSOCKcluster(children)
on.exit(stopCluster(cl), add = TRUE)
clusterApply(cl, x = seq.int(children), funs$example)
clusterApply(cl, x = seq.int(children), funs$lm)
}
<bytecode: 0x7fea7aa01ee0>
<environment: namespace:RAMinvestigation>
Hi @mtmorgan, thank you for these new details.
Suddenly these not change RAM consumption of each single worker, they continue to use 400 MB of RAM, the weight of SummarizedExperiment.
We have two implementations working, the original one with `inst\functions.R' and the second one with a package in CRAN for those functions. Is possible to have one of these approved? Which one do you consider more worthy?
Thanks Federico
Use the original approach in inst/functions.R
Thanks @mtmorgan
if you agree to accept the package with the original inst/functions.R approach, we commit to keeping exploring a better solution for a second version of the package.
It looks like this is not a straightforward things to solve, but it may be worth finding a general solution that could be implemented by the large-scale packages to keep the RAM consumption low.
But how are you measuring memory use? I was installing the package using base R tools (R CMD INSTALL .
) and base R in a console (R -e "RAMinvestigation::main(10)"
. I was looking at memory consumption with the Mac's 'Activity Monitor' or the 'top' implementation available there. I created a pull request to make sure that we are using the same code. Also I confirmed that the original code in RAMinvestigation did consume ~400MB of memory on each process, but that after the change the processed consumed only ~40MB. Also I made sure that there were no .RData files in the directory that I was running, or similar issues that might mean that an R process starts with other than base packages.
I measured it looking at the htop
of my linux machine. If you need a more detailed measure I can provide it.
I tweaked my previous comment with a little more detail; I'll run on the bioconductor/bioconductor_docker:devel image to see if we can't get to a consistent interpretation.
I updated and ran the bioconductor docker image
docker pull bioconductor/bioconductor_docker:devel
docker run -it --rm bioconductor/bioconductor_docker:devel bash
I started an R session and installed SummarizedExperiment and the repos
# R
> BiocManager::install("SummarizedExperiment")
> BiocManager::install("fedeago/RAMinvestigation")
From my host system I started another terminal and attached to the running docker image, and started top
docker ps # showed the 'names' of the image as happy_nash
docker exec -it happy_nash bash
## root@d0f4b4d8d792:/# top
Back in the original terminal and R I ran
> RAMinvestigation::main(3)
and in top I saw the processes consuming ~ 65 MB rather than 448 MB.
Can you reproduce this?
Yes, I can reproduce it and you are right. There is only one difference, I am using BiocManager 3.11 in my machine and 3.12 on docker.
I will investigate why in my machine there is this inflation in RAM consumption using the last version of RAM investigation
and then I will modify NewWave
accordingly.
Thank you for yours suggestion and your help in debugging.
Best regards, Federico
After discussion with @mtmorgan we are going to go ahead and accept the package. Thank you for your contribution!
Best, Kayla
Your package has been accepted. It will be added to the Bioconductor nightly builds.
Thank you for contributing to Bioconductor!
The master branch of your GitHub repository has been added to Bioconductor's git repository.
To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (https://github.com/fedeago.keys is not empty), then no further steps are required. Otherwise, do the following:
See further instructions at
https://bioconductor.org/developers/how-to/git/
for working with this repository. See especially
https://bioconductor.org/developers/how-to/git/new-package-workflow/ https://bioconductor.org/developers/how-to/git/sync-existing-repositories/
to keep your GitHub and Bioconductor repositories in sync.
Your package will be included in the next nigthly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at
https://bioconductor.org/checkResults/
(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, you package will be available for download in the 'Devel' version of Bioconductor using BiocManager::install("NewWave")
. The package 'landing page' will be created at
https://bioconductor.org/packages/NewWave
If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.
Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor
Confirm the following by editing each check box to '[x]'
[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.
I am familiar with the essential aspects of Bioconductor software management, including:
For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.