Bioconductor / Contributions

Contribute Packages to Bioconductor
134 stars 33 forks source link

NewWave #1629

Closed fedeago closed 4 years ago

fedeago commented 4 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 4 years ago

Hi @fedeago

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: NewWave
Type: Package
Title: Negative binomial model for scRNA-seq
Version: 0.2.0
Authors@R: c(person("Federico", "Agostinis", role = c("aut", "cre"),
          email = "federico.agostinis@outlook.com"),
   person("Chiara", "Romualdi", role = "aut"),
   person("Gabriele", "Sales", role = "aut"),
   person("Davide", "Risso", role = "aut"))
Description: A model designed for dimensionality reduction and batch effect
    removal. It is designed to be massively parallelizable and it can be used
    with different mini-batch approaches in order to reduce time consumption.
Depends: 
    R (>= 4.0),
    methods,
    SummarizedExperiment,
    parallel,
    SingleCellExperiment,
    SharedObject(>= 1.3.15)
Imports: 
    irlba,
    Matrix,
    DelayedArray,
    BiocSingular,
    stats
Suggests:
    testthat,
    rmarkdown,
    splatter,
    mclust,
    Rtsne,
    ggplot2,
    Rcpp,
    BiocStyle,
    knitr
License: GPL-3
VignetteBuilder: knitr
biocViews: Software,
    GeneExpression,
    Transcriptomics,
    SingleCell,
    BatchEffect
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.1.1
BugReports:https://github.com/fedeago/NewWave/issues
bioc-issue-bot commented 4 years ago

A reviewer has been assigned to your package. Learn what to expect during the review process.

IMPORTANT: Please read this documentation for setting up remotes to push to git.bioconductor.org. It is required to push a version bump to git.bioconductor.org to trigger a new build.

Bioconductor utilized your github ssh-keys for git.bioconductor.org access. To manage keys and future access you may want to active your Bioconductor Git Credentials Account

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/NewWave to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot commented 4 years ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 9586c19ed2c58edf256b7500d61b53bdf2812dda

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/NewWave to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot commented 4 years ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 5da8b4affb57735a1cf0784b81e54b374ad42066

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/NewWave to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot commented 4 years ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 94d6097abb4260d22d11c977c5654e249b7ddf27

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/NewWave to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot commented 4 years ago

Received a valid push on git.bioconductor.org; starting a build for commit id: eda274059347c0ef2a2581344142117714950a4d

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/NewWave to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot commented 4 years ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 3b7265ef45e718a6b75b9e0ba274ac78e8c4c669

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/NewWave to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

Kayla-Morrell commented 4 years ago

Hello @fedeago -

Thank you for submitting to Bioconductor. Please see the initial review of the package below. The required changes must be made while the suggested changes do not have to be (though we strongly encourage them). Comment back here with updates that have been made and when the package is ready for a re-review.

General package development

DESCRIPTION

NEWS

Vignette

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("NewWave")

Man pages

R code

Best, Kayla

bioc-issue-bot commented 4 years ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 19b1d2b74ddfc9d2e770c6f9c3391c3cbc2497e5

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/NewWave to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot commented 4 years ago

Received a valid push on git.bioconductor.org; starting a build for commit id: ed3c231bbc0d3854d89be44b0d45db7de6c988ec

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/NewWave to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

fedeago commented 4 years ago

Hello @Kayla-Morrell

Thank you for your precious observation. In inst/function.R there are all the function needed by the cluster for the estimation. This is the only way we found to be sure that no other useless variable and library present in the global environment were duplicated across all cluster's nodes causing an increasing in memory consumption, this increase is especially visible foe small and medium data set.

General package development

 * [x]  REQUIRED: What is the purpose of the 'inst/function.R' file?

DESCRIPTION

 * [x]  REQUIRED: The 'Description:' field is made up by less than 3 sentences.
   Expand this field and structure it as a full paragraph.

 * [x]  REQUIRED: The 'Depends:' field should be edited. This field is for
   packages that provide essential functionality. It is unusual for more than 3
   packages to be listed here. If they don't provide essential functionality, they
   should be moved to 'Imports:'.

 * [x]  SUGGESTION: Consider adding these automatically suggested biocViews:

   * Sequencing, Coverage

NEWS

 * [x]  REQUIRED: The formating of the NEWS file is not correct. Please see the
   following link for help with this:
   https://bioconductor.org/developers/package-guidelines/#news.

Vignette

 * [x]  REQUIRED: The vignette 'title:' field should be more informative.

 * [x]  REQUIRED: I would remove lines 17-24 since you repeat the code immediately
   after and echo it to the user.

 * [x]  REQUIRED: There should be an 'Introduction' section that serves as an
   abstract to introduce the objective, models, unique functions, key points, etc.
   that distinguish the package from other packages of similar type.

 * [x]  REQUIRED: There should be an 'Installation' section that shows the user
   how to download the package from Bioconductor. The code should look like the
   following and include `eval = FALSE`.
 if(!requireNamespace("BiocManager", quietly = TRUE))
     install.packages("BiocManager")
 BiocManager::install("NewWave")
 * [x]  SUGGESTION: We strongly encourage a table of contents.

 * [x]  REQUIRED: The final section should be 'Session Information' and should
   include `sessionInfo()`.

Man pages

 * [x]  REQUIRED: Man pages are only needed for functions that are exported. For
   any function that is documented but not being exported you can add the
   `@keywords internal` tag to the documentation. This will remove the function
   from the documentation index.

R code

 * [ ]  SUGGESTION: For formating reasons, consider shorter lines. There are 27
   lines that are > 80 characters long.

 * [ ]  SUGGESTION: For formating reasons, consider multiples of 4 spaces for
   line indents. There are 363 lines that are not.

Best Federico

mtmorgan commented 4 years ago

I don't think your analysis is correct with respect to serialization of objects in the global environment -- these are in general not serialized; instead you are seeing the cost of loading your package & dependencies on the workers. Have you investigated ways to reduce the dependency burden of your package? When you start the cluster, you could pay the one-time cost of loading your package, and then the overhead on subsequent calls would be minimal. Also, have you thought about choosing a more 'granular' parallelization strategy, where more work is done in each parallel call? This will be more efficient in general.

The basic problem with inst/script/*R files is that they are not checked at all, so that 'bit rot' can (and will!!) quickly set in.

fedeago commented 4 years ago

Hi @mtmorgan ,

in my repository I created another branch called "BioconductorIssue" in which function.R is inside R folder and the functions are called by clusterApply using NewWave:::. Running this script:

library(splatter)
library(NewWave)

params <- newSplatParams()
N=1000
set.seed(1234)
data <- splatSimulateGroups(params,batchCells=c(N/2,N/2),
                            group.prob = rep(0.1,10),
                            de.prob = 0.2,
                            verbose = FALSE) 
set.seed(12359)
hvg <- rowVars(counts(data))
names(hvg) <- rownames(counts(data))
data <- data[names(sort(hvg,decreasing=TRUE))[1:1000],]
res2 <- newWave(data,X = "~Batch", K=10, children=10)

with both version in my computer the new version without inst/function.R use near 500MB of memory for each job while the version with inst/function.R uses only 150MB for each job. I think that this is a big difference and this reduction of memory consumption make that file worth to be in that folder. Another option could be create a two package, one for the main job and one for the children jobs. In this case the child package depends only on base package while the main package depends on all actual packages plus this child package. If you have any suggestion on other option I would be pleased to investigate them.

R and Bioconductor checks inst/function.R indirectly becouse the vignette cannot run if one of these functions is broken.

Best, Federico

drisso commented 4 years ago

@mtmorgan @Kayla-Morrell,

we're experimenting with two solutions:

  1. Creating a minimal package (to be submitted to CRAN) with just the heavy-lifting optimization functions and minimal dependencies. NewWave would depend on this new package and only the minimal package would be sent to the workers.
  2. Creating a minimal environment in which to copy the needed functions from within the package and then exporting that minimal environment to the workers.

We can confirm that 1. works (see branch https://github.com/fedeago/NewWave/tree/TwoPack, new package at https://github.com/fedeago/deepwave), while we still need to implement and test 2. However, solution 2 has the advantage of not creating an extra package.

We will get back to you once we have explored solution 2.

mtmorgan commented 4 years ago

I mean to get to this today, but I was thinking that you could do

funs <- local({
    var1 <- ...
    list(
        fun1 = function(...) {},
        fun2 = function() {}
    )
}, env = .GlobalEnv)

although this is untested. I think a solution like this is better than a separate, minimal package. I think you would not need to export any environment; parallel evaluation of funs$fun1 would send it's environment (including var1) to the worker.

fedeago commented 4 years ago

@mtmorgan

we try to use your solution but still uses too much RAM.

In order to investigate the behavior of that implementation I created a R package that can be found in the repository https://github.com/fedeago/RAMinvestigation. It is a minimal implementation with a single function(main(number_of_workers)) that imports the package SummarizedExperiments, creates a cluster and than apply to that cluster a function defined as you suggested. You only need to choose the number of workers. Suddenly each workers use 400 MB of RAM, the weight of SummarizedExperiment.

Do you have any other suggestion to overcome this problem?

drisso commented 4 years ago

Hi @mtmorgan ,

the repo that @fedeago refers to in his post above is a minimal implementation of your suggestion. It still forces all the workers to load the dependency packages. Any chance you can have a look at the code and see if we're doing something obviously wrong? Otherwise, any suggestions on how to get past this?

Thanks! Davide

mtmorgan commented 4 years ago

The local() function needs to specify an environment outside the package namespace, e.g.,

funs <- local({
  list(
    example = function(k) {
      Sys.sleep(10)
    }
  )
}, envir = baseenv())

Of course the functions used in funs$example need to be defined in the base environment or fully qualified...

funs <- local({
  list(
    example = function(k) {
      Sys.sleep(10)
  }, lm = function(...) {
      utils::data(mtcars, envir = environment())
      stats::lm(mpg ~ cyl, mtcars)
    }
  )
}, envir = baseenv())

Also, it's not necessary to separately export the function, so main() might look like

> RAMinvestigation::main
function (children)
{
    cl <- makePSOCKcluster(children)
    on.exit(stopCluster(cl), add = TRUE)
    clusterApply(cl, x = seq.int(children), funs$example)
    clusterApply(cl, x = seq.int(children), funs$lm)
}
<bytecode: 0x7fea7aa01ee0>
<environment: namespace:RAMinvestigation>
fedeago commented 4 years ago

Hi @mtmorgan, thank you for these new details.

Suddenly these not change RAM consumption of each single worker, they continue to use 400 MB of RAM, the weight of SummarizedExperiment.

We have two implementations working, the original one with `inst\functions.R' and the second one with a package in CRAN for those functions. Is possible to have one of these approved? Which one do you consider more worthy?

Thanks Federico

mtmorgan commented 4 years ago

Use the original approach in inst/functions.R

drisso commented 4 years ago

Thanks @mtmorgan

if you agree to accept the package with the original inst/functions.R approach, we commit to keeping exploring a better solution for a second version of the package.

It looks like this is not a straightforward things to solve, but it may be worth finding a general solution that could be implemented by the large-scale packages to keep the RAM consumption low.

mtmorgan commented 4 years ago

But how are you measuring memory use? I was installing the package using base R tools (R CMD INSTALL .) and base R in a console (R -e "RAMinvestigation::main(10)". I was looking at memory consumption with the Mac's 'Activity Monitor' or the 'top' implementation available there. I created a pull request to make sure that we are using the same code. Also I confirmed that the original code in RAMinvestigation did consume ~400MB of memory on each process, but that after the change the processed consumed only ~40MB. Also I made sure that there were no .RData files in the directory that I was running, or similar issues that might mean that an R process starts with other than base packages.

fedeago commented 4 years ago

I measured it looking at the htop of my linux machine. If you need a more detailed measure I can provide it.

mtmorgan commented 4 years ago

I tweaked my previous comment with a little more detail; I'll run on the bioconductor/bioconductor_docker:devel image to see if we can't get to a consistent interpretation.

mtmorgan commented 4 years ago

I updated and ran the bioconductor docker image

docker pull bioconductor/bioconductor_docker:devel
docker run -it --rm bioconductor/bioconductor_docker:devel bash

I started an R session and installed SummarizedExperiment and the repos

# R
> BiocManager::install("SummarizedExperiment")
> BiocManager::install("fedeago/RAMinvestigation")

From my host system I started another terminal and attached to the running docker image, and started top

docker ps  # showed the 'names' of the image as happy_nash
docker exec -it happy_nash bash
## root@d0f4b4d8d792:/# top

Back in the original terminal and R I ran

> RAMinvestigation::main(3)

and in top I saw the processes consuming ~ 65 MB rather than 448 MB.

Can you reproduce this?

fedeago commented 4 years ago

Yes, I can reproduce it and you are right. There is only one difference, I am using BiocManager 3.11 in my machine and 3.12 on docker.

I will investigate why in my machine there is this inflation in RAM consumption using the last version of RAM investigation and then I will modify NewWave accordingly. Thank you for yours suggestion and your help in debugging.

Best regards, Federico

Kayla-Morrell commented 4 years ago

After discussion with @mtmorgan we are going to go ahead and accept the package. Thank you for your contribution!

Best, Kayla

bioc-issue-bot commented 4 years ago

Your package has been accepted. It will be added to the Bioconductor nightly builds.

Thank you for contributing to Bioconductor!

mtmorgan commented 4 years ago

The master branch of your GitHub repository has been added to Bioconductor's git repository.

To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (https://github.com/fedeago.keys is not empty), then no further steps are required. Otherwise, do the following:

  1. Add an SSH key to your github account
  2. Submit your SSH key to Bioconductor

See further instructions at

https://bioconductor.org/developers/how-to/git/

for working with this repository. See especially

https://bioconductor.org/developers/how-to/git/new-package-workflow/ https://bioconductor.org/developers/how-to/git/sync-existing-repositories/

to keep your GitHub and Bioconductor repositories in sync.

Your package will be included in the next nigthly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at

https://bioconductor.org/checkResults/

(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, you package will be available for download in the 'Devel' version of Bioconductor using BiocManager::install("NewWave"). The package 'landing page' will be created at

https://bioconductor.org/packages/NewWave

If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.