Bioconductor / Contributions

Contribute Packages to Bioconductor
134 stars 33 forks source link

Reproducible GSEA Benchmarking #630

Closed lgeistlinger closed 6 years ago

lgeistlinger commented 6 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 6 years ago

Hi @lgeistlinger

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: GSEABenchmarkeR
Type: Package
Title: Reproducible GSEA Benchmarking
Version: 0.99.0
Author:
    Ludwig Geistlinger [aut, cre],
    Gergely Csaba [aut],
    Mara Santarelli [ctb],
    Lucas Schiffer [ctb],
    Marcel Ramos [ctb],
    Levi Waldron [ctb],
    Ralf Zimmer [aut]
Maintainer: Ludwig Geistlinger <Ludwig.Geistlinger@sph.cuny.edu>
Description:
    The GSEABenchmarkeR package implements an extendable framework for
    reproducible evaluation of set- and network-based methods for enrichment
    analysis of gene expression data. This includes support for the efficient
    execution of these methods on comprehensive real data compendia (microarray
    and RNA-seq) using parallel computation on standard workstations and
    institutional computer grids. Methods can then be assessed with respect to
    runtime, statistical significance, and relevance of the results for the
    phenotypes investigated.
URL: https://github.com/waldronlab/GSEABenchmarkeR
BugReports: https://github.com/waldronlab/GSEABenchmarkeR/issues
License: Artistic-2.0
Encoding: UTF-8
LazyData: true
Depends:
    R(>= 3.5.0),
    Biobase,
    SummarizedExperiment
Imports:
    AnnotationDbi, 
    AnnotationHub,
    BiocParallel,
    edgeR,
    EnrichmentBrowser,
    ExperimentHub,
    GEOquery,
    grDevices,
    graphics,
    KEGGandMetacoreDzPathwaysGEO,
    KEGGdzPathwaysGEO,
    methods,
    rappdirs,
    S4Vectors,
    stats,
    utils
Suggests:
    BiocStyle,
    BiocCheck,
    knitr,
    rmarkdown,
    testthat
biocViews:
    Microarray,
    RNASeq,
    GeneExpression,
    DifferentialExpression,
    Pathways,
    GraphAndNetwork,
    Network,
    GeneSetEnrichment,
    NetworkEnrichment,
    Visualization,
    ReportWriting
VignetteBuilder: knitr
RoxygenNote: 6.0.1

Consider adding SSH keys to your GitHub account. SSH keys will are used to control access to accepted Bioconductor packages. See these instructions to add SSH keys to your GitHub account.

bioc-issue-bot commented 6 years ago

Your package has been approved for building. Your package is now submitted to our queue.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, TIMEOUT, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

84ce140 reduce loadEData example c76ede1 reduce loadEData example

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

f0f9899 reduce maPreprocApply Example

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

5b6bb46 adapt user_data_dir for windows

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

55d0438 adapting BiocParallelParam example for windows

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

lawremi commented 6 years ago

I was surprised (not so pleasantly) that the API is based on a vector of IDs that key into a hidden environment within the package. The deApply() function, confusingly named because it does not apply a user-provided function like all other apply functions, appears to return the results by side effect. Is there a reason to not simply pass values around? Could a MultiAssayExperiment represent a compendium?

lgeistlinger commented 6 years ago

@lawremi Thanks for the feedback!

I think of a compendium as a collection of datasets of the same assay type (e.g. RNA-seq), where samples are typically distinct between datasets.

So I guess a compendium is suitably represented by a list of datasets each being a SummarizedExperiment, or one huge SummarizedExperiment as e.g. done for the GSE62944(TCGA RNA-seq) compendium.

I decided for the first option when creating the package in the first place, but I'm currently wondering whether the second option might be preferred?

lgeistlinger commented 6 years ago

Not sure whether using a hidden environment has been a good decision (apparently not), but I thereby wanted to ensure reproducibility of the benchmark routines as this

i. prevents users from intentionally or accidentally modifying the underlying compendium ii. enables me to provide a defined and exclusive set of manipulation operations on the compendium iii. does not pollute the user's workspace (.GlobalEnv), into which I would have otherwise loaded the compendium

lawremi commented 6 years ago

Why is it a big deal if they do modify the compendium? Does it break the package? The R/Bioc philosophy seems to be to give the user freedom, even if it means they might shoot themselves in the foot every once in a while. If you load the compendium as a single object, the user only needs a single reference to it from the global environment.

lgeistlinger commented 6 years ago

Agreed, my Apply-functions (maPreprocApply, deApply, eaApply, and evalApply) might be confused with the canonical apply functions.

However, the name came closest to what these functions do: they apply (via BiocParallel::bplapply) a function of the indicated name to the compendium of datasets, e.g. a differential expression analysis in the case of deApply.

Open for other naming suggestions.

lawremi commented 6 years ago

Does the package even need to compute differential expression? Or could it take results as input? The focus is obviously on comparing enrichment results.

lgeistlinger commented 6 years ago

Generally starting from DE results was not an option as many enrichment methods use sample permutation (such as GSEA, SAFE, SAMGS, ...). This means they recompute dif- ferential expression in each permutation of the sample labels for gene set significance estimation and thus require the complete expression matrix.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

1ff3344 resolve hidden environment

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, TIMEOUT, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

58458b2 adapt for windows settings

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "TIMEOUT, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

lgeistlinger commented 6 years ago

@lawremi The compendia are now represented as a list of SummarizedExperiments and are not locked anymore in a hidden environment, but are rather passed around by the functions of the package. Thanks a lot for the input!

@dvantwisk any remarks from your side?

Also: what's this postprocessing error about? the building report does not provide any information here what the problem is ...

I'm also going to reduce time needed for the examples further to not run into the timeout on the Mac building machine ... although I cannot reproduce this long checking time on my MacBook Air here ...

dvantwisk commented 6 years ago

The post-processing error is occurring on our end, the issue should be corrected now and should be gone if you build with another version bump.

All I can recommend with regards to the timeout is to continue to try to reduce the run-time of check. There's nothing specific that I can give you on this, and unfortunately the different hardware that you and merida2 (our mac builder) is using to run the check will result in differing amounts of time needed to run it. The amount of time needed to run it must be less than ten minutes to pass without a timeout.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

300dfc1 reduce examples

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

aeaf765 reducing examples 2 98fb42b reducing examples 2

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

lgeistlinger commented 6 years ago

@dvantwisk: FYI: the error thrown on MAC OS X by serialize seems to be a general problem in R. I have contacted @mtmorgan on that and I am in good hope that this can be fixed soon in order to move on with the package.

dvantwisk commented 6 years ago

I am currently reviewing the package. I'll try to have something back for you soon.

dvantwisk commented 6 years ago

Your package is written very well. I don't have any knit-picks about the code or structure of the package, I only have a few suggestions. These are all things to consider, so message me back on whether you plan on doing any of these change or not.

GENERAL:

R/geo2kegg.R

R/gse62944.R

R/io.R

tests

lgeistlinger commented 6 years ago

Hi @dvantwisk

Thanks for the feedback! BiocFileCache is definitively on the TODO. I'll also add some unit tests, but I had this removed because I was running into timeouts as it would exceed the R CMD check time limit. Let me see what I can do about it.

dvantwisk commented 6 years ago

Just messaging you back to check in on your progress. Are you still working on the changes to BiocFileCache or do you need any help? Again, this change is only a suggestion and the package would be fine should you choose not to do it.

lgeistlinger commented 6 years ago

Hi @dvantwisk

Thanks for checking!

Unfortunately a couple of other things came in the way, but I'm almost ready with supporting BiocFileCache in the package and I am expecting to be able to push the new version within this week.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

8285ac0 supporting BiocFileCache

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

lgeistlinger commented 6 years ago

Hi,

I just pushed a new version of the package addressing several issues.

@dvantwisk I found caching particularly useful in the case of extensive preprocessing of the benchmarking compendia. Support to flexibly save and restore an already processed compendium based on BiocFileCache is now integrated in the loadEData function and documented in Section 6.1 of the vignette.

@lawremi To avoid confusion, all *Apply-functions of the package have been renamed. In particular:

deApply -> runDE
eaApply -> runEA
evalApply -> evalRelevance
maPreprocApply -> maPreproc

They now extend naturally from a single SummarizedExperiment to a list of SummarizedExperiments.

@mtmorgan Not sure whether there are any new insights concerning the serialize issue under Mac (http://bioconductor.org/spb_reports/GSEABenchmarkeR_buildreport_20180222155936.html#merida2_check_anchor), but I'm sure that there are currently quite some other urgent things due to the upcoming release.

I am thus currently approaching the issue like this (if not suggested otherwise):

#
# iterate over expression datasets
#
.iter <- function(exp.list, f, ..., parallel=NULL)
{
    is.windows <- is.null(parallel) && .Platform$OS.type == "windows"
    if(is.windows) parallel <- BiocParallel::SerialParam()
    if(is.null(parallel)) parallel <- BiocParallel::bpparam()

    fail <- TRUE
    trials <- 0
    while(fail && trials < 5)
    {   
        res <- try(
            BiocParallel::bplapply(exp.list, f, ..., BPPARAM=parallel),
            silent=TRUE)
        fail <- is(res, "try-error")
        trials <- trials + 1 
    }   
    if(fail) stop(res)
    return(res)
}

I found the serialize error to be thrown roughly in one out of three trials, so I typically make it through the check like this on my Mac and apparently also on the build system. Also, I never encounter this error under Linux.

dvantwisk commented 6 years ago

The changes related to BiocFileCache look great. I'm not sure if anyone you've @'d has anything else to add. If you are finished making changes, just message me and I'll move to accept the package.

lgeistlinger commented 6 years ago

Hi @dvantwisk I think we can go ahead + thanks a lot for taking the time to review the package!

bioc-issue-bot commented 6 years ago

Your package has been accepted. It will be added to the Bioconductor Git repository and nightly builds. Additional information will be posed to this issue in the next several days.

Thank you for contributing to Bioconductor!

mtmorgan commented 6 years ago

The master branch of your GitHub repository has been added to Bioconductor's git repository.

To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (https://github.com/lgeistlinger.keys is not empty), then no further steps are required. Otherwise, do the following:

  1. Add an SSH key to your github account
  2. Submit your SSH key to Bioconductor

See further instructions at

https://bioconductor.org/developers/how-to/git/

for working with this repository. See especially

https://bioconductor.org/developers/how-to/git/new-package-workflow/ https://bioconductor.org/developers/how-to/git/sync-existing-repositories/

to keep your GitHub and Bioconductor repositories in sync.

Your package will be included in the next nigthly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at

https://bioconductor.org/checkResults/

(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, you package will be available for download in the 'Devel' version of Bioconductor using biocLite(\"GSEABenchmarkeR\"). The package 'landing page' will be created at

https://bioconductor.org/packages/GSEABenchmarkeR

If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.