Bioconductor / bioc_docker

[DEPRECATED] Docker containers for Bioconductor
https://github.com/bioconductor/bioconductor_docker
Artistic License 2.0

devel_proteomics build cancelled due to timeout #61

Closed sneumann closed 5 years ago

sneumann commented 6 years ago

The Docker Hub builds get cancelled after about two hours. Locally, devel_proteomics2 builds on a 3-year-old workstation in about 36 minutes (36:36.04 elapsed). Yours, Steffen

sneumann commented 6 years ago

These are the (to be) installed packages:

'ASEB', 'Cardinal', 'CausalR', 'CellNOptR', 'ChemmineOB', 'cisPath', 'clippda', 
'CNORdt', 'CNORfeeder', 'CNORode', 'cydar', 'deltaGseg', 'DEP', 'DEqMS', 'diffcyt', 
'DominoEffect', 'Doscheda', 'drawProteins', 'eiR', 'fCI', 'fmcsR', 'GraphPAC', 
'HPAanalyze', 'IMMAN', 'InterMineR', 'iPAC', 'IPPD', 'kimod', 'LPEadj', 'mlm4omics', 
'MSstatsQC', 'MSstatsQCgui', 'omicRexposome', 'PAA', 'Path2PPI', 'Pbase', 'PCpheno', 
'PECA', 'pepXMLTab', 'PGA', 'pgca', 'phosphonormalizer', 'plgem', 'PLPE', 
'PowerExplorer', 'ppiStats', 'procoil', 'ProCoNA', 'pRolocGUI', 'ProteomicsAnnotationHubData', 
'Pviz', 'qcmetrics', 'qPLEXanalyzer', 'QuartPAC', 'rain', 'RCASPAR', 'Rchemcpp', 'Rcpi', 
'readat', 'ROTS', 'RpsiXML', 'sapFinder', 'ScISI', 'shinyTANDEM', 'SLGI', 'SpacePAC', 
'spliceSites', 'topdownr', 'TPP', 'XINA', 'CardinalWorkflows', 'faahKO', 'gcspikelite', 
'iontreeData', 'metaMSdata', 'msPurityData', 'msqc1', 'MSstatsBioData', 'mtbls2', 
'plasFIA', 'ProData', 'PtH2O2lipids', 'qPLEXdata', 'RMassBankData', 'topdownrdata'

I note quite a few Data packages: faahKO, RMassBankData, mtbls2, plasFIA ... I think some should be moved. Or maybe we could even get rid of the MassSpectrometryData view in the containers altogether? I also note a few packages that would make sense in metabolomics as well: Cardinal, ChemmineOB, fmcsR, Pviz, Rchemcpp. Maybe we can move them towards protmetcore? Or maybe some are dependencies of the Data packages and would be gone if we removed those. Yours, Steffen @lgatto could you have a look?

sneumann commented 6 years ago

I checked locally: removing the MassSpectrometryData view saves around 3 minutes (10%) of build time, and the image size comes down from ~11GB to ~9GB. The time saving is small, but I recommend removing the data nevertheless.

sneumann commented 5 years ago

Hi, I am (also) working on the timeout of devel_protmetcore2. To make the timing optimisations a bit less manual, I use the following set of scripts to determine which package takes how much time.

First, copy and paste the list of packages to be installed from the beginning of the docker build output. The build can then be interrupted; it does not need to finish.

Then, inside the FROM image (started with `docker run --rm -it FROMIMAGE bash`), a tiny script installs the package specified on the command line, and a loop across all these packages gathers the build times (including dependencies and all downloads).

installit.R:

#!/usr/bin/env Rscript
## Install the single package named on the command line
args = commandArgs(trailingOnly=TRUE)

library(BiocManager)
BiocManager::install(args[1])

In the container shell:

apt install time # for /usr/bin/time

# Install all packages
for F in 'bioassayR' 'BioNetStat' ...  'gridExtra' ; do /usr/bin/time --output "$F.timing" ./installit.R "$F" ; done

# Collect timing:
for F in *.timing ; do echo -n "$F" "        " ; grep -v output "$F" | cut -d " " -f 3 | cut -d e -f 1 ; done

The most time-consuming packages can then be moved to other Dockerfiles.
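To pick out the slowest packages from the `*.timing` files, the elapsed times can be converted to seconds and sorted. The following is only a sketch: it assumes GNU time's default output format (an `[h:]m:ss` value directly followed by the word `elapsed`), and the helper names `elapsed_to_seconds` and `rank_timings` are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU time's default output, e.g.
#   "10.00user 1.00system 0:12.50elapsed 88%CPU (...)"
# Function names are hypothetical helpers, not part of any existing tool.

# Convert an elapsed value of the form [h:]m:ss[.cc] to seconds.
elapsed_to_seconds() {
    echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; printf "%.2f\n", s }'
}

# Print "seconds<TAB>package" for each *.timing file in a directory, slowest first.
rank_timings() {
    local dir="${1:-.}" f elapsed
    for f in "$dir"/*.timing; do
        [ -e "$f" ] || continue
        # Take the value in front of "elapsed"; files without that word are skipped.
        elapsed=$(sed -n 's/.* \([0-9:.]*\)elapsed.*/\1/p' "$f" | head -n 1)
        [ -n "$elapsed" ] || continue
        printf '%s\t%s\n' "$(elapsed_to_seconds "$elapsed")" "$(basename "$f" .timing)"
    done | sort -t "$(printf '\t')" -k1,1nr
}
```

Running `rank_timings . | head` in the directory holding the timing files would list the ten slowest installs. Because the script keys on the word `elapsed`, an extra line that GNU time writes when a command fails does not disturb the parsing.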

I am also preparing a way to use the download statistics (from the BioC statistics) to eventually move packages with low download numbers automatically into a devel_metabolomics_extra2 package.

Yours, Steffen

sneumann commented 5 years ago

So, the stats code would be:

pkgs_to_install <- c('BiocVersion', 'biocViews', 'ProtGenerics', 'mzR',
  'MSnbase', 'msdata', 'BiocParallel', 'knitr', 'rmarkdown', 'httr', 'XML',
  'zlibbioc')

yr <- format(Sys.time(), "%Y")

## http://bioconductor.org/packages/stats/bioc/xcms/xcms_2018_stats.tab
## http://bioconductor.org/packages/stats/data-experiment/msdata/msdata_2018_stats.tab
staturl <- "http://bioconductor.org/packages/stats/"

downloads <- t(sapply (pkgs_to_install, function(pkg) {

    urls <- paste(staturl, c("bioc", "data-experiment"), "/",
                 pkg, "/", pkg, "_", yr, "_stats.tab", sep="")
    pkgdownloads <- sapply(urls, function(url) {
        ## Each package is tried under both repositories; a failed download yields NA
        stats.tab <- try(read.delim(url), silent=TRUE)
        if (inherits(stats.tab, "try-error")) {
            NA
        } else {
            stats.tab[grep("all", stats.tab[,"Month"]), "Nb_of_distinct_IPs"][1]
        }
    }, USE.NAMES=FALSE)

    pkgdownloads
}))

## Retain the top fraction (topX) of packages, ranked by distinct downloading IPs
topX <- 0.75
popular <- sort(apply(downloads, MARGIN=1, FUN=function(x) max(x, na.rm=TRUE)),
                decreasing = TRUE)
popular <- popular[seq_len(floor(length(popular) * topX))]
names(popular)

But I realised that you don't want this added to install.R at build time, because you don't want a package to come and go whenever the stats change the ordering. Instead, the list can be added statically to devel and release, based on last year's download stats. Yours, Steffen
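One way to produce such a static list once, outside the image build, is sketched below. It assumes that the stats files have already been downloaded and that they are tab-separated with a header row containing `Month` and `Nb_of_distinct_IPs`, which are the column names the R code refers to. The function names `yearly_ips` and `top_fraction` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: works on already-downloaded *_stats.tab files.
# Assumed layout: tab-separated, header row with "Month" and "Nb_of_distinct_IPs".

# Print the yearly total of distinct IPs (the row whose Month is "all").
yearly_ips() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $(col["Month"]) == "all" { print $(col["Nb_of_distinct_IPs"]); exit }' "$1"
}

# Read "package<TAB>count" lines on stdin and print the names of the
# top fraction (default 0.75) of packages, highest count first.
top_fraction() {
    local frac="${1:-0.75}"
    sort -t "$(printf '\t')" -k2,2nr | awk -F '\t' -v frac="$frac" '
        { name[NR] = $1 }
        END { keep = int(NR * frac); for (i = 1; i <= keep; i++) print name[i] }'
}
```

The output of `top_fraction` can then be pasted into the package list of the Dockerfile's install script, so the set of installed packages only changes when someone regenerates it deliberately.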

lshep commented 5 years ago

This still appears to be an issue; see the most recent log:

https://cloud.docker.com/u/bioconductor/repository/registry-1.docker.io/bioconductor/devel_proteomics2/hub-builddetail/bxaqrfgtxsl2vqtrvdtxey

lshep commented 5 years ago

The release_proteomics image has not built in over a year. The last successful build was for R 3.4.4 and Bioc 3.6. If this is not remedied before the next release, we will remove it from the README and the list of supported dockers. The devel image has also not built for over a year.

sneumann commented 5 years ago

There was a successful build yesterday: https://cloud.docker.com/u/bioconductor/repository/docker/bioconductor/devel_proteomics2/builds Yours, Steffen