Requesting 79 Bioconductor packages

aitap commented 9 months ago

Hi!

While installing a large number of packages from both CRAN and Bioconductor on a container running the docker.io/rocker/r2u image, I got some of the Bioconductor dependencies from source. I can start a fresh copy of the container and run:

# podman run -it docker.io/rocker/r2u
apt update && apt -y full-upgrade
R -e 'setRepositories(ind=1:4); install.packages(c(
"derfinderHelper", "EBarrays", "RTCGA", "sesameData", "GenomicScores",
"derfinder", "splots", "NanoStringNCTools", "org.Rn.eg.db", "biocthis",
"geNetClassifier", "gDRtestData", "RTCGA.miRNASeq", "CytoML",
"BioNet", "rols", "pRolocdata", "sesame", "RTCGA.mutations",
"JASPAR2020", "LOLA", "msdata", "CAMERA", "RMassBankData", "Homo.sapiens",
"PFAM.db", "ALLMLL", "gDRutils", "RTCGA.CNV", "JASPAR2018", "drawProteins",
"hgu133plus2.db", "RBioFormats", "grasp2db", "phastCons100way.UCSC.hg38",
"RTCGA.rnaseq", "simpIntLists", "rpx", "microRNA", "gDRstyle",
"MotifDb", "wiggleplotr", "hgu95av2.db", "JASPAR2014", "KOdata",
"faahKO", "flowWorkspaceData", "DAPARdata", "RTCGA.RPPA", "recount",
"RTCGA.methylation", "MafH5.gnomAD.v3.1.2.GRCh38", "pRoloc",
"humanStemCell", "cellHTS2", "org.Bt.eg.db", "SPIA", "RTCGA.mRNA",
"RTCGA.clinical", "human.db0", "rae230aprobe", "org.Sc.sgd.db",
"GeomxTools", "lydata", "ChemmineOB", "MafDb.1Kgenomes.phase1.hs37d5",
"rsbml", "biodb", "ReportingTools", "gwascat", "rae230a.db",
"humanCHRLOC", "hgu133a.db", "HubPub", "pasilla", "DAPAR", "SNPlocs.Hsapiens.dbSNP144.GRCh37",
"JASPAR2016", "CCl4"
))'

It will say Install system packages as root... twice, install a number of *.deb packages on the second run and then install the remaining packages from source (some of these are large and may benefit from an increased download timeout). I plucked the source package names from the console by grepping for ^trying URL. I really appreciate installing only these 79 packages from source instead of all the 1316 dependencies, and I would be grateful if you could package these too. Please let me know if I can help!
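For anyone wanting to repeat the grep step, here is a minimal sketch. It assumes the R console output was saved to a file and that source downloads appear as lines starting with `trying URL` and ending in a `name_version.tar.gz` tarball; the function name `source_pkgs` is made up for illustration.

```shell
# Hypothetical helper: read an R install log on stdin and print the unique
# names of packages that were downloaded as source tarballs.
source_pkgs() {
  # Keep only "trying URL" lines whose file name looks like name_version.tar.gz
  # and reduce each one to the package name (the part before the underscore).
  sed -nE 's|^trying URL .*/([^/_]+)_[^/]*\.tar\.gz.?$|\1|p' |
    LC_ALL=C sort -u
}

# Usage (install.log is an assumed file name):
#   R -e '...' 2>&1 | tee install.log
#   source_pkgs < install.log
```

R package names cannot contain underscores, so splitting at the underscore after the last slash should be safe for names with dots such as org.Rn.eg.db.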

Speaking of system dependencies,

CytoML wants libxml2 (and will fail to compile without libxml/tree.h)
ChemmineOB wants libopenbabel and Eigen (and will fail to compile without openbabel/obutil.h and Eigen/Core)
rsbml wants libsbml (and will fail to configure without a corresponding .pc file)
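As a rough sketch, the three system dependencies above could be mapped to development packages like this. The Debian/Ubuntu package names are assumptions (not stated in this thread) and should be checked with `apt-cache search` on the target release; the function name `sysdeps_for` is made up for illustration.

```shell
# Hypothetical lookup: print the (assumed) Debian/Ubuntu development packages
# that provide the headers or .pc file a given R package needs to compile.
sysdeps_for() {
  case "$1" in
    CytoML)     echo "libxml2-dev" ;;                    # libxml/tree.h
    ChemmineOB) echo "libopenbabel-dev libeigen3-dev" ;; # openbabel/obutil.h, Eigen/Core
    rsbml)      echo "libsbml5-dev" ;;                   # libsbml .pc file
    *)          return 1 ;;                              # unknown package: print nothing
  esac
}

# Usage, as root inside the container:
#   apt-get install -y $(sysdeps_for CytoML) $(sysdeps_for ChemmineOB) $(sysdeps_for rsbml)
```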

eddelbuettel commented 9 months ago

Seventy-nine is a lot. That increases what we do for BioConductor by almost 20%, and this is all done manually.

Is there a chance you can break it down by package group?

(And yes, nobody gives us anything precompiled so BioConductor is always from source.)

eddelbuettel commented 9 months ago

Another thing you could do is to look at

suppressMessages({
    library(data.table)
})

## get most used BioC packages from https://bioconductor.org/packages/stats/bioc/bioc_pkg_scores.tab
S <- fread("https://bioconductor.org/packages/stats/bioc/bioc_pkg_scores.tab", showProgress=FALSE)
setnames(S, c("package", "score"))
S[, lcpkg := tolower(package)]
S <- S[order(-score), .SD[1,], by="lcpkg"]

the BioConductor score. Maybe we can add the most important of these packages (and its dependency tail) first, then the next and so on.

aitap commented 9 months ago

I definitely wouldn't want to increase your manual load by 20%. These are not the most popular packages, but they are in the top 90% by score:

r2u <- ... # list of 'all', 'amd64' packages in the repository for Jammy
pk <- c(
"derfinderHelper", "EBarrays", "RTCGA", "sesameData", "GenomicScores",
"derfinder", "splots", "NanoStringNCTools", "org.Rn.eg.db", "biocthis",
"geNetClassifier", "gDRtestData", "RTCGA.miRNASeq", "CytoML",
"BioNet", "rols", "pRolocdata", "sesame", "RTCGA.mutations",
"JASPAR2020", "LOLA", "msdata", "CAMERA", "RMassBankData", "Homo.sapiens",
"PFAM.db", "ALLMLL", "gDRutils", "RTCGA.CNV", "JASPAR2018", "drawProteins",
"hgu133plus2.db", "RBioFormats", "grasp2db", "phastCons100way.UCSC.hg38",
"RTCGA.rnaseq", "simpIntLists", "rpx", "microRNA", "gDRstyle",
"MotifDb", "wiggleplotr", "hgu95av2.db", "JASPAR2014", "KOdata",
"faahKO", "flowWorkspaceData", "DAPARdata", "RTCGA.RPPA", "recount",
"RTCGA.methylation", "MafH5.gnomAD.v3.1.2.GRCh38", "pRoloc",
"humanStemCell", "cellHTS2", "org.Bt.eg.db", "SPIA", "RTCGA.mRNA",
"RTCGA.clinical", "human.db0", "rae230aprobe", "org.Sc.sgd.db",
"GeomxTools", "lydata", "ChemmineOB", "MafDb.1Kgenomes.phase1.hs37d5",
"rsbml", "biodb", "ReportingTools", "gwascat", "rae230a.db",
"humanCHRLOC", "hgu133a.db", "HubPub", "pasilla", "DAPAR", "SNPlocs.Hsapiens.dbSNP144.GRCh37",
"JASPAR2016", "CCl4"
)
S <- lapply(list(
 fread("https://bioconductor.org/packages/stats/bioc/bioc_pkg_scores.tab", showProgress=FALSE),
 fread("https://bioconductor.org/packages/stats/data-annotation/annotation_pkg_scores.tab", showProgress=FALSE),
 fread("https://bioconductor.org/packages/stats/data-experiment/experiment_pkg_scores.tab", showProgress=FALSE)
), function(S) {
 setnames(S, c("package", "score"))
 S[, lcpkg := tolower(package)]
 S <- S[order(-score), .SD[1,], by="lcpkg"]
 pkS <- S[package %in% pk]
 pkS$CDF <- ecdf(S$score)(pkS$score)
 deps <- tools::package_dependencies(
  pkS$package, which = 'strong', recursive = TRUE
 )
 pkS$strongdeps <- lengths(deps)
 pkS$not.in.r2u <- lengths(lapply(deps, function (deps)
  setdiff(
   tolower(deps),
   c(gsub('^r-(cran|bioc)-', '', r2u), tolower(tools:::.get_standard_package_names()$base))
  )
 ))
 pkS
})
lapply(S, head)

[[1]]
            lcpkg        package score       CDF strongdeps not.in.r2u
1: reportingtools ReportingTools   694 0.9125602        188          1
2:        motifdb        MotifDb   576 0.9058911         50          0
3:         sesame         sesame   560 0.9047795        143          1
4:           spia           SPIA   557 0.9044090         14          0
5:          rtcga          RTCGA   528 0.9021860        138          0
6:        gwascat        gwascat   520 0.9010745        137          0

[[2]]
            lcpkg        package score       CDF strongdeps not.in.r2u
1: hgu133plus2.db hgu133plus2.db  1537 0.9954666         47          0
2:   org.rn.eg.db   org.Rn.eg.db  1392 0.9950888         46          0
3:   homo.sapiens   Homo.sapiens   980 0.9913109        105          0
4:     jaspar2020     JASPAR2020   923 0.9909331          1          0
5:        pfam.db        PFAM.db   756 0.9901776         46          0
6:     hgu133a.db     hgu133a.db   737 0.9890442         47          0

[[3]]
            lcpkg        package score       CDF strongdeps not.in.r2u
1:        pasilla        pasilla   879 0.9857143        117          0
2:     sesamedata     sesameData   647 0.9816327        104          0
3:         msdata         msdata   482 0.9653061          0          0
4: rtcga.clinical RTCGA.clinical   302 0.9489796        139          1
5:         faahko         faahKO   249 0.9387755         98          0
6:     prolocdata     pRolocdata   210 0.9285714         76          0

So if I had to pick one package, it could be the 3-megabyte experiment package pasilla, or MotifDb which doesn't have extra dependencies.

eddelbuettel commented 9 months ago

Doing it throttled may work well when I have an idle (evening) moment to take a look. Both noted.

eddelbuettel commented 9 months ago

I just added pasilla plus (working down the "score" list) DSS, DMRcate, MungeSumstats.

I may get to MotifDb next time. BioConductor count now at 396, so 400 looms....

eddelbuettel commented 9 months ago

I added a handful more, following the BioConductor score from the top down until it got to MotifDb. As it involved two more dependencies, the count is now at 401 BioC packages.

eddelbuettel commented 8 months ago

Just added five more from BioConductor, and will try to chip away at this slowly.

That said, I will also close this now as it wasn't so much an 'issue' as a bit of misunderstanding about scope, process and how the sauce is made here. Please feel free to reopen if you think there is something here I missed.

aitap commented 8 months ago

Thank you very much for all your packaging work! These binaries will save everyone a lot of time. I think I now owe you a favour :) The revdepcheck virtual machine upgrades without a hitch.

(Revisiting this thread, on the other hand, is somewhat embarrassing.)

eddelbuettel commented 8 months ago

Glad to hear it is of help, and yes, these things tend to just magically work once you have the magic sauce sorted out.

I am still going down the 'karma list' so now we are at 'top 260' minus the that currently does not build on Linux for BioC 3.18 (as I learned on the BioC slack) and BiocInstaller which I skipped on purpose (fearing it may cross wires).

eddelbuettel / r2u

Requesting 79 Bioconductor packages #53