ejh243 / BrainFANS

Complex Disease Epigenomics Group's quality control and analysis pipelines for DNA methylation arrays, SNP arrays, BS-Seq, ATAC-Seq and ChIP-Seq

[Bug]: DNAm install libraries script not working on clean install #164

Closed sof202 closed 1 month ago

sof202 commented 2 months ago

What happened?

The libraries required by the DNAm pipeline fail to install on a completely fresh installation.

The root cause of the error seemed to be wateRmelon. My current hypothesis is that the remotes package is attempting to install the incorrect dependency versions from Bioconductor (resulting in errors like 'This package is not available for your version of R'). This can be remedied by installing each of the problem dependencies individually with the BiocManager::install() function.


In addition, BiocManager::install("DelayedMatrixStats") (DelayedMatrixStats is a dependency of the minfi package) fails on our HPC because of a version conflict with Matrix: version 1.4-1 is already installed in the site library (.Library.site), but minfi requires version >= 1.5.0. This will need to be remedied with devtools::install_version("Matrix", version = "1.5"), as we already do for matrixStats (note: a later version of Matrix may also work).
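To make the version mismatch concrete, here is a minimal shell sketch of a pre-flight check that compares an installed version against a required minimum. The `version_ge` helper is hypothetical (it is not part of BrainFANS), and it assumes a `sort` that supports `-V` (GNU coreutils or a recent BSD/macOS sort). The version numbers are the ones reported in this issue.

```shell
# Hypothetical helper: succeeds when version $1 is at least version $2.
# R package versions such as 1.4-1 use '-' as a separator, so it is
# normalised to '.' before comparing.
version_ge() {
    vg_have=$(printf '%s' "$1" | tr '-' '.')
    vg_need=$(printf '%s' "$2" | tr '-' '.')
    # sort -V orders version strings; if the smaller of the two is the
    # required version, the installed version is new enough.
    [ "$(printf '%s\n%s\n' "$vg_need" "$vg_have" | sort -V | head -n 1)" = "$vg_need" ]
}

# Versions taken from the error in this issue.
if version_ge "1.4-1" "1.5.0"; then
    echo "Matrix is new enough"
else
    echo "Matrix is too old; install >= 1.5.0 before DelayedMatrixStats"
fi
```

On a machine with R available, the installed version could be obtained with `Rscript -e 'cat(as.character(packageVersion("Matrix")))'` and passed in as the first argument.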


Here is the list of wateRmelon dependencies that incorrectly install when using remotes::install_github("schalkwyk/wateRmelon"):

[!IMPORTANT] To ensure maximum portability, we should consider installing all of the dependencies to a separate library directory specifically for this pipeline. This has pros and cons.

Pros

The main pro is that we can test if necessary packages for the pipeline are always installing properly (as installing packages to a default library directory might lead to false positives/negatives due to previously installed packages from other pipelines/projects).

This method would also be easy to implement: add lib.loc = "path/to/R/library" to each library() call and lib = "path/to/R/library" to each package installation call (install.packages() takes lib rather than lib.loc), which can be done with a couple of sed commands.
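As a rough sketch of what such a sed command might look like, the snippet below rewrites simple one-argument `library(pkg)` calls read from standard input. The `add_lib_loc` function and the `LIB_DIR` value are placeholders for illustration, and the pattern deliberately ignores calls that already carry extra arguments, so it would need checking against the real scripts before use.

```shell
# Placeholder library location; replace with the pipeline's real path.
LIB_DIR="path/to/R/library"

# Hypothetical helper: append lib.loc to simple library(pkg) calls.
# Calls that already have more than one argument are left untouched.
add_lib_loc() {
    sed -E "s|library\(([A-Za-z0-9._\"']+)\)|library(\1, lib.loc = \"${LIB_DIR}\")|g"
}

# Demonstration on two sample lines rather than on a real script.
printf 'library(minfi)\nlibrary(wateRmelon)\n' | add_lib_loc
```

Because already-rewritten calls no longer match the pattern, running the helper twice over the same file should not add a second lib.loc argument. Installation calls would need a separate substitution that adds `lib = ...` instead.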

Cons

The main con is that we will be installing multiple instances of the same packages which takes up unnecessary space.

It also will become quite annoying to implement this into every pipeline in BrainFANS and communicate this to users.

Further still, this might not even work for every package installation, as the lib.loc option might not get carried down into dependency installation calls (resulting in more errors, and so more manual installation calls would be required).

Alternatives

renv lets you create isolated R environments (a bit like conda environments). I believe some RSE members already use it.
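For orientation, a minimal sketch of an renv workflow for this pipeline might look like the following. It is an assumption about how renv could be applied here, not a tested fix for this issue. The `renv_step` wrapper is hypothetical and prints each command by default; setting `RUN=1` would execute them, which requires R with renv installed and should be done from the project root.

```shell
# Hypothetical wrapper: dry-run by default, printing the command it would run.
# Set RUN=1 to actually execute each step with Rscript.
renv_step() {
    if [ "${RUN:-0}" = "1" ]; then
        Rscript -e "$1"
    else
        echo "Rscript -e '$1'"
    fi
}

renv_step 'renv::init(bare = TRUE)'                # create an empty project library
renv_step 'renv::install("bioc::minfi")'           # install from Bioconductor
renv_step 'renv::install("schalkwyk/wateRmelon")'  # install from GitHub
renv_step 'renv::snapshot()'                       # record versions in renv.lock
```

Other users could then recreate the same library from the committed lockfile with `renv::restore()`.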

packrat is a similar option if renv doesn't tickle your fancy (be warned that it is largely deprecated and slow). checkpoint is another similar option.

If we want to go through the efforts of creating our own conda repositories, we can use conda instead (this would be fine if some of the packages weren't being obtained from GitHub).

How can the bug be reproduced?

Step 1 - Create new library for fresh install

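# .libPaths() silently ignores directories that do not exist, so create it first.
mkdir -p ~/R-new-library
echo ".libPaths('~/R-new-library/')" >> ~/.Rprofile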

Step 2 - Run the install libraries script

# dataDir isn't actually required for this error to appear.
Rscript .../array/DNAm/preprocessing/installLibraries.r dataDir 
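A variant of the reproduction that leaves ~/.Rprofile untouched is sketched below. It relies on the standard `R_LIBS_USER` environment variable; the temporary directory is an assumption standing in for a fresh library, and the check is skipped when Rscript is not on the PATH.

```shell
# Assumption: a throwaway directory stands in for the fresh library.
# R only uses library directories that already exist; mktemp -d creates it.
FRESH_LIB="$(mktemp -d)"
export R_LIBS_USER="$FRESH_LIB"

if command -v Rscript >/dev/null 2>&1; then
    # Print the library search path to confirm the fresh library is picked up.
    Rscript -e 'print(.libPaths())'
else
    echo "Rscript not found; skipping the library path check"
fi
```

Running the installLibraries.r command from Step 2 in the same shell session should then install into the empty library. Note that the site library remains on the search path either way, so the Matrix conflict described above can still occur.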

Relevant log output

> BiocManager::install("DelayedMatrixStats")     
...
...
Error in loadNamespace(j <- imp[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
  namespace ‘Matrix’ 1.4-1 is already loaded, but >= 1.5.0 is required
...
...
Warning message:
In install.packages(...) :
  installation of package ‘DelayedMatrixStats’ had non-zero exit status

sof202 commented 1 month ago

Attempting renv in PR #175