UCSF-Costello-Lab / LG3_Pipeline

The original LG3 pipeline
https://github.com/UCSF-Costello-Lab/LG3_Pipeline

Add built-in support for `lg3 run QC1`, `lg3 run QC2`, and `lg3 run QC3` #170

Open HenrikBengtsson opened 2 years ago

HenrikBengtsson commented 2 years ago

Issue

Currently one has to do something like(*):

$ mkdir -p ~/pipelines/exomeQualityPlots 
$ cd ~/pipelines/exomeQualityPlots 
$ git clone git@github.com:SRHilz/exomeQualityPlots.git
$ cd ${LG3_HOME}
$ ln -s ~/pipelines/exomeQualityPlots exomeQualityPlots

in order to run `lg3 run QC1`, etc.

(*) @ivan108 says there's more to it than this.
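As a rough illustration of what the manual setup above is meant to achieve, here is a minimal sketch of a preflight check that a QC step could run first. The function name `check_qc_scripts` is hypothetical and not part of LG3; it only assumes that the scripts are expected under `${LG3_HOME}/exomeQualityPlots`, as the `ln -s` command above suggests.

```shell
# Hypothetical helper (not part of LG3): verify that the exomeQualityPlots
# scripts are reachable from an LG3 installation directory.
check_qc_scripts() {
  lg3_home=$1
  qc_dir="${lg3_home}/exomeQualityPlots"
  # 'test -d' follows symbolic links, so a symlinked clone also passes.
  if [ -d "${qc_dir}" ]; then
    echo "OK: ${qc_dir}"
  else
    echo "MISSING: ${qc_dir}" >&2
    return 1
  fi
}

# Example call (assumes LG3_HOME is set in the calling shell):
#   check_qc_scripts "${LG3_HOME}" || exit 1
```

A check like this would only report the missing link; it does not cover the additional requirements hinted at in the footnote.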

Objective

The user should not have to do the above; it should come with the installed LG3 pipeline, i.e. with `module load lg3`. It should just work.

Constraints

ivan108 commented 2 years ago

Issue: some exomeQualityPlots scripts require different software versions compared to versions in the current master lg3.conf

  1. A newer version of BEDTOOLS is required to run get_coverage.sh - bedtools2/2.26.0. Currently lg3.conf loads bedtools2/2.16.2

  2. A newer R version is required to run plot_qualinfo.R, to accommodate dplyr and other packages. E.g. r/3.6.3. Currently lg3.conf loads r/3.2.0

  3. Default python 2.7.5 on C4 doesn't have a pysam module, needed to run afTERThought.py

Possible solutions:

For 1. and 2., we could create a user-level lg3.conf and load the needed modules there, to override some items in the master lg3.conf?

For 3., install pysam globally for the default Python 2.7.5 on C4 (sysadmin help needed).
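The user-level override idea for items 1 and 2 could look roughly like the sketch below. It assumes that the configuration files are plain shell files that get sourced; the function name `load_conf` and the file paths in the usage comment are hypothetical, not something the pipeline is known to provide.

```shell
# Sketch only: load a master config, then an optional user-level config.
# Because the user file is sourced last, its assignments win.
load_conf() {
  master_conf=$1
  user_conf=$2
  . "${master_conf}"
  if [ -f "${user_conf}" ]; then
    . "${user_conf}"
  fi
}

# Example call (hypothetical paths):
#   load_conf "${LG3_HOME}/lg3.conf" "${HOME}/.lg3.conf"
```

With this ordering, a user file only needs to list the entries that differ from the master file; everything else keeps the master value.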

HenrikBengtsson commented 2 years ago

> Issue: some exomeQualityPlots scripts require different software versions compared to versions in the current master lg3.conf
>
> 1. A newer version of BEDTOOLS is required to run get_coverage.sh - bedtools2/2.26.0. Currently lg3.conf loads bedtools2/2.16.2

Ideally, we could use the same bedtools2/2.16.2 (or newer) for both steps and still produce backward compatible results. However, until we have verified that is the case, we could use a separate:

module load bedtools2/2.26.0    2> /dev/null && BEDTOOLS_QC=$(which bedtools)

for the exomeQualityPlots steps. I think this should still work because all we're after in the two cases is the output of `which bedtools`, which we record, so it doesn't matter if another version is loaded or unloaded later.
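To illustrate why recording the path is enough, here is a small self-contained sketch. It does not use `module` or bedtools at all: loading and unloading are emulated by editing `PATH`, and `faketool` is a made-up stand-in with two fake versions. It also uses `command -v` rather than `which`, purely for portability.

```shell
# Sketch: emulate "module load"/"module unload" by editing PATH.
# The directory layout and the 'faketool' program are made up for illustration.
demo_dir=$(mktemp -d)
for v in 2.16.2 2.26.0; do
  mkdir -p "${demo_dir}/${v}/bin"
  printf '#!/bin/sh\necho "faketool v%s"\n' "${v}" > "${demo_dir}/${v}/bin/faketool"
  chmod +x "${demo_dir}/${v}/bin/faketool"
done

old_path=${PATH}
# "Load" the old version and record the absolute path of the tool.
PATH="${demo_dir}/2.16.2/bin:${old_path}"; TOOL=$(command -v faketool)
# "Load" the new version and record its path under a separate name.
PATH="${demo_dir}/2.26.0/bin:${old_path}"; TOOL_QC=$(command -v faketool)
# "Unload" both versions again.
PATH=${old_path}

"${TOOL}"      # runs the 2.16.2 copy
"${TOOL_QC}"   # runs the 2.26.0 copy
```

Both recorded paths stay usable after `PATH` is restored, which is the property the `BEDTOOLS_QC` approach relies on. This only holds as long as the tool does not depend on other environment variables set by its module.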

> 2. A newer R version is required to run plot_qualinfo.R, to accommodate dplyr and other packages. E.g. r/3.6.3. Currently lg3.conf loads r/3.2.0

Tests that I ran a few weeks ago showed that the core pipeline ran through just fine with r/4.1.1. I didn't check the plots, but I doubt that would make a difference.

> 3. Default python 2.7.5 on C4 doesn't have a pysam module, needed to run afTERThought.py

As discussed in https://github.com/UCSF-CBI/c4-help/issues/57#issuecomment-952325199, the user who runs the LG3 pipeline can install this themself.

However, in the long run, it would probably be nice if the LG3 pipeline came with this pre-installed. That can be done by installing pysam into a virtual environment that is part of the LG3 installation (https://www.c4.ucsf.edu/howto/python.html), but let's wait with that approach. We probably wanna do something similar for all the R packages needed.
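A minimal sketch of how a bundled virtual environment might be picked up at run time is shown below. The `venv` directory name and the `lg3_python` function are assumptions made for illustration; nothing here reflects how LG3 actually locates Python.

```shell
# Hypothetical helper: prefer a Python interpreter from a virtual environment
# inside the LG3 installation, else fall back to whatever is on PATH.
lg3_python() {
  lg3_home=$1
  venv_python="${lg3_home}/venv/bin/python"
  if [ -x "${venv_python}" ]; then
    echo "${venv_python}"
  else
    command -v python
  fi
}

# One-time setup, not run here because it needs network access:
#   virtualenv "${LG3_HOME}/venv"        # or: python3 -m venv "${LG3_HOME}/venv"
#   "${LG3_HOME}/venv/bin/pip" install pysam
#
# Example call:
#   PYTHON=$(lg3_python "${LG3_HOME}")
```

The fallback keeps existing setups working for users who have installed pysam themselves.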

> Possible solutions:
>
> For 1. and 2., we could create a user-level lg3.conf and load the needed modules there, to override some items in the master lg3.conf?
>
> For 3., install pysam globally for the default Python 2.7.5 on C4 (sysadmin help needed).

So, we don't want the LG3 pipeline to depend on the system where it runs, or on sysadmins. Our goal should be to design it so that anyone can install it and run it anywhere, not just on C4.

ivan108 commented 2 years ago

The exomeQualityPlots pipeline is now integrated into LG3 (see the develop_v2 branch).