v0: Initial computational pipeline for genotyping and expression analysis

AlaaALatif commented 1 year ago

uses latest standard human reference genome (p14)
sample sheet parsing (single and paired-end data) and concatenating fastqs
adapter trimming
removal of rRNA
read alignment
variant calling
transcriptome quantification

graft commented 1 year ago

Very nice. A couple of random thoughts on the pipeline:

There seems to be a bunch of salmon-related code that you don't end up using?
I wonder what is the utility of doing a kallisto quantification given that there is a STAR quantification being generated anyway.

Some thoughts on the overall structure of the pipeline, invocation, etc., which don't necessarily apply uniquely to this workflow, but might be useful to consider here, and I'm not sure where else to start this discussion:

In my opinion it is a bad habit to check in actual file paths on a real filesystem, and especially to check them in using individual users' directories, as this is hard to maintain and leads to secret knowledge which gets lost. I suggest instead keeping only template strings in the config here, creating a common build directory for resources on c4 (etc.), and creating a config file on c4 itself that locates these resources and can be passed into the pipeline (perhaps, for convenience, through an environment variable users can set in their .bashrc).
According to the README, the pipeline also requires you to edit a config file in place in the source tree in order to specify your run; this is not very safe or usable, because the resulting changes are in a source git tree and will thus eventually be discarded, and certainly can't be tracked independently. It also means the pipeline can't be run on two different configurations at the same time. To allow this, these configurations could instead be written into an independent params file (I prefer config.yml) that lives in the project working directory. I.e., the workflow trades in the following dirs:
- Resource directory - where common resource files are located
- Workflow install directory - where this git repo is checked out, hopefully to a clean master state.
- Project working directory - The place where the pipeline config file is located, along with logs for this run and generated output, and any other project-specific code, hopefully tracked in git.
- Scratch directory - local scratch the pipeline can use
If "workflow install directory" overlaps with "project working directory", then we lose the independence of projects when running this pipeline.
You also sneak in here a very nice build workflow for prepping indexes, etc., for your pipeline, which is a crucial and oft-omitted utility of workflows. However, I'm not sure these things should be decoupled as they are. Each workflow we write will require some set of resources, and your workflow contains nicely modularized code for preparing these resources. However, the prep_genome workflow actually prepares resources for the bulk_RNASeq workflow in particular, which isn't very re-usable. You might instead write the bulk_RNASeq resource building workflow to run within an alternative build_resources mode of the bulk_RNASeq workflow itself (and leave out the actual prep_genome workflow). This way we can set up new pipelines so that, before we run them, we simply go to our resource directory and run the pipeline in build_resources mode, and these setup workflows can make use of a common library of build steps.
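The separation of resource, install, project and scratch directories described above could be sketched in shell roughly as follows. This is only an illustration: the variable names (`DPP_RESOURCE_DIR`, `DPP_INSTALL_DIR`, `DPP_SCRATCH_DIR`) and the helper functions are hypothetical and nothing in the pipeline defines them today. The one rule it enforces is that the project working directory must not sit inside the workflow install directory.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: none of these names exist in the pipeline today.

# Exit 0 when directory $1 is the same as, or nested inside, directory $2.
is_inside() {
    local child parent
    child=$(cd "$1" && pwd -P) || return 2
    parent=$(cd "$2" && pwd -P) || return 2
    case "${child}/" in
        "${parent}/"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the resolved layout, or fail if the project dir overlaps the install dir.
resolve_layout() {
    local project_dir=${1:-$PWD}
    # ${VAR:?msg} stops with a message when the variable is unset or empty.
    local resource_dir=${DPP_RESOURCE_DIR:?set DPP_RESOURCE_DIR, e.g. in ~/.bashrc}
    local install_dir=${DPP_INSTALL_DIR:?set DPP_INSTALL_DIR to the git checkout}
    local scratch_dir=${DPP_SCRATCH_DIR:-${TMPDIR:-/tmp}}

    if is_inside "$project_dir" "$install_dir"; then
        echo "error: project dir must not live inside the workflow install dir" >&2
        return 1
    fi
    printf 'resources=%s\ninstall=%s\nproject=%s\nconfig=%s\nscratch=%s\n' \
        "$resource_dir" "$install_dir" "$project_dir" \
        "$project_dir/config.yml" "$scratch_dir"
}
```

Calling `resolve_layout /path/to/project` from a launcher script would then give one place where the four locations are decided, rather than paths edited into the source tree.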

Certainly out of scope here, but I think all of these things should be fleshed out in a pipeline wrapper that handles the init, config, run, audit, etc. functions. Then the process looks something like:

Clone this repo
Put bin/dpp on your PATH
Go to your resource directory and run dpp build bulk_RNAseq
Run dpp setup bulk_RNAseq to install dependencies
Go to your working data directory and run dpp init <your project>. Maybe this prompts you to set up a git remote for your project and makes a default .gitignore for you
cd into <your project>, run dpp config bulk_RNAseq, which generates <your project>.bulk_RNAseq.config.yml
Run dpp run <etc.>.config.yml
Audit, peruse output in output, logs in logs, etc.
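As a rough illustration of the wrapper idea in the steps above, a `dpp` entry point could be a small subcommand dispatcher. Everything here is hypothetical: only `init` and `config` have toy bodies, the contents of the generated `.gitignore` and config file are placeholders, and `build`, `setup` and `run` are left as stubs.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a `dpp` wrapper; subcommand names follow the steps above,
# the bodies are placeholders rather than a real implementation.

dpp() {
    local cmd=${1:-}
    [ $# -gt 0 ] && shift
    case "$cmd" in
        init)
            # dpp init <project>: create the project dir with logs/ and output/.
            if [ -z "${1:-}" ]; then echo "usage: dpp init <project>" >&2; return 64; fi
            mkdir -p "$1/logs" "$1/output"
            printf 'output/\n' > "$1/.gitignore"   # assumed default, adjust as needed
            ;;
        config)
            # dpp config <workflow>: write <project>.<workflow>.config.yml in $PWD.
            if [ -z "${1:-}" ]; then echo "usage: dpp config <workflow>" >&2; return 64; fi
            local project cfg
            project=$(basename "$PWD")
            cfg="$project.$1.config.yml"
            printf 'workflow: %s\nsample_sheet: TODO\n' "$1" > "$cfg"
            echo "$cfg"
            ;;
        build|setup|run)
            echo "dpp $cmd: not implemented in this sketch" >&2
            return 2
            ;;
        *)
            echo "usage: dpp {build|setup|init|config|run} ..." >&2
            return 64
            ;;
    esac
}
```

Keeping each subcommand in one `case` arm makes it easy to see what state each step creates in the project directory, which is most of what an eventual audit function would need to check.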

AlaaALatif commented 1 year ago

Hi @graft, thank you for the detailed feedback. We've managed to address several of the points you bring up here:

legacy code for Salmon-based transcriptome quants has been removed
actual file paths have been replaced with placeholder paths in the configuration file config/nextflow.config
locations of resource directories have not been determined yet; hopefully these can be points of discussion during team meetings before we decide on them and subsequently implement them

Additional points, prefaced with "Certainly out of scope here", have not been incorporated. The aim is to leave room for team-wide discussions before going forward with the "extra" engineering work required to implement these.

dtm2451 commented 1 year ago

Ran into an issue using outputs from the pipeline: bcftools reported reference mismatches for two samples' formatted vcf outputs when I tried merging them with other samples' formatted vcf outputs.

Problem files:

both in dir /krummellab/data1/alaa/data/tests/bulk_rnaseq/rnax_test_results/halkias_coproject_p14/snps
- XNEO2-HS22-PPCB1-RSQ1.formatted.vcf.gz
- XNEO2-HS33-PECB1-RSQ1.formatted.vcf.gz

The Errors:

# HS22:
The REF prefixes differ: A vs C (1,1)
Failed to merge alleles at :8566 in sorted.XNEO2-HS22-PPCB1-RSQ1.formatted.vcf.gz
# HS33:
The REF prefixes differ: C vs T (1,1)
Failed to merge alleles at :3474 in sorted.XNEO2-HS33-PECB1-RSQ1.formatted.vcf.gz

MRE:

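# Needed to sort and index first (bcftools merge requires indexed input)
for file in *formatted*.vcf.gz; do
    [[ ${file} == sorted.* ]] && continue  # skip outputs left by a previous run
    bcftools sort "${file}" -o "sorted.${file}" -O z
    bcftools index "sorted.${file}"
done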

# Now merging vcfs
bcftools merge --file-list pool_rsq_libs.list -o merged_ground_truth.bcf -O b
# 'pool_rsq_libs.list' file described below

This merging code relies on a text file naming all the individual sample vcfs to merge. For a minimal example, make a pool_rsq_libs.list file containing:

sorted.XNEO2-HS20-PECB1-RSQ1.formatted.vcf.gz
sorted.XNEO2-HS21-PPCB1-RSQ1.formatted.vcf.gz
sorted.XNEO2-HS22-PPCB1-RSQ1.formatted.vcf.gz
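
To see which records disagree before merging, one option is to print the REF allele each file reports at the failing position. This is a diagnostic sketch, not part of the pipeline: `ref_at_pos` is a hypothetical helper, it relies only on `gzip` and `awk` (bgzip-compressed VCFs can be read with `gzip -dc`), and it matches on POS alone because the error messages above show no contig name before the colon.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print FILE, CHROM, POS, REF, ALT for every record at a given POS.
ref_at_pos() {
    local pos=$1 f
    shift
    for f in "$@"; do
        gzip -dc "$f" | awk -F'\t' -v pos="$pos" -v f="$f" \
            '!/^#/ && $2 == pos { print f "\t" $1 "\t" $2 "\t" $4 "\t" $5 }'
    done
}

# Example: compare the HS22 file against another sample at the reported position.
# ref_at_pos 8566 sorted.XNEO2-HS22-PPCB1-RSQ1.formatted.vcf.gz \
#                 sorted.XNEO2-HS21-PPCB1-RSQ1.formatted.vcf.gz
```

If two files show different REF alleles at the same CHROM and POS, checking each against the reference FASTA with `bcftools norm --check-ref w -f <reference.fa>` is a reasonable next step.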
