UCSF-DSCOLAB / data_processing_pipelines

A repository to store the existing pipelines to process the various CoLabs datasets

v0: Initial computational pipeline for genotyping and expression analysis #5

Closed AlaaALatif closed 1 year ago

AlaaALatif commented 1 year ago
graft commented 1 year ago

Very nice. A couple of random thoughts on the pipeline:

Some thoughts on the overall structure of the pipeline, invocation, etc. These don't apply uniquely to this workflow, but they might be useful to consider here, and I'm not sure where else to start this discussion:

Certainly out of scope here, but I think all of these things should be fleshed out in a pipeline wrapper that handles the init, config, run, audit, etc. functions. Then the process looks something like:

  1. Clone this repo
  2. Put bin/dpp on your PATH
  3. Go to your resource directory and run dpp build bulk_RNAseq
  4. Run dpp setup bulk_RNAseq to install dependencies
  5. Go to your working data directory and run dpp init <your project>. Maybe this prompts you to set up a git remote for your project and makes a default .gitignore for you
  6. cd into <your project> and run dpp config bulk_RNAseq, which generates <your project>.bulk_RNAseq.config.yml
  7. Run dpp run <etc.>.config.yml
  8. Audit: peruse output in output, logs in logs, etc.
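
To make the proposal above a little more concrete, here is a minimal sketch of how such a wrapper might dispatch its subcommands. Nothing here exists in the repo: the dpp function, the dpp_init and dpp_config helpers, the contents of the default .gitignore, and the keys written to the config file are all assumptions for illustration, and build, setup, and run are left as stubs.

```shell
# Hypothetical sketch of a bin/dpp dispatcher. Subcommand names follow the
# list above; every function body is a placeholder assumption.

dpp_usage() {
    echo "usage: dpp {build|setup|init|config|run} ..." >&2
}

dpp_init() {
    # Create <project>/ with output/ and logs/ plus a default .gitignore.
    # Ignoring output/ and logs/ is an assumption, not a project decision.
    local project="$1"
    mkdir -p "${project}/output" "${project}/logs"
    printf 'output/\nlogs/\n' > "${project}/.gitignore"
}

dpp_config() {
    # Write <project>.<pipeline>.config.yml in the current directory,
    # taking the project name from the directory name, and print the file name.
    local pipeline="$1"
    local project
    project="$(basename "$PWD")"
    local config="${project}.${pipeline}.config.yml"
    printf 'pipeline: %s\nproject: %s\n' "${pipeline}" "${project}" > "${config}"
    echo "${config}"
}

dpp() {
    if [ "$#" -eq 0 ]; then
        dpp_usage
        return 2
    fi
    local cmd="$1"
    shift
    case "${cmd}" in
        init)   dpp_init "$@" ;;
        config) dpp_config "$@" ;;
        build|setup|run)
            # Left unimplemented: these depend on decisions not yet made.
            echo "dpp ${cmd}: not implemented in this sketch" >&2
            return 1 ;;
        *)
            dpp_usage
            return 2 ;;
    esac
}
```

With these functions defined, dpp init myproj followed by cd myproj and dpp config bulk_RNAseq would produce myproj.bulk_RNAseq.config.yml. A real script would end with a call such as dpp "$@" so it can be run from PATH.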
AlaaALatif commented 1 year ago

Hi @graft, thank you for the detailed feedback. We've managed to address several of the points you bring up here:

Additional points, prefaced with "Certainly out of scope here", have not been incorporated. The aim is to leave room for team-wide discussions before going forward with the "extra" engineering work required to implement these.

dtm2451 commented 1 year ago

Ran into an issue using outputs from the pipeline: bcftools reported reference mismatches for two samples' formatted vcf outputs when I tried merging them with other samples' formatted vcf outputs.

Problem files:

The Errors:

# HS22:
The REF prefixes differ: A vs C (1,1)
Failed to merge alleles at :8566 in sorted.XNEO2-HS22-PPCB1-RSQ1.formatted.vcf.gz
# HS33:
The REF prefixes differ: C vs T (1,1)
Failed to merge alleles at :3474 in sorted.XNEO2-HS33-PECB1-RSQ1.formatted.vcf.gz
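
One way to find sites like these before merging is to dump CHROM, POS, and REF from each file and compare them pairwise. The sketch below is only an approximation of the check bcftools merge performs: it flags sites where the two REF alleles disagree over their shared prefix, ignoring case. The ref_conflicts function name and the .tsv file names are hypothetical.

```shell
# ref_conflicts A.tsv B.tsv
# Each input is a tab-separated CHROM, POS, REF table, for example from:
#   bcftools query -f '%CHROM\t%POS\t%REF\n' sample.vcf.gz > sample.tsv
# Prints CHROM, POS, REF-in-A, REF-in-B for sites present in both files
# whose REF alleles differ over their shared prefix (case-insensitive).
# If A lists a site more than once, only its last REF is compared.
ref_conflicts() {
    awk -F '\t' -v OFS='\t' '
        # First file: remember the REF allele for each CHROM/POS pair.
        NR == FNR { ref[$1 FS $2] = $3; next }
        # Second file: compare against the remembered REF, if any.
        {
            key = $1 FS $2
            if (!(key in ref)) next
            a = toupper(ref[key])
            b = toupper($3)
            n = (length(a) < length(b)) ? length(a) : length(b)
            if (substr(a, 1, n) != substr(b, 1, n)) print $1, $2, ref[key], $3
        }
    ' "$1" "$2"
}
```

Running it on tables extracted from a problem file and from any other sample should list the positions that would trip the merge, which helps tell whether the disagreement is isolated to a few sites or widespread.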

MRE:

# Needed to sort and index first
for file in *formatted*.vcf.gz; do
    bcftools sort "${file}" -o "sorted.${file}" -O z
    bcftools index "sorted.${file}"
done

# Now merging vcfs
bcftools merge --file-list pool_rsq_libs.list -o merged_ground_truth.bcf -O b
# 'pool_rsq_libs.list' file described below

This merging code relies on a text file naming all the individual sample vcfs to merge. For a minimal example, make a pool_rsq_libs.list file containing:

sorted.XNEO2-HS20-PECB1-RSQ1.formatted.vcf.gz
sorted.XNEO2-HS21-PPCB1-RSQ1.formatted.vcf.gz
sorted.XNEO2-HS22-PPCB1-RSQ1.formatted.vcf.gz