dms-vep / dms-vep-pipeline-3

Pipeline for analyzing deep mutational scanning (DMS) of viral entry proteins (VEPs)

create options to **not** require FASTQ input #141

Closed jbloom closed 3 weeks ago

jbloom commented 3 weeks ago

The GitHub repos typically do not track the actual FASTQs, but just the barcode counts and variant-barcode lookup table generated from them. The reason is that the FASTQs are very large files. Although those can be uploaded to the SRA, it would also be nice to have a cleaner option where someone can just re-run the pipeline with the pre-computed barcode counts and lookup table.

jbloom commented 3 weeks ago

Addressed in version 3.14.0 as follows:

Re-running the pipeline without the FASTQ files

The pipeline can perform the entire analysis, from the FASTQ files holding the results of the PacBio sequencing (used to build the barcode-variant lookup table) and the Illumina sequencing (used to count the barcodes) through to the final processed results. However, these FASTQ files are typically very large, so they are not tracked in the GitHub repo and must be stored somewhere else, such as on a computing cluster, either at the locations where they were originally generated or (for secondary use) wherever they were downloaded to from the NCBI Sequence Read Archive (SRA). The locations of those files are specified in the pacbio_runs and barcode_runs CSV files indicated in the config.yaml, and will be specific to the computing cluster on which the pipeline is being run.
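To see whether the FASTQ files listed in one of those CSV files are actually present on the current machine, something like the following could be used. This is only a sketch and not part of the pipeline: the function name `missing_fastqs` is hypothetical, the name of the column holding the FASTQ paths must be supplied by the caller (check the header of your own CSV), and the assumption that a cell may list several paths separated by `;` may not match your files.

```python
import csv
from pathlib import Path


def missing_fastqs(runs_csv, fastq_column, sep=";"):
    """Return FASTQ paths listed in `runs_csv` that do not exist locally.

    `fastq_column` and `sep` are assumptions made for this sketch; adjust
    them to match the layout of your own pacbio_runs / barcode_runs CSV.
    """
    missing = []
    with open(runs_csv, newline="") as f:
        for row in csv.DictReader(f):
            # a cell may hold one path or several joined by `sep`
            for path in row[fastq_column].split(sep):
                path = path.strip()
                if path and not Path(path).exists():
                    missing.append(path)
    return missing
```

If the returned list is non-empty, the full pipeline cannot be run from the FASTQ files on this machine as configured.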

However, for many re-use purposes, secondary users do not really need to re-process the FASTQ files, as the barcode counting and barcode-variant lookup table construction are fairly simple, and secondary users may be happy to use the lookup table and counts generated from prior processing of the FASTQ files. If you are using a repo where these counts and the barcode-variant lookup table are already computed and stored in the repo, you can start with those and avoid handling the FASTQ files at all. To do that, set the following options in the configuration YAML (config.yaml):

prebuilt_variants: results/variants/codon_variants.csv  # use codon-variant table already in repo
prebuilt_geneseq: results/gene_sequence/codon.fasta  # use gene sequence already in repo
...
use_precomputed_barcode_counts: true  # use barcode counts already in repo

Then running the pipeline will no longer require any FASTQ files, and will just use the precomputed variants and counts stored in the repo.
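Before launching a FASTQ-free run, it can help to confirm that the three options are set and that the files they name are actually in the repo. The following is a minimal sketch, not something the pipeline provides: `check_fastq_free_config` is a hypothetical helper, it assumes the two `prebuilt_*` values are local paths relative to the repo root (as in the example above), and it assumes `use_precomputed_barcode_counts` must be `true` to enable the precomputed counts, as its name and comment suggest.

```python
from pathlib import Path


def check_fastq_free_config(config, repo_root="."):
    """Return a list of problems that would prevent a FASTQ-free run.

    `config` is the already-parsed config.yaml as a dict. Hypothetical
    helper for illustration; the pipeline does its own validation.
    """
    problems = []
    for key in ("prebuilt_variants", "prebuilt_geneseq"):
        value = config.get(key)
        if not value:
            problems.append(f"{key} is not set")
        elif not (Path(repo_root) / value).is_file():
            # assumes a local path relative to the repo root
            problems.append(f"{key} points to a missing file: {value}")
    if config.get("use_precomputed_barcode_counts") is not True:
        problems.append("use_precomputed_barcode_counts is not true")
    return problems
```

The dict could be obtained with, for example, `yaml.safe_load(open("config.yaml"))` from PyYAML; an empty list back means none of these three checks failed.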