This is a snakemake workflow for analyzing targeted HiFi sequence datasets. The workflow was designed for Twist gene panels sequenced on PacBio Sequel IIe or Revio systems. Learn more in this App Note. Please note that this workflow is still in development; the input/output files and workflow steps may change in ways that affect the behavior of the program.
This workflow uses conda to install dependencies and set up the compute environment. The workflow is tested with conda v22+ on a SLURM cluster, and we recommend 16 cores with 4 GB RAM per core on a single machine (64 GB total memory). Otherwise, use at your own risk.
# create the base conda environment
$ conda create \
--channel conda-forge \
--channel bioconda \
--prefix ./conda_env \
python=3.9 snakemake mamba pysam lockfile
# activate the base conda environment
$ conda activate ./conda_env
# clone the github repo
$ git clone https://github.com/PacificBiosciences/HiFiTargetEnrichment.git workflow
We recommend using this GRCh38 reference. Refer to the FAQ for more details on input files and formats.
# create a couple directories
$ mkdir reference annotation cluster_logs
# drop your reference.fasta and reference.fasta.fai into reference directory
# and adjust the [ref][fasta|index] paths in workflow/config.yaml
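# if you only have the fasta, the .fai index can be generated with samtools
# (samtools is not installed by the steps above; this is just an example)
$ samtools faidx reference/reference.fasta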
# Run full workflow starting with demultiplexing for batch <batch_name>
$ sbatch workflow/run_snakemake.sh <batch_name> <biosample_csv> <hifi_reads> <target_bed> [<probe_bed>]
# Use the <target_bed> as the probes bed if you do not have the probes. This will allow for generation of HS_Metrics
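# e.g., with illustrative file names, re-using the target bed as the probe bed:
$ sbatch workflow/run_snakemake.sh batch01 biosample.csv movies/hifi_reads.bam targets.bed targets.bed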
Haplotagged, aligned BAM files and phased VCF files for each sample are available in that sample's whatshap subdirectory. These files were produced using DeepVariant and WhatsHap. See the FAQ for a detailed directory structure.
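To confirm haplotagging worked, one option is to count reads carrying the HP tag with samtools; the path below is a placeholder, so check the whatshap subdirectory for the actual file name:
$ samtools view batches/<batch_name>/<sample_id>/whatshap/<sample_id>.haplotagged.bam | grep -c 'HP:i:'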
If pbsv is used in the pipeline, a VCF of structural variants can be found in each sample's pbsv subdirectory.
The demux directory contains unmapped BAM files for each sample, named by barcode ID. The directory also contains all of the output from lima, PacBio's demultiplexing software.
See the stats directory for many useful tables and summary figures. The file hs_metrics_consolidated_quickview.tsv is a good place to start to understand run performance. Definitions of each statistic can be found here.
read_categories.png gives a graphical summary of read lengths and the percentage of reads that are demultiplexed, PCR duplicates, unmapped, and on or off target.
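For a quick look at the consolidated HS Metrics table from the terminal (column and less are standard Linux tools; path follows the directory structure below):
$ column -t -s$'\t' batches/<batch_name>/stats/hs_metrics_consolidated_quickview.tsv | less -S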
This file provides a mapping of barcodes to samples. It is used for sample demultiplexing, to trim adapter sequences, and to rename files with sample IDs. The file is comma-delimited: the first column is a pair of barcode IDs (forward and reverse, which must match the headers in the barcodes fasta) and the second column is a sample ID (which must be unique). The csv must contain a header row, but the header names can be arbitrary.
Here is an example:
Barcode,Biosample
UDI_1_A01_f--UDI_1_A01_r,HG002
UDI_2_B01_f--UDI_2_B01_r,HG003
UDI_3_C01_f--UDI_3_C01_r,HG004
UDI_4_D01_f--UDI_4_D01_r,HG005
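A quick sanity check that every barcode in the CSV exists in the barcode fasta might look like this (file names are examples; assumes GNU sed/coreutils):
$ tail -n +2 biosample.csv | cut -d, -f1 | sed 's/--/\n/g' | sort -u > used_barcodes.txt
$ grep '^>' workflow/barcodes/<barcode_fasta> | sed 's/^>//' | sort -u > available_barcodes.txt
$ comm -23 used_barcodes.txt available_barcodes.txt   # any output is a barcode missing from the fasta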
Please use hifi_reads.bam, the native format for PacBio HiFi data, which is an unmapped BAM file ("ubam"). Learn more here.
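To confirm your input really is an unmapped BAM, the header should typically contain no @SQ lines (samtools assumed to be available):
$ samtools view -H hifi_reads.bam | grep -c '^@SQ'   # expect 0 for a ubam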
By default the pipeline uses the Tru-Seq barcodes from Twist. You can change your barcode file by providing a different fasta file in workflow/barcodes. For example, the barcode file "Sequel_96_barcodes_v1.IDT_pad.fasta" was used for this dataset and in this preprint. Adjust the path to the barcode file in workflow/config.yaml and make sure you update your biosample_csv file.
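To see exactly which barcode IDs your biosample_csv can reference, list the headers of the fasta you configured, e.g.:
$ grep '^>' workflow/barcodes/Sequel_96_barcodes_v1.IDT_pad.fasta | head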
What is the target.bed file?
This is a standard BED file that specifies the coordinates of the regions of interest. It is used by HS Metrics to compute target coverage, on-target mapping rate, and other statistics.
Here is an example targets.bed:
chr1 155234451 155244627 GBA
chr5 70049638 70078522 SMN2
chr5 70925030 70953942 SMN1
chr22 42126498 42130810 CYP2D6
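A quick way to check the total panel footprint from the BED (standard awk):
$ awk '{sum += $3-$2} END {print sum, "bp targeted"}' targets.bed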
What is the probes.bed file? What if I don't have one?
The probes.bed file specifies the coordinates of the probes used to prepare the target capture library. The file usually contains hundreds or even thousands of probes, depending on the size of the panel. You may not have access to the probe design because it is proprietary; in that case, you can use the target.bed file in place of the probes.bed file (you will pass target.bed twice when you call the workflow command).
Here is an example of probes.bed:
chr1 48037726 48037846
chr1 48038919 48039039
chr1 48040111 48040231
chr1 48041516 48041636
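If you do have a probe design, one way to see how well the probes tile the targets is bedtools coverage (bedtools is not part of this workflow's environment; this is only a sketch):
$ bedtools coverage -a targets.bed -b probes.bed   # last column is the fraction of each target covered by probes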
python v3.9+
conda v22+
lima v2.5+
pbmarkdups v1+
pbmm2 v1.7+
deep variant v1.4+
whatshap v1.1+
glnexus v1.4.1+
pbsv v2.8+
htslib v1.13+
hsmetrics v1.3.1
pharmCAT
pangu v0.2.1
1) demultiplex HiFi reads with lima
2) mark PCR duplicate HiFi reads by sample with pbmarkdups
3) align HiFi reads to reference with pbmm2
4) call small variants with DeepVariant v1.4.0
5) phase small variants with WhatsHap
6) haplotag aligned BAMs with WhatsHap
7) call SVs with pbsv
8) jointly call all variants (excl. pbsv) with glnexus
9) [optionally] call PGx star (*) alleles with PharmCAT and pangu
10) [optionally] annotate output gVCF with dbsnp or other database containing variant IDs
11) run some QC reports including hsMetrics
.
├── cluster_logs # slurm stderr/stdout logs
├── reference
│ ├── reference.chr_lengths.txt # cut -f1,2 reference.fasta.fai > reference.chr_lengths.txt
│ ├── reference.fasta
│ └── reference.fasta.fai
├── batches
│ └── <batch_name> # batch_id
│ ├── benchmarks/ # cpu time per task
│ ├── demux/ # demultiplexed hifi reads
│ ├── glnexus/ # intermediate cohort vcf files
│ ├── logs/ # per-rule stdout/stderr logs
│ ├── picard/ # interval lists for hsmetrics
│ ├── stats/ # batch-wide collated reports, including HS Metrics summary
│ ├── whatshap_cohort/ # joint-called, phased SNV
│ ├── merged_gvcf/ # [optional] annotated gvcf with all batch samples included
│ ├── <sample_id 1>/ # per-sample results, one for each sample
│ : ...
│ └── <sample_id n>/ # per-sample results, one for each sample
│ ├── coverage/ # read coverage by target beds
│ ├── deepvariant/ # intermediate DV vcf, incl g.vcf per sample
│ ├── hs_metrics/ # picard hsmetrics for this sample
│ ├── markdup/ # unaligned reads with PCR dups marked
│ ├── pangu/ # [optional] HiFi CYP2D6 star calling results
│ ├── pbsv/ # structural variant calls
│ ├── pharmcat/ # [optional] pharmcat results
│ ├── read_metrics/ # per-read information
│ └── whatshap/ # phased small variants; merged haplotagged alignments
│
└── workflow # clone of this repo
# create the base conda environment
$ conda create \
--channel conda-forge \
--channel bioconda \
--prefix ./conda_env \
python=3.9 snakemake mamba pysam lockfile
# activate the base conda environment
$ conda activate ./conda_env
# clone the github repo
$ git clone https://github.com/PacificBiosciences/HiFiTargetEnrichment.git workflow
# create a couple directories
$ mkdir reference annotation cluster_logs
# drop your reference.fasta and reference.fasta.fai into reference
# and adjust the [ref][fasta|index] paths in workflow/config.yaml
# Annotation [optional]
# drop your annotation file and index into annotation
# and adjust the [annotate][variants] path in workflow/config.yaml
# ensure that [annotate][gVCF] is set to True in workflow/config.yaml
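# for example, assuming a bgzip-compressed, tabix-indexed dbSNP VCF
# (bgzip/tabix ship with htslib; the file name is only an illustration)
$ bgzip dbsnp.vcf && tabix -p vcf dbsnp.vcf.gz
$ mv dbsnp.vcf.gz dbsnp.vcf.gz.tbi annotation/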
# PharmCAT [optional]
# Set [pharmcat][run_analysis] to True in workflow/config.yaml
# run the full workflow including demux/markdup/mapping from a single HiFi movie for batch <batch_name>
# Use the <target_bed> as the probes bed if you do not have the probes. This will allow for generation of HS_Metrics
$ sbatch workflow/run_snakemake.sh <batch_name> <biosample_csv> <hifi_reads> <target_bed> [<probe_bed>]
# run just variant calling and phasing for a set of bams following demux/markdup/mapping on SMRT Link (SL)
# <hifi_reads> can be a directory of bams or a textfile with one bam path per line (fofn)
$ sbatch workflow/run_snakemake_SLmapped.sh <batch_name> <hifi_reads> <target_bed> [<probe_bed>]
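# a fofn is a plain-text file listing one bam path per line, e.g.
$ ls /path/to/mapped_bams/*.bam > hifi_reads.fofn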