
Fastq-to-VCF

High-throughput Illumina mammalian whole genome sequence analysis and joint genotyping, using the BWA-mem2 aligner, DeepVariant variant caller and GLnexus joint genotyper.

This workflow was written for NCI Gadi HPC and uses the nci-parallel custom utility to parallelise tasks across the cluster with PBS Pro and Open-MPI. Adapting it to other infrastructures will require adjusting this parallelisation method.

Usage overview

Clone the repository and change into it

git clone git@github.com:Sydney-Informatics-Hub/Fastq-to-VCF.git
cd Fastq-to-VCF

The scripts all reside within the Scripts directory and contain relative paths to the base working directory, so all scripts are to be executed from within the base Fastq-to-VCF working directory. Changing directory paths within the scripts is not recommended; if this is done, downstream scripts will also need to be modified.

All required directories are made by the scripts. Edits need only be made to the PBS directives (NCI project code, lstorage paths, and resource requests). Any other edits (such as specifying the reference sequence and cohort name) are described at the steps below. No other changes to the workflow scripts are required.

All software used in this workflow is globally installed on Gadi, except GLnexus, which is run from a Singularity image file (download instructions are provided at the relevant step below).

Input files

  1. Copy or symlink your paired gzipped fastq files to ./Fastq
    • If your fastq files use '1' and '2' pair designations instead of 'R1' and 'R2', please rename/symlink them to 'R1' and 'R2'
  2. Copy or symlink your reference fasta to ./Reference (see the example commands after this list)
  3. Create your sample configuration file, which has one row per sample
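
Symlinking might look like the following (the source paths are placeholders for your own data):

ln -s /path/to/my_run/*.fastq.gz ./Fastq/
ln -s /path/to/my_genome.fasta ./Reference/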

Example configuration file

#SampleID LabSampleID Seq_centre Lib
NA12890 NA12890_mini Unknown 1
NA12878 NA12878_mini Unknown 1

Explanation of columns

  1. SampleID: The unique identifier that can be used to associate all fastq files that belong to the sample. Must be unique in the cohort and present as the prefix of all fastq files belonging to the sample
  2. LabSampleID: The sample ID that you want to use in your final output files eg BAM, VCF. Can be the same as column 1 or different. Must be unique in the cohort
  3. Seq_centre: The sequencing centre that produced the fastq (no whitespace)
  4. Lib: Library ID. Can be left blank if not relevant; will default to '1'

Parallelisation

Some steps are parallel by sample, and other steps are parallel by fastq ("scatter-gather parallelism"). Alignment is executed after first splitting the fastq into smaller pairs, enabling a much higher degree of parallelisation than simply aligning the input fastq pairs as-is. Only the MultiQC and final joint genotyping steps are executed as single (not parallel) jobs.

For the parallel jobs, there are 3 workflow scripts per step:

  1. <stepname>_make_input.sh: This script is executed on the login node with bash and creates the input file required to parallelise the tasks. Each line in the output file Inputs/<stepname>.inputs contains the details for one parallel task, such as sample ID, fastq, reference, etc. The PBS script (2) will launch the task script (3) once for every line in this 'inputs' file. Any changes regarding the inputs (such as the reference sequence name) should be made in this script; no parameter changes should be required in the PBS (2) or task (3) scripts.
  2. <stepname>_run_parallel.pbs: This script is submitted with qsub and runs the parallel job. Users will need to edit the NCI project code, lstorage directory, and resource requests (see the sketch after this list). Note that the requests for the job are for the whole job, not for each single task. The variable NCPUS within the script body sets the number of CPUs allocated to each single task. Tasks may be executed all at once (eg 4 CPUs per task, 24 tasks, running on 4 x 24 = 96 CPUs). If the total number of CPUs requested for the job is less than the number required to run all tasks at once, the job will simply assign tasks to CPUs in the order in which they appear in the inputs list, until all tasks are run or walltime is exceeded. For jobs where all tasks are expected to have similar walltime, requesting enough total CPUs to run all tasks at once (up to the Gadi queue limits) is reasonable. For jobs where unequal walltimes are expected, for example in cancer studies where tumour samples have greater coverage than normal samples, size-sorting the inputs largest to smallest, or separating the job into batches with different inputs lists and walltimes, will provide better CPU and KSU efficiency.
  3. <stepname>.sh: This is the task script containing the commands required to run the job. It is launched once per task by the PBS script (2) and is not submitted directly by the user. There should be no need to edit this script unless changes to the actual analysis task are warranted.
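
As a guide, the PBS directives a user typically edits look something like the following (a minimal sketch with placeholder values; the real directives are already present in each _run_parallel.pbs script):

#PBS -P ab12                             # NCI project code (placeholder)
#PBS -l storage=scratch/ab12+gdata/ab12  # lstorage paths (placeholder)
#PBS -l ncpus=96                         # total CPUs for the whole job
#PBS -l mem=380GB                        # total memory for the whole job
#PBS -l walltime=02:00:00                # walltime for the whole job

NCPUS=4                                  # CPUs per single task, set in the script body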

Benchmarking

SIH workflows aim to provide benchmarking metrics and resource usage guidelines. At the time of writing (July 31 2024), this workflow is in its infancy, so these details are not yet available. They will be added over time.

Parallel workflows always benefit from benchmarking. This extra workload at the start will save you time and KSU in the long run. At minimum, it is recommended to benchmark the alignment step, which uses the most KSU of the whole workflow. The most efficient alignment thread count and walltime per task will differ depending on the fastq split size, and will also vary somewhat between datasets, depending on reference genome quality, raw data quality, sample genetic complexity, etc. A split size of 10 million reads (a split value of 40,000,000 lines) is a good starting point for benchmarking alignment against your indexed reference genome with 4, 6, 8 or 12 CPUs per alignment task.

The duplicate marking and sorting step should also be tested on one sample (typically the largest BAM) to establish compute requirements for this step, which cannot be parallelised via splitting.

We plan to develop a workshop on benchmarking bioinformatics workflows; in the meantime, this benchmarking template may be useful.

Compute resource usage summary and CPU efficiency calculations can be obtained using this Gadi usage report script. To use, download a copy of the script, change into the directory where your PBS logs are saved, and run the script with perl.
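
For example (run from the base working directory; the script filename here is an assumption, use the name of the copy you downloaded):

cd PBS_logs
perl gadi_usage_report.pl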

Logs

Tool logs are written to the Logs directory; for example, the align step creates per-task logs in Logs/BWA.

PBS logs are written to PBS_logs.

Run the workflow

0. Index the reference genome

The reference fasta to index should be within (or symlinked to) the ./Reference directory.

This step need only be done once per reference. If previously generated indexes for bwa-mem2 and SAMtools (with the same tool versions) are available, copy or symlink them to the ./Reference directory along with the fasta. They must all have the same filename prefix.

Edit the script Scripts/index_reference.pbs, adjusting the PBS directives (NCI project code, lstorage paths, and resource requests) as described above.

As a guide, indexing the human T2T genome completed in 42 minutes using 85 GB RAM.

Submit:

qsub Scripts/index_reference.pbs

Successful indexing should create 6 index files within the ./Reference directory. Check the PBS logs PBS_logs/index_reference.e and PBS_logs/index_reference.o for errors.
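
As a quick sanity check, list the index files (the fasta name here is a placeholder, and the exact extensions assume current bwa-mem2 and SAMtools versions):

# expect .0123, .amb, .ann, .bwt.2bit.64 and .pac from bwa-mem2, plus .fai from SAMtools
ls -1 ./Reference/mygenome.fasta.*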

1. FastQC

This script will run FastQC over every single fastq file, using 1 thread per fastq file.

Ensure you have copied or symlinked your paired gzipped fastq files to ./Fastq.

Make parallel inputs file

bash Scripts/fastqc_make_input.sh

Edit the PBS script

Edit Scripts/fastqc_run_parallel.pbs:

Submit

qsub Scripts/fastqc_run_parallel.pbs

Check

grep "exited with status 0" PBS_logs/fastqc.e | wc -l

2. FastQC multiQC

This script can be run on the login node for small numbers of fastqc reports, or submitted to the compute queue for larger cohorts.

To run on login node for small cohorts

bash Scripts/multiqc_fastqc.pbs

To run on compute queue for larger cohorts

Edit the script Scripts/multiqc_fastqc.pbs

Submit:

qsub Scripts/multiqc_fastqc.pbs

Output report will be in ./MultiQC/FastQC.

3. Split fastq

Paired FASTQ files are split into smaller files with fastp v.0.20.0 (Chen et al. 2018) to enable higher levels of parallelisation.

Fastq split size

The number of lines per output fastq file is governed by the -S parameter to fastp. If you would like to adjust the number of reads per split pair, edit the variable split within Scripts/split_fastq_make_input.sh. Note that -S sets the number of lines, so set split to the desired number of reads per split fastq multiplied by 4.

A split fastq pair containing 2 million read pairs (4 million reads total) takes approximately 2 minutes to align on 12 CPUs (human, T2T reference genome). The same data using 500,000-read splits requires around 1 minute. Creating split sizes that are too small can reduce efficiency due to the overhead of tool and reference index loading. Since the RAM required for BWA-mem2 is higher than for BWA-mem1, out-of-memory failures can occur when requesting small per-task CPU values. For this reason, we recommend splitting the fastq to 10 million reads, ie setting -S 40000000 within the fastp command. 10 million paired-end reads on 12 CPUs on Gadi's normal nodes require ~40 GB RAM and 14-15 minutes per task. The alignment step size-sorts the split fastq to ensure that split files created from the end of large input fastqs are processed last, thus maximising CPU efficiency.

Apart from adjusting the value of split, no other edits are required to the make_input script.
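
For example, to split to 10 million reads per pair, the single edit within Scripts/split_fastq_make_input.sh would look like this (a sketch of the bash assignment; the variable name split is per the description above):

split=40000000 # 10 million reads x 4 lines per read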

Make parallel inputs file

bash Scripts/split_fastq_make_input.sh

Edit the PBS script

Edit Scripts/split_fastq_run_parallel.pbs:

Submit

qsub Scripts/split_fastq_run_parallel.pbs

Check

grep "exited with status 0" PBS_logs/split_fastq.e | wc -l


4. Split fastq check

This step ensures you have equal numbers of reads in your R1 and R2 files, and that the splitting process has not introduced any errors.

It is a parallel script, yet there is no need to create a new inputs file, as the ./Inputs/split_fastq.inputs file from the splitting step is reused.

Submit

Adjust the PBS directives as described for previous parallel steps, accommodating 12 CPUs per parallel task, then submit:

qsub Scripts/split_fastq_check_fastq_input_vs_split_output_run_parallel.pbs

Check

grep "has passed all checks" ./Logs/Check_fastq_split/* | wc -l

5. Align the split fastq

Alignment is performed with bwa-mem2, which is up to 2X faster than bwa-mem1 with identical results. The -K flag is applied to fix the input batch size, so that thread count does not affect alignment output.
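
The effect of -K is easiest to see in a sketch of the alignment command (illustrative only; the task script defines the real command, and the read group and filenames here are placeholders):

bwa-mem2 mem -K 100000000 -t ${NCPUS} -R "${readgroup}" ${ref} ${fastq1} ${fastq2} > ${outprefix}.sam

With a fixed -K batch size, the same input yields the same alignments whether the task runs on 4 or 12 CPUs.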

This is the first step in the workflow that requires your sample configuration file. Please ensure this matches the format described in Example configuration file.

Make parallel inputs file

First, open the script Scripts/align_make_input.sh and update the variable ref to your reference fasta, eg ref=./Reference/mygenome.fasta. Check that the regex for fastq pair matching matches your read filenames and adjust as required. The expected filename format is _R1.fastq.gz, _R1.fq.gz, .R1.fastq.gz or .R1.fq.gz.

Flowcell and lane are extracted from the Illumina-formatted read IDs. If your read IDs do not match this format, please adjust the regex as required. Allowance is made for SRA-derived reads by checking for read IDs beginning with '@SRR' and using 'unknown' for flowcell and lane.
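
For reference, a standard Illumina read ID carries the flowcell and lane in colon-delimited fields 3 and 4, so the extraction amounts to something like this (a sketch only; the actual regex lives in the make_input script, and the fastq name is a placeholder):

# read ID format: @instrument:run:flowcell:lane:tile:x:y ...
header=$(zcat sample_R1.fastq.gz | head -1)
flowcell=$(echo "$header" | cut -d: -f3)
lane=$(echo "$header" | cut -d: -f4)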

Save the script and provide the sample configuration file name as a command line argument:

bash Scripts/align_make_input.sh <samples.config>

Edit the PBS script

Edit Scripts/align_run_parallel.pbs:

Submit

qsub Scripts/align_run_parallel.pbs

Check

grep "exited with status 0" PBS_logs/align.e | wc -l

6. Create final BAMs (merge, mark duplicates, sort, index)

This job is parallel by sample, so the number of tasks will be equal to the number of samples.

It merges (gathers) the split (scattered) alignment files, completing the scatter-gather parallelism enabled by physically splitting the fastq. Merging is performed with sambamba, duplicate marking with samblaster, and sorting and indexing with SAMtools.

Make parallel inputs file

bash Scripts/final_bam_make_input.sh <samples.config>

Edit the PBS script

Edit Scripts/final_bam_run_parallel.pbs:

Submit

qsub Scripts/final_bam_run_parallel.pbs

7. BAM QC

There are many tools for running QC on BAM files. We have a beta nextflow repository BamQC-nf that has modules for SAMtools stats, SAMtools flagstats, Qualimap and Mosdepth. Qualimap gives very comprehensive results; however, it is resource intensive, so may be prohibitively costly for large cohorts. SAMtools stats gives detailed output for a very small resource footprint.

The table below shows the optimal benchmarked resources for the NA12890 Platinum Genomes sample:

| Job | CPU | Queue | Mem_requested | Mem_used | CPUtime_mins | Walltime_mins | CPU_Efficiency | Service_units |
|---|---|---|---|---|---|---|---|---|
| samtools_flagstats | 4 | normalbw | 36.0GB | 35.19GB | 33.93 | 8.75 | 0.97 | 0.73 |
| samtools_stats | 2 | normal | 8.0GB | 8.0GB | 56.8 | 31.67 | 0.9 | 2.11 |
| mosdepth | 2 | normal | 8.0GB | 8.0GB | 18.95 | 11 | 0.86 | 0.73 |
| qualimap | 14 | normalbw | 126.0GB | 98.14GB | 319.57 | 82.18 | 0.28 | 23.97 |

Users are encouraged to either use the BamQC-nf repository, or take a copy of a trio of scripts from one of the parallel steps of this workflow and modify them to run the desired BAM-QC tool, using the above benchmarks as a guide for resource requests.

8. BAM QC multiQC

To run multiqc over BAM-QC outputs, simply specify the BAM-QC results directories as well as the desired output directory. For large cohorts, it may be necessary to submit this to the scheduler. For smaller cohorts, running on the login node is fast and light enough.

module load multiqc/1.9
multiqc <bamqcdir1> <bamqcdir2> -o <BAM-QC-out>

9. Create per sample gVCF

10. Joint genotype
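
Steps 9 and 10 use DeepVariant and GLnexus respectively. GLnexus is the one tool not globally installed on Gadi (see above); fetching its Singularity image might look like the following (a sketch only: the image path and version are assumptions, so check the GLnexus releases for the current image):

module load singularity
singularity pull glnexus.sif docker://ghcr.io/dnanexus-rnd/glnexus:v1.4.1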