The pipeline is designed to perform joint variant calling on large Plasmodium
Whole Genome Sequencing (WGS) datasets. It follows GATK
best practices and the
MalariaGEN Pf6 data-generating
methods.
Briefly, raw reads are first mapped to the human GRCh38 reference genome to
remove host reads, with the remaining reads being mapped to the Pf3D7 reference
genome (PlasmoDB_44). The mapped reads are then processed using GATK
's
MarkDuplicates
and BaseRecalibrator
tools. Following this, analysis-ready
mapped reads for each isolate are used to generate per-sample calls
(HaplotypeCaller
/GVCF mode). These per-sample calls are combined and run
through a joint-call step (GenotypeGVCFs
) to obtain unfiltered multi-sample
VCFs. A machine learning-based variant filtration strategy (VQSR via GATK
's
VariantRecalibrator
), or/and hard-filteration strategy, can then be used to
retain high-quality variants.
The main
branch of the repository is employed for Plasmodium falciparum.
However, the pipeline is expected to work with other Plasmodium species,
provided the corresponding configuration and reference files are given (See
nextflow.config). A separate vivax
branch with configuration and reference
files for Plasmodium vivax under development and can be checked with the
link.
The pipeline has been tested on MacOS and Linux system. The software
dependencies (including the version numbers of used software) are defined within the
env/nf.yaml
Conda recipe (for Nextflow engine) and the env/snp_call_nf.yaml
Conda recipe (for the pipeline itself). Installtion instruction is listed below.
The estimated time of instation is about 5-20 minutes.
Change to a working folder that is large enough to store the snp call result files. Git clone the pipeline and change directory to the pipeline folder
cd YOUR_WORKING_DIR # replace `YOUR_WORKING_DIR` with your real path
git clone https://github.com/bguo068/snp_call_nf.git
cd snp_call_nf
Install conda from here if you have not
Install the nf
and the snp_call_nf
conda environments
# NOTE: The nextflow engine and the pipeline may need different version of java.
# We use two different Conda environments to address the conflict.
# install nextflow
conda env create -f env/nf.yaml
# install snp_call_nf
conda env create -f env/snp_call_nf.yaml
Link the reference files (internal users) or prepare them by yourself (external users)
ln -s /local/projects-t3/toconnor_grp/bing.guo/ref/* ref/
or
ln -s /local/projects-t2/CVD/Takala-Harrison/Cambodia_Bing/ref/* ref/
ln -s /local/data/Malaria/Projects/Takala-Harrison/Cambodia_Bing/ref/* ref/
cd ref
conda activate snp_call_nf
python3 prep_ref_files.py
conda deactivate
cd ..
conda activate nf; nextflow main.nf
test_data
foldercd ena_data; ena_data/download_ena_data.sh
(it may take hours to download all the
real data from ENA). Once downloaded, change directory to project folder
(where main.nf
is located) by run cd ..
, and the pipeline can be run
with conda activate nf; nextflow main.nf --fq_map ena_data/ena_fastq_map.tsv
conda activate nf; nextflow main.nf -profile sge
test_data
folderfastq_map.tsv
file to include the raw
reads(fastq.gz
files) of your own samples.Split chromosomes to better parallelize joint call:
--split intervals
to split the chromosome into more
intervals.Enable vqsr
variant filtering. By default, vqsr
is not enabled. To enable
this option, you can specify --vqsr true
to the nextflow command line
./fastq_map.tsv
HostId
is the index of host genomes from 0, see params.host
in nextflow.config fileMateId
can be 0 for single-end sequencing, or 1 and 2 for pair-end sequencing./nextflow.config
clusterOptions = "-P toconnor-lab -cwd -V"
to reflect your lab specifc sge qsub option./main.nf
result/read_length
folder: report the raw read length for each samples/runsresult/flagstat_host
and result/flagstat_parasite
: flagstat of
aligned reads (aligned with host genome and parasite genome respectively)result/recalibrated
: analysis ready bam filesresult/coverage
: read converage based on the analysis-ready bam files result/flagstat
: bam flatstat based on the analysis-ready bam files result/gvcf
: single-sample vcf filesresult/hardfilt
: multiple-sample (joint-call) vcf files with hard filterating annotationsresult/vqsrfilt
: multiple-sample (joint-call) vcf files with vqsr-based filterating annotations.
You can decide to use one of these, result/hardfilt
and result/vqsrfilt
.This pipeline was originally developed for the posseleff
project.
If you find this pipeline useful, please consider citing our preprint:
Guo, B., Borda, V., Laboulaye, R., Spring, M. D., Wojnarski, M., Vesely, B. A., Silva, J. C., Waters, N. C., O'Connor, T. D., & Takala-Harrison, S. (2023). Strong Positive Selection Biases Identity-By-Descent-Based Inferences of Recent Demography and Population Structure in Plasmodium falciparum. bioRxiv : the preprint server for biology, 2023.07.14.549114. https://doi.org/10.1101/2023.07.14.549114