The PTA Analysis TOolbox (PTATO) is a comprehensive pipeline designed to filter somatic single base substitutions (SBS), small insertions and deletions (indels) and structural variants (SVs) from PTA-based single-cell whole genome sequencing (WGS) data. More information about the pipeline can be found in the manuscript. Please cite the manuscript if you use PTATO.
PTATO is implemented in Nextflow and requires installation of the following dependencies:
Download the Singularity image (SIF, ~4 GB). Requires Singularity/Apptainer v1.1.3.
# 1. Pull singularity image from Docker bootstrap
singularity pull ptato_1.2.0.sif docker://vanboxtelbioinformatics/ptato:1.2.0
# 2. Clone PTATO repository
git clone git@github.com:ToolsVanBox/PTATO.git
# 3. Run with singularity exec
singularity exec ptato_1.2.0.sif /ptato/nextflow/nextflow run \
[PTATO_dir]/ptato.nf \
-c [PTATO_dir]/configs/run_template.config \
-profile slurm -resume
A test dataset containing all required input files (e.g. BAM, VCF and config files) to run PTATO is available for download here. This demo dataset contains bulk whole genome sequencing (WGS) data of a clonal cell line and a PTA dataset from a single cell derived from this clone. The bulk WGS data can be used as germline control sample.
Most required resource files (for the hg38 reference genome) are already included in the PTATO repository. Only the reference genome and the SHAPEIT resources need to be downloaded separately. First extract the following resource files:
[PTATO_dir]/resources/hg38/gripss/gridss_pon_breakpoint.tar.gz
[PTATO_dir]/resources/hg38/cobalt/COBALT_PTA_Normalized_Full.tar.gz
[PTATO_dir]/resources/hg38/smurf/Mutational_blacklists/Fetal_15x_raw_variants_hg38.tar.gz
[PTATO_dir]/resources/hg38/smurf/Mutational_blacklists/MSC_healthyBM_raw_variants_hg38.tar.gz
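The extraction of the four archives listed above can be scripted. The following is a minimal sketch, not part of PTATO itself. It assumes that each archive should be unpacked into the directory that contains it (the text above does not state the target location), and the function name `extract_ptato_resources` is hypothetical.

```shell
# Sketch: unpack the bundled hg38 resource archives in place.
# Assumption: each archive is extracted into its own parent directory.
# Usage: extract_ptato_resources /path/to/PTATO
extract_ptato_resources() {
  local ptato_dir="$1" rel archive status=0
  for rel in \
    resources/hg38/gripss/gridss_pon_breakpoint.tar.gz \
    resources/hg38/cobalt/COBALT_PTA_Normalized_Full.tar.gz \
    resources/hg38/smurf/Mutational_blacklists/Fetal_15x_raw_variants_hg38.tar.gz \
    resources/hg38/smurf/Mutational_blacklists/MSC_healthyBM_raw_variants_hg38.tar.gz
  do
    archive="${ptato_dir}/${rel}"
    if [ ! -f "${archive}" ]; then
      # Report missing archives but keep going with the others
      echo "missing archive: ${archive}" >&2
      status=1
    elif tar -zxf "${archive}" -C "$(dirname "${archive}")"; then
      echo "extracted: ${rel}"
    else
      status=1
    fi
  done
  # Non-zero if any archive was missing or failed to extract
  return "${status}"
}
```

Call it with the path to your clone, for example `extract_ptato_resources /path/to/PTATO`. A non-zero exit status indicates that at least one archive was missing or could not be extracted.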
Please download the reference genome FASTA file. The following files are required:
The following files are recommended; if they are missing, the pipeline will create them:
Place them in the folder [PTATO_dir]/resources/hg38/
Download the following SHAPEIT resource files:
Extract genetic_maps.b38.tar.gz with the following command:
tar -zxf genetic_maps.b38.tar.gz
Then place the files in the following folders, respectively:
[PTATO_dir]/resources/hg38/shapeit/Phasing_reference/
[PTATO_dir]/resources/hg38/shapeit/shapeit_maps/
To run the PTATO workflow, the following steps have to be performed:
PTATO requires the following input files:
Optionally, to generate basic quality control plots, PTATO also makes use of the following files (to skip QC, change `QC = true` to `QC = false` in the run.config file):
- `wgs_metrics.txt` files to get an overview of the genome coverage in the input samples. Each sample requires its own separate txt file.
- `alignment_summary_metrics` files to generate an overview of the number of sequencing reads in the input samples. Each sample requires its own separate summary file.

Currently, PTATO requires a strict structure of input directories (e.g. the BAM files should be placed in a subdirectory named after the individual/donor/patient). It is possible to use links to the original files (BAM/VCF), as long as these links are in the appropriate folder structure. The paths to these directories should be included in the `run.config` file (see below). The input files listed above should be structured as follows:
/path/to/vcfs_dir
  ./Donor_1
    ./myfile.vcf(.gz)
/path/to/bams_dir
  ./Donor_1
    ./mycontrol.bam
    ./mysample1.bam
    ./mysample2.bam
    ...
/path/to/wgs_metrics
  ./Donor_1
    ./wgs_metrics1.txt
    ./wgs_metrics2.txt
    ...
/path/to/alignment_summary_metrics
  ./Donor_1
    ./alignment_summary_metrics1
    ./alignment_summary_metrics2
    ...
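Since links to the original files are allowed, the required layout can be built without copying data. The following is a minimal sketch; the function name `link_into_donor_dir` is hypothetical and not part of PTATO. It creates `<base_dir>/<donor>/` and places a symbolic link to each given file inside it.

```shell
# Sketch: link existing files into <base_dir>/<donor>/ without copying them.
# Usage: link_into_donor_dir BASE_DIR DONOR FILE...
link_into_donor_dir() {
  local base_dir="$1" donor="$2" src abs
  shift 2
  mkdir -p "${base_dir}/${donor}"
  for src in "$@"; do
    if [ ! -e "${src}" ]; then
      echo "skipping missing file: ${src}" >&2
      continue
    fi
    # Use an absolute target so the link stays valid from any working directory
    abs="$(cd "$(dirname "${src}")" && pwd)/$(basename "${src}")"
    ln -sf "${abs}" "${base_dir}/${donor}/$(basename "${src}")"
  done
  return 0
}
```

For example, `link_into_donor_dir /path/to/bams_dir Donor_1 /data/mycontrol.bam /data/mysample1.bam` (the `/data` paths are placeholders). Index files that accompany the BAM or VCF files, if any, can be passed as additional arguments.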
PTATO uses 4 different config files (templates provided in the `[PTATO_dir]/configs/` directory):
- `run.config`
- `process.config`
- `nextflow.config`
- `resources.config`

The `run.config` needs to be adjusted for each new PTATO run. The `process.config` may need to be changed. The `nextflow.config` has to be changed once (and can be reused for later runs) to tailor the settings to your compute cluster.
#### 1. run.config
The run.config contains the paths to the input files and therefore needs to be adapted for each PTATO run.
The first three lines of the config file should contain the paths to the other three config files, as follows:
includeConfig "${projectDir}/configs/process.config"
includeConfig "${projectDir}/configs/nextflow.config"
includeConfig "${projectDir}/configs/resources.config"
Change the paths if you use different versions of the config files, stored at different locations (for example if you have a separate process.config file for a specific PTATO run).
All of the parameters in the params section can either be supplied on the command line or be pre-filled in the run.config file:
includeConfig "${projectDir}/configs/process.config"
includeConfig "${projectDir}/configs/nextflow.config"
includeConfig "${projectDir}/configs/resources.config"
params {
  run {
    snvs = true
    QC = true
    svs = false
    indels = true
    cnvs = false
  }
  // TRAINING
  train {
    version = '2.0.0'
  }
  pta_vcfs_dir = ''
  nopta_vcfs_dir = ''
  // END TRAINING
  // TESTING
  input_vcfs_dir = ''
  bams_dir = ''
  // END TESTING
  out_dir = ''
  bulk_names = [
    ['donor_id', 'sample_id'],
  ]
  snvs {
    rf_rds = "${projectDir}/resources/hg38/snvs/randomforest/randomforest_v1.0.0.rds"
  }
  indels {
    rf_rds = ''
    excludeindellist = "${projectDir}/resources/hg38/indels/excludeindellist/PTA_Indel_ExcludeIndellist_normNoGTrenamed.vcf.gz"
  }
  optional {
    germline_vcfs_dir = ''
    short_variants {
      somatic_vcfs_dir = ''
      walker_vcfs_dir = ''
      phased_vcfs_dir = ''
      ab_tables_dir = ''
      context_beds_dir = ''
      features_beds_dir = ''
    }
    snvs {
      rf_tables_dir = ''
      ptato_vcfs_dir = ''
    }
    indels {
      rf_tables_dir = ''
      ptato_vcfs_dir = ''
    }
    qc {
      wgs_metrics_dir = ''
      alignment_summary_metrics_dir = ''
    }
    svs {
      gridss_driver_vcfs_dir = ''
      gridss_unfiltered_vcfs_dir = ''
      gripss_somatic_filtered_vcfs_dir = ''
      gripss_filtered_files_dir = ''
      integrated_sv_files_dir = ''
    }
    cnvs {
      cobalt_ratio_tsv_dir = ''
      cobalt_filtered_readcounts_dir = ''
      baf_filtered_files_dir = ''
    }
  }
}
- Under `run { }` you can specify which parts of PTATO you would like to run (set to `= true`). For example, if you don't want to run SV calling and filtering, you can set `svs = false`. Note: the `snvs = true` and `cnvs = true` parts of PTATO are required to run the `svs = true` part.
- Under `// TESTING` you have to specify the paths to the input directories containing the VCF file (`input_vcfs_dir = '/path/to/vcf/'`) and the BAM files (`bams_dir = '/path/to/bams/'`). Please note that the name of the individual/donor should not be included in the path (e.g. NOT: `'/path/to/vcfs_dir/donor/'`).
// TESTING
input_vcfs_dir = '/path/to/vcfs_dir/'
bams_dir = '/path/to/bams_dir/'
// END TESTING
- Under header "bulk_names" you have to specify the name of the individual/donor/patient and the sample_id of the germline control sample. Mutations in this control sample are used to determine which variants are germline or somatic. Mutations in the control sample are excluded from the somatic variants. Multiple control samples can be specified by adding an additional row:
bulk_names = [
  ['donor_id', 'control_sample1'],
  ['donor_id', 'control_sample2'],
]
- If you would like to run the sequencing QC, you have to specify the paths to the directories containing the wgs_metrics and alignment_summary_metrics files here:
qc {
wgs_metrics_dir = '/path/to/wgs_metrics_dir'
alignment_summary_metrics_dir = '/path/to/alignment_summary_metrics_dir'
}
- All other fields (that are empty in the example run.config file) are optional and can be left empty. These files will be generated by PTATO. If you would like to rerun parts of PTATO later, you can specify the files that were previously generated by PTATO. The old files will then be re-used, which saves time and resources.
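Before starting a run, it can help to check that the directories given as `input_vcfs_dir` and `bams_dir` follow the expected layout (one subdirectory per donor, and no donor name in the configured path). The following is a minimal sketch of such a check; the function names `has_files` and `check_ptato_inputs` are hypothetical, and the check is not something PTATO performs or requires.

```shell
# has_files DIR PATTERN... : succeeds if DIR contains an entry matching any pattern
has_files() {
  local dir="$1" pattern f
  shift
  for pattern in "$@"; do
    # The pattern is left unquoted on purpose so the shell expands the glob
    for f in "${dir}"/${pattern}; do
      [ -e "${f}" ] && return 0
    done
  done
  return 1
}

# check_ptato_inputs VCFS_DIR BAMS_DIR : verify the per-donor directory layout
check_ptato_inputs() {
  local vcfs_dir="${1%/}" bams_dir="${2%/}" donor_dir donor n=0 status=0
  # VCF files directly in VCFS_DIR suggest a donor directory was given by mistake
  if has_files "${vcfs_dir}" '*.vcf' '*.vcf.gz'; then
    echo "error: VCF files found directly in ${vcfs_dir}; use its parent directory" >&2
    return 1
  fi
  for donor_dir in "${vcfs_dir}"/*/; do
    [ -d "${donor_dir}" ] || continue
    n=$((n + 1))
    donor="$(basename "${donor_dir}")"
    if ! has_files "${donor_dir%/}" '*.vcf' '*.vcf.gz'; then
      echo "error: no VCF file for donor ${donor} in ${vcfs_dir}" >&2
      status=1
    fi
    if ! has_files "${bams_dir}/${donor}" '*.bam'; then
      echo "error: no BAM files for donor ${donor} in ${bams_dir}" >&2
      status=1
    fi
  done
  if [ "${n}" -eq 0 ]; then
    echo "error: no donor subdirectories found in ${vcfs_dir}" >&2
    status=1
  fi
  return "${status}"
}
```

For example, `check_ptato_inputs /path/to/vcfs_dir/ /path/to/bams_dir/` returns a non-zero status and prints a message for each donor whose VCF or BAM files are missing.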
#### 2. process.config
The process.config contains the general settings for each type of job. Here you can for example change the time and memory that are reserved for each job. This likely requires some tweaking (and trial and error) for your specific compute cluster setup. The required time and memory also depend on the number of samples you would like to include in your PTATO run. For example, if you would like to run PTATO on 10+ samples, you would likely need to increase the time for the somatic variant filtering (e.g. change `params.smurf.time = '4h'` to `params.smurf.time = '12h'`).
#### 3. nextflow.config
The nextflow.config has to be changed once to specify the base directory and cache dir for your cluster. Specifically, only this part needs to be changed:
singularity {
  enabled = true
  autoMounts = true
  runOptions = '-B /hpc -B $TMPDIR:$TMPDIR'
  cacheDir = '/hpc/local/CentOS7/pmc_vanboxtel/singularity_cache'
}
- Change `/hpc` in `runOptions` to the base directory of your cluster
- Change the path in `cacheDir` to a cache directory on your cluster
### 3. Start the pipeline
Once you have collected all the input files and changed the required config files you can start the PTATO pipeline.
To start the pipeline on a Slurm workload manager:
/path/to/nextflow run /path/to/ptato.nf -c /path/to/run.config --out_dir /path/to/output_directory -profile slurm -resume
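Because the parameters in the `params` section can also be supplied on the command line, the input paths can be passed at launch instead of being pre-filled in the run.config. The following is a minimal sketch of a launcher that does this. The function name `run_ptato` and the `DRY_RUN` variable are hypothetical. The `--input_vcfs_dir` and `--bams_dir` options rely on Nextflow mapping `--name value` to `params.name`, in the same way as the `--out_dir` option above.

```shell
# Sketch: launch PTATO with the input paths given on the command line.
# Usage: run_ptato NEXTFLOW PTATO_DIR RUN_CONFIG VCFS_DIR BAMS_DIR OUT_DIR
# Set DRY_RUN=1 to print the command instead of running it.
run_ptato() {
  local nextflow="$1" ptato_dir="$2" run_config="$3"
  local vcfs_dir="$4" bams_dir="$5" out_dir="$6"
  # Rebuild the argument list as the full nextflow command
  set -- "${nextflow}" run "${ptato_dir}/ptato.nf" \
    -c "${run_config}" \
    --input_vcfs_dir "${vcfs_dir}" \
    --bams_dir "${bams_dir}" \
    --out_dir "${out_dir}" \
    -profile slurm -resume
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

Running it once with `DRY_RUN=1` prints the assembled command, so the paths can be inspected before the job is submitted.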
## Acknowledgements and References
Also see the references in the [manuscript](https://www.biorxiv.org/content/10.1101/2023.02.15.528636v1). PTATO includes the following external software:
- [GRIDSS2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02423-x) and [GRIPSS](https://github.com/hartwigmedical/hmftools/blob/master/gripss/README.md): Cameron, D.L., Baber, J., Shale, C., Valle-Inclan, J.E., Besselink, N., van Hoeck, A., Janssen, R., Cuppen, E., Priestley, P., and Papenfuss, A.T. (2021). GRIDSS2: comprehensive characterisation of somatic structural variation using single breakend variants and structural variant phasing. Genome Biol 22, 1–25. 10.1186/s13059-021-02423-x
- [SHAPEIT4](https://www.nature.com/articles/s41467-019-13225-y): Delaneau, O., Zagury, J.-F., Robinson, M.R., Marchini, J.L., and Dermitzakis, E.T. (2019). Accurate, scalable and integrative haplotype estimation. Nat Commun 10, 5436. 10.1038/s41467-019-13225-y
- [COBALT](https://github.com/hartwigmedical/hmftools/blob/master/cobalt/README.md): Priestley, P., Baber, J., Lolkema, M.P., Steeghs, N., de Bruijn, E., Shale, C., Duyvesteyn, K., Haidari, S., van Hoeck, A., Onstenk, W., et al. (2019). Pan-cancer whole-genome analyses of metastatic solid tumours. Nature 575, 210–216. 10.1038/s41586-019-1689-y
- [CIRCOS](https://circos.co/): Krzywinski, M. et al. Circos: an Information Aesthetic for Comparative Genomics. Genome Res (2009) 19:1639-1645
## Known issues and future development
- Currently only tested for the hg38 reference genome. PTATO can in principle be run for other reference genomes as well, as long as the required input files are available (e.g. SHAPEIT maps).
- Currently only tested with Slurm.
- More documentation and code on how to analyze and interpret PTATO output files will be added.