ToolsVanBox / PTATO

PTa Analysis TOolkit
MIT License
9 stars 3 forks source link

PTATO

DOI

DOI

The PTA Analysis TOolbox (PTATO) is a comprehensive pipeline designed to filter somatic single base substitutions (SBS), small insertions and deletions (indels) and structural variants (SVs) from PTA-based single-cell whole genome sequencing (WGS) data. More information about the pipeline can be found in the manuscript. Please cite the manuscript if you use PTATO.

Dependencies

PTATO is implemented in Nextflow and required installation of the following dependencies:

R libraries

Easy Installation (Linux) (recommended)

Download Singularity image SIF (~4gb) Required singularity/Apptainer v1.1.3

# 1. Pull singularity image from Docker bootstrap
singularity pull ptato_1.2.0.sif docker://vanboxtelbioinformatics/ptato:1.2.0

# 2. Clone PTATO repository 
git clone git@github.com:ToolsVanBox/PTATO.git

# 3. Run with singularity exec
singularity exec ptato_1.2.0.sif /ptato/nextflow/nextflow run \
[PTATO_dir]/ptato.nf \
-c [PTATO_dir]configs/run_template.config \
-profile slurm -resume

Installing & Setup

  1. Install Nextflow
  2. Install Singularity
  3. Install R and the required R libraries
  4. Pull/Clone PTATO
  5. Download and extract the required resource files

Pull or Clone

git clone git@github.com:ToolsVanBox/PTATO.git

Demo dataset

A test dataset containing all required input files (e.g. BAM, VCF and config files) to run PTATO is available for download here. This demo dataset contains bulk whole genome sequencing (WGS) data of a clonal cell line and a PTA dataset from a single cell derived from this clone. The bulk WGS data can be used as germline control sample.

Resources

Most required resource files (for the hg38 reference genome) are already included in the PTATO repository. Only the reference genome and the SHAPEIT resources need to be downloaded seperately. First extract the following resources files:

Reference genome

Please download the reference genome fasta file. Must have the following files:

Recommended files are, otherwise they will be created by the pipeline:

And put them in this folder [PTATO_dir]/resources/hg38/

SHAPEIT resources

Download the following SHAPEIT resource files:

Unzip the genetic_maps.b38.tar.gz by using the following command

tar -zxf genetic_maps.b38.tar.gz

And put them in this folder, respectively:

Running the workflow

To run the PTATO workflow, the following steps have to be performed:

  1. Collect input files
  2. Change the run-template.config
  3. Start the pipeline

1. Collect input files

PTATO requires the following input files:

  1. BAM files
  2. VCF file (multi-sample) generated by GATK

Optionally, to generate basic quality control plots, PTATO also makes use of the following files:

  1. wgs_metrics.txt
  2. multiple_metrics.alignment_summary_metrics
    • If you do not have the wgs_metrics and alignment_summary_metrics files, change QC = true to QC = false in the run.config file.

1. BAM files

2. Multisample VCF file generated by GATK

3. WGS Metrics files (Optional)

4. Alignment Summary Metrics (Optional)

Input folder structure

Currently, PTATO requires a strict structure of input directories (eg the bam files should be placed in a subdirectory with the name of the individual/donor/patient). It is possible to use links to the original files (bam/vcf), as long as these links are in the appropriate folder structure. The paths to these directories should be included in the run.config file (see below). The input files listed above should be structured as follows:

/path/to/vcfs_dir
  ./Donor_1
    ./myfile.vcf(.gz)
/path/to/bams_dir
  ./Donor_1
    ./mycontrol.bam
    ./mysample1.bam
    ./mysample2.bam
    ...
/path/to/wgs_metrics
  ./Donor_1
    ./wgs_metrics1.txt
    ./wgs_metrics2.txt
    ...
/path/to/alignment_summary_metrics
  ./Donor_1
    ./alignment_summary_metrics1
    ./alignment_summary_metrics2
    ...

2. Change the run-template.config

PTATO uses 4 different config files (templates provided in the [PTATO_dir]/configs/ directory):

  1. run.config
  2. process.config
  3. nextflow.config
  4. resources.config

The run.config needs to be adjusted for each new PTATO run. The process.config may have to be changed if necessary. The nextflow.config has to be changed once (and can be reused for later runs), to tailor the settings specific for your compute cluster.

1. run.config

The run.config contains the paths to the input files and therefore needs to be adapted for each PTATO run.

The first three lines of the config file should contain the paths to the other three config files, as follows:

includeConfig "${projectDir}/configs/process.config"
includeConfig "${projectDir}/configs/nextflow.config"
includeConfig "${projectDir}/configs/resources.config"

Change the paths if you use different versions of the config files, stored at different locations (for example if you have a separate process.config file for a specific PTATO run

All of the parameters in the params section can also be supplied on the commandline or can be pre-filled in the run.config file:

includeConfig "${projectDir}/configs/process.config"
includeConfig "${projectDir}/configs/nextflow.config"
includeConfig "${projectDir}/configs/resources.config"

params {

  run {
    snvs = true
    QC = true
    svs = false
    indels = true
    cnvs = false
  }

  // TRAINING
  train {
    version = '2.0.0'
  }
  pta_vcfs_dir = ''
  nopta_vcfs_dir = ''
  // END TRAINING

  // TESTING
  input_vcfs_dir = ''
  bams_dir = ''
  // END TESTING

  out_dir = ''
  bulk_names = [
    ['donor_id', 'sample_id'],
  ]

  snvs {
    rf_rds = "${projectDir}/resources/hg38/snvs/randomforest/randomforest_v1.0.0.rds"
  }

  indels {
    rf_rds = ''
    excludeindellist = "${projectDir}/resources/hg38/indels/excludeindellist/PTA_Indel_ExcludeIndellist_normNoGTrenamed.vcf.gz"
  }
  optional {

    germline_vcfs_dir = ''

    short_variants {
      somatic_vcfs_dir = ''
      walker_vcfs_dir = ''
      phased_vcfs_dir = ''
      ab_tables_dir = ''
      context_beds_dir = ''
      features_beds_dir = ''
    }

    snvs {
      rf_tables_dir = ''
      ptato_vcfs_dir = ''
    }

    indels {
      rf_tables_dir = ''
      ptato_vcfs_dir = ''
    }

    qc {
      wgs_metrics_dir = ''
      alignment_summary_metrics_dir = ''
    }

    svs {
      gridss_driver_vcfs_dir = ''
      gridss_unfiltered_vcfs_dir = ''
      gripss_somatic_filtered_vcfs_dir = ''
      gripss_filtered_files_dir = ''
      integrated_sv_files_dir = ''
    }

    cnvs {
      cobalt_ratio_tsv_dir = ''
      cobalt_filtered_readcounts_dir = ''
      baf_filtered_files_dir = ''
    }
  }

}
- Under header "bulk_names" you have to specify the name of the individual/donor/patient and the sample_id of the germline control sample. Mutations in this control sample are used to determine which variants are germline or somatic. Mutations in the control sample are excluded from the somatic variants. Multiple control samples can be specified by adding an additional row:

bulk_names = [ ['donor_id', 'control_sample1'], ['donor_id', 'control_sample2'], ]

- If you would like to run the sequencing QC, you have to specify the paths to the directories containing the wgs_metrics and alignment_summary_metrics files here:
qc {
  wgs_metrics_dir = '/path/to/wgs_metrics_dir'
  alignment_summary_metrics_dir = '/path/to/alignment_summary_metrics_dir'
}
- All other fields (that are empty in the example run.config file) are optional and can be left empty. These files will be generated by PTATO. If you would like to rerun parts of PTATO later, you can specify the files that were previously generated by PTATO. The old files will then be re-used, which saves time and resources.

#### 2. process.config
The process.config contains the general settings for each type of job. Here you can for example change the time and memory that are reserved for each job. This likely requires some tweaking (and trial and error) for your specific compute cluster setup. The required time and memory also depend on the number of samples you would like to include in your PTATO run. For example, if you would like to run PTATO on 10+ samples, you would likely need to increase the time for the somatic variant filtering (eg change `params.smurf.time = '4h'` to `params.smurf.time = '12h'`).

#### 3. nextflow.config
The nextflow.config has to be changed once to specify the base directory and cache dir for your cluster. Specifically, only this part needs to be changed:

singularity { enabled = true autoMounts = true runOptions = '-B /hpc -B $TMPDIR:$TMPDIR' cacheDir = '/hpc/local/CentOS7/pmc_vanboxtel/singularity_cache' }

- Change `/hpc` in `runOptions` to the base directory of your cluster
- Change the path in `cacheDir` to a cache directory on your cluster

### 3. Start the pipeline
Once you have collected all the input files and changed the required config files you can start the PTATO pipeline.
To start the pipeline on a Slurm workload manager:

/path/to/nextflow run /path/to/ptato.nf -c /path/to/run.config --out_dir /path/to/output_directory -profile slurm -resume



## Acknowledgements and References
Also see the references in the [manuscript](https://www.biorxiv.org/content/10.1101/2023.02.15.528636v1). PTATO includes the following external software:
- [GRIDSS2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02423-x) and [GRIPSS](https://github.com/hartwigmedical/hmftools/blob/master/gripss/README.md): Cameron, D.L., Baber, J., Shale, C., Valle-Inclan, J.E., Besselink, N., van Hoeck, A., Janssen, R., Cuppen, E., Priestley, P., and Papenfuss, A.T. (2021). GRIDSS2: comprehensive characterisation of somatic structural variation using single breakend variants and structural variant phasing. Genome Biol 22, 1–25. 10.1186/s13059-021-02423-x
- [SHAPEIT4](https://www.nature.com/articles/s41467-019-13225-y): Delaneau, O., Zagury, J.-F., Robinson, M.R., Marchini, J.L., and Dermitzakis, E.T. (2019). Accurate, scalable and integrative haplotype estimation. Nat Commun 10, 5436. 10.1038/s41467-019-13225-y
- [COBALT](https://github.com/hartwigmedical/hmftools/blob/master/cobalt/README.md): Priestley, P., Baber, J., Lolkema, M.P., Steeghs, N., de Bruijn, E., Shale, C., Duyvesteyn, K., Haidari, S., van Hoeck, A., Onstenk, W., et al. (2019). Pan-cancer whole-genome analyses of metastatic solid tumours. Nature 575, 210–216. 10.1038/s41586-019-1689-y)
- [CIRCOS](https://circos.co/): Krzywinski, M. et al. Circos: an Information Aesthetic for Comparative Genomics. Genome Res (2009) 19:1639-1645

## Known issues and future development
- Currently only tested for the hg38 reference genome. Can in principle be run for other reference genomes as well, as long as the required input files are available (eg ShapeIt maps etc.)
- Currently only tested for slurm
- More documentation and code how to analyze/interpret PTATO output files will be added