HenrikBengtsson / CostelloPSCNSeq

R package: Parent-specific Copy-number Estimation Pipeline using HT-Seq Data
3 stars 2 forks source link
copy-number ht-seq r

CostelloPSCNSeq: Parent-Specific Copy-Number Estimation Pipeline using HT-Seq Data

Requirements

Required data

This pipeline requires paired tumor-normal data. This is because the allele-specific copy numbers are inferred from the allelic imbalance in the tumor at heterozygous SNPs. The heterozygous SNPs are identified from the allelic signals in the matched normal.

Required software

This pipeline is implemented in R and requires R packages aroma.seq, sequenza (Favero et al. 2015), and PSCBS (Bengtsson et al. 2010, Olshen et al. 2011). To install these packages and all of their dependencies to the current working directory, call the following from R:

if (!requireNamespace("pak")) install.packages("pak")
pak:::proj_create()
pak::pkg_install("sequenza@2.1.2")
pak::pkg_install("HenrikBengtsson/aroma.seq")
pak::pkg_install("HenrikBengtsson/CostelloPSCNSeq@develop")

In addition to the above R dependencies, the pipeline requires that samtools (Li et al. 2009) is on the PATH.

Important: The pipeline does not work with sequenza (>= 3.0.0; 2019-05-09). This is the reason why we install sequenza 2.1.2. The use of pak:::proj_create() causes all packages to be installed to ./r-packages/, which avoids clashing the R package library that is typically installed under ~/R/.

Setup (once)

  1. Run Rscript 0.setup.R once. This will setup links to shared annotation data sets and lab data files on the TIPCC compute cluster.

  2. Make sure ./config.yml is correct. It specify the default analysis settings. The individual entries can be overridden by individual command-line options to the below Rscript calls.

  3. Configure parallel processing following the instructions in Section 'Configure parallel processing' below.

Data processing

The following scripts should be run in order:

Rscript -e CostelloPSCNSeq::pscnseq --args --what=mpileup  --verbose=TRUE  # ~25 min
Rscript -e CostelloPSCNSeq::pscnseq --args --what=sequenza --verbose=TRUE  # ~60 min
Rscript -e CostelloPSCNSeq::pscnseq --args --what=pscbs    --verbose=TRUE  #  ~5 min
Rscript -e CostelloPSCNSeq::pscnseq --args --what=reports  --verbose=TRUE  #  ~2 min

Comment: The timings listed are typical run times for our test tumor-normal sample.

You may want to adjust inst/config.yml to process other data sets. Alternatively, you can specify another file that this default via command-line option --config, e.g. Rscript -e "CostelloPSCNSeq::pscnseq(what='mpileup', verbose=TRUE)" --config=config_set_a.yml.

Data processing via scheduler

To process the above four steps via the Torque/PBS scheduler, use:

$ qsub -d $(pwd) inst/scripts/1-4.submit_all.pbs

This will in turn submit the corresponding PBS scripts 1.mpileup.pbs, 2.sequenza.pbs, 3.pscbs.pbs, and 4.reports.pbs to the scheduler. Those PBS scripts are located in inst/scripts/ and "freeze" software versions to R 3.6.1 and samtools 1.3.1.

Configure parallel processing

The pipeline supports both sequential and parallel processing on a large number of backends and compute resources. By default the pipeline is configured to process the data sequentially on the current machine, but this can easily be changed to run in parallel, say, on a compute cluster. In order not to clutter up the analysis scripts, these settings are preferably done in a separate .future.R (loaded automatically by the future framework) in the project root directory.

To process data via a TORQUE / PBS job scheduler using the future.batchtools package, try with the configuration that we use for our UCSF TIPCC cluster;

# Copy to project directory
$ cp inst/future-configs/batchtools/.future.R .

# Copy to project directory
$ cp inst/future-configs/batchtools/batchtools.torque.tmpl .

# Install the future.batchtools package
$ Rscript -e "install.packages('future.batchtools')"

These should be generic enough to also run on other TORQUE / PBS systems. If not, see the batchtools package for how to configure the template file.

You can verify that it works by trying the following in the project directory:

> library("future")
Using future plan:
plan(list(samples = tweak(batchtools_torque, label = "sample", 
    resources = list(vmem = "4gb")), chromosomes = tweak(batchtools_torque, 
    label = "chr", resources = list(vmem = "8gb"))))

This confirms that as soon as the future package is loaded, it will source the .future.R script which in turn will setup the parallel settings. It is .future.R that reports on the future plan used.

Next, we can try to submit a job to the scheduler using these settings by:

> x %<-% Sys.info()[["nodename"]]

In this step, future.batchtools will import the batchtools.torque.tmpl file in that we copied to project working directory. If it fails to locate that file, there will be an error. If it succeeds, a batchtools job will be submitted to the job scheduler - which can be seen when if calling qstat -u $USER in another shell.

Finally, if we try to look at the value of x;

> x
[1] "n17"
> 

it will block until the job is finished and then its value will be printed. Here we see that the job was running on compute node n17.

Help

> help("pscnseq", package = "CostelloPSCNSeq")

pscnseq            package:CostelloPSCNSeq             R Documentation

Calling the Parent-Specific Copy-Number Pipeline Step by Step

Description:

     Calling the Parent-Specific Copy-Number Pipeline Step by Step

Usage:

     pscnseq(
       what = c("mpileup", "sequenza", "pscbs", "reports"),
       dataset = NULL,
       organism = NULL,
       chrs = NULL,
       samples = NULL,
       fasta = NULL,
       gcbase = NULL,
       bam_pattern = NULL,
       binsize = NULL,
       config = "config.yml",
       session_details = !interactive(),
       verbose = TRUE,
       ...
     )

Arguments:

    what: (character) The step to be performed; in order, one of
          '"mpileup"', '"sequenza"', '"pscbs"', or '"reports"'.

 dataset: (character) The name of the dataset as on file.

organism: (character) The name of the organism as on file.

    chrs: (character vector) The name of the chromosomes to be
          processed, e.g. 'c("1", "2", "X")'.

 samples: (character) Pathname to a tab-delimited sample specification
          file, typically named '*.tsv', e.g. 'samples.tsv'.

   fasta: (character) The pathname to the FASTA reference file,
          typically named '*.fa' or '*.fasta', e.g. 'hg19.fa'.

  gcbase: (character) The pathname to the FASTA reference file,
          typically named '*.txt.gz', e.g. 'hg19.gc50Base.txt.gz'.

bam_pattern: (character; optional) Regular expression to identify
          subset of BAM files to be processed.  If NULL (default), then
          BAM files matching .bwa.realigned.rmDups(|.recal)(|.bam)$ are
          included.

 binsize: (integer or numeric) The bin size (in basepairs) used for
          binning reads into bins that then are passed to the
          segmentation method.

  config: (character) Pathname to YAML configuration file. If NULL,
          then the configuration file is skipped.

session_details: (logical) If TRUE, session details are reported before
          starting the processing and after it completed.

 verbose: (logical) If TRUE, then verbose output is produced, otherwise
          not.

     ...: Not used.

Value:

     Returns what the called 'pscnseq_nnn()' function returns, i.e.
     'pscnseq_mpileup()', 'pscnseq_sequenza()', 'pscnseq_pscbs()', or
     'pscnseq_reports()'.

Format of the samples file:

     The 'samples' argument should specify the pathname to a
     TAB-delimited file that provide annotation data for the samples to
     be processed. This file should a row of TAB-delimited column
     headers followed rows of samples with corresponding, TAB-delimited
     cells. The samples file must provide columns 'Patient_ID',
     'Sample_ID', and 'A0'. Any other columns are ignored. This
     pipeline processes tumor-normal pairs.  The pairs processed are
     inferred from (Patient_ID, Sample_ID).  Specifically, for each
     unique 'Patient_ID', the sample entry with 'Sample_ID == "Normal"'
     is used as the normal reference.  There must only be such entry
     per patient. Each patient may have one or more tumor samples,
     which are identified as 'Sample_ID != "Normal"'.

     For example, the below 'samples.tsv' file specifies two
     tumor-normal pairs 'Primary-v1' vs 'Normal' and 'Primary-v2' vs
     'Normal' for one patient named 'Patient123'.  This file specifies
     also fields 'SF', 'Kit', and 'A0', which may be used in other
     pipelines but are all ignored by this pipeline.

     Patient_ID      Sample_ID       SF      Kit     A0
     Patient123      Normal  SF00121N        Xgen Exome Research Panel       X00001
     Patient123      Primary-v1      SF00121-v1      Xgen Exome Research Panel       X00002
     Patient123      Primary-v2      SF00121-v2      Xgen Exome Research Panel       X00003

     This

Configuration File:

     The default arguments can be set in an YAML-formatted
     configuration file as given by argument 'config'.  The default is
     to look for a file named 'config.yml' in the current directory.
     To skip this file, specify 'config = NULL'.  An example of such a
     file is:

     organism: Homo_sapiens
     chromosomes: c(1:22, "X", "Y", "M")
     fasta: annotationData/organisms/Homo_sapiens/GRCh37,hg19/UCSC/hg19.fa
     gcbase: annotationData/organisms/Homo_sapiens/GRCh37,hg19/UCSC/hg19.gc50Base.txt.gz
     dataset: CostelloP_2015-Exome,bwa,realigned,rmDups,recal
     binsize: 100e3
     samples: sampleData/samples.tsv

Specifying arguments via command-line options:

     The arguments can be overridden by command-line options, e.g.
     '--organism=Homo_sapiens' will take precedence of argument
     'organism', which in turn will take precedent of what is specified
     in the configuration file.

How to call pipeline from the command line:

     Below is how you could run the pipeline step by step.  The
     '--args' option tells 'Rscript' that any options following should
     be passed as arguments to this function.

     Rscript -e CostelloPSCNSeq::pscnseq --args --help
     Rscript -e CostelloPSCNSeq::pscnseq --args --what=mpileup   # ~25 min
     Rscript -e CostelloPSCNSeq::pscnseq --args --what=sequenza  # ~60 min
     Rscript -e CostelloPSCNSeq::pscnseq --args --what=pscbs     #  ~5 min
     Rscript -e CostelloPSCNSeq::pscnseq --args --what=reports   #  ~2 min

References

Installation

R package CostelloPSCNSeq is only available via GitHub and can be installed in R as:

remotes::install_github("HenrikBengtsson/Costello-PSCN-Seq", ref="master")

Pre-release version

To install the pre-release version that is available in Git branch develop on GitHub, use:

remotes::install_github("HenrikBengtsson/Costello-PSCN-Seq", ref="develop")

This will install the package from source.

Contributions

This Git repository uses the Git Flow branching model (the git flow extension is useful for this). The develop branch contains the latest contributions and other code that will appear in the next release, and the master branch contains the code of the latest release.

Contributing to this package is easy. Just send a pull request. When you send your PR, make sure develop is the destination branch on the CostelloPSCNSeq repository. Your PR should pass R CMD check --as-cran, which will also be checked by and when the PR is submitted.

We abide to the Code of Conduct of Contributor Covenant.

Software status

Resource GitHub GitHub Actions Travis CI AppVeyor CI
Platforms: Multiple Multiple Linux & macOS Windows
R CMD check
Test coverage