JiangYuLab / CNVcaller

GNU General Public License v3.0
41 stars 14 forks source link

CNVcaller

CNVcaller is designed to detect copy number variation using sequencing data from populations.

Overview

CNVcaller is a program for detecting the integrated copy number veriation regions (CNVRs) using population sequencing data. The high-confidence CNVRs are discovered and refined by both individual and population criteria. The result is a VCF format genotype file which can be used in GWAS/QLT research. Based on our validation, CNVcaller can report CNVRs from large populations with more than 1000 individuals within one week on one computational node. It can be applied to complicated genomes such as wheat and pan-genome.

Installation and requirements

Requirements

The following software must be installed on your machine:

Installation

Clone the CNVcaller git:
git clone https://github.com/JiangYuLab/CNVcaller.git
Go to CNVcaller directory
cd CNVcaller

Install blasr

Since blasr has stopped updating, it is highly recommended to create a separate conda virtual environment for installation:

conda create -n blasr blasr=5.3.2
conda activate blasr

Test with example:

To grab sample data and test CNVcaller, please download it from here.

Running the program

CNVcaller contains four steps consisting one perl script, two bash scripts and one python script. You need to set CNVcaller variables in the three bash scripts based on your environment.

1. Indexing Reference Genome

The reference genome is segmented into overlapping sliding windows. The windows are indexed to form a reference database used in all samples. This commend will create the file referenceDB.windowsize in current directory by default.

$ perl CNVReferenceDB.pl <ref>
Required arguments
 <ref> Reference sequence
Optional arguments
 -w the window size (bp) for all samples [default=800]
 -l the lower limit of GC content [default=0.2]
 -u the upper limit of GC content [default=0.7]
 -g the upper limit of gap content [default=0.5]

Three default directories RD_raw RD_absolute RD_normalized will be created in current directory in order, containing the raw read depth, read depth after absolute copy number correction and the final GC corrected normalized read depth of each sample. The name of the normalized RD file indicates the average RD (mean), STDEV of the RD and the gender (1=XX/ZZ, 2=XY/ZW) of this sample. The final read depths are normalized to one.

This step consumes about 500 MB for each individual, multiple tasks can be run in parallel. Shell script Individual.Process.sh is provided to complete these procedures.

$ bash Individual.Process.sh -b <bam> -h <header> -d <dup> -s <sex_chromosome>
Required arguments 
 -b|--bam      alignment file in BAM format
 -h|--header   header of bam file, the prefix of output file
 -d|--dup      duplicated window record file used for absolute copy number correction
 -s|--sex      the name of sex chromosome

3. CNVR detection

The normalized RD files of all samples are piled up into a two-dimensional population RD file. The integrated CNVR are detected by scanning the population RD file with aberrantly RD, CNV allele frequency and significantly correlation with adjacent windows. The adjacent candidate windows showing high correlation will be further merged.

$ bash CNV.Discovery.sh -l <RDFileList> -e <excludedFileList> -f <frequency> -h <homozygous> -r <pearsonCorrelation> -p <primaryCNVR> -m <mergedCNVR>
Required arguments: 
-l|--RDFileList          individual normalized read depth file list
-e|--excludedFileList    list of samples exclude from CNVR detection
-f|--frequency           minimum frequency of gain/loss individuals for candidate CNV window definition
                         [recommend 0.1]
-h|--homozygous          number of homozygous gain/loss individuals for candidate CNV window definition 
                         [recommend 3]
-r|--pearsonCorrelation  minimum of Pearson’s correlation coefficient between the two adjacent non-overlapping windows
                         0.5 for sample size (0, 30]
                         0.4 for sample size (30, 50]
                         0.3 for sample size (50, 100]
                         0.2 for sample size (100, 200]
                         0.15 for sample size (200, 500]
                         0.1 for sample size (500,+∞)
-p|--primaryCNVR         primary CNVR result 
-m|--mergedCNVR          merged CNVR result

4. Genotyping

Clustering the input samples into genotypes uses Gaussian mixture modes. The output contain a genotype VCF - a VCF format file containing the input site descriptions, additional site-specific information and a called genotype for each input sample.

$ python Genotype.py --cnvfile <input> --outprefix <outfile prefix>
Required arguments:
--cnvfile          merged CNVR file
--outprefix        prefix of out files
Optional arguments:
--nproc            number of process will be used, default is one.

Generate your own duplicated window record file

It is highly recommended to generate the corresponding file for each chromosome separately and then concatenate the output files in the correct order.

Step 1: Split genome into short kmer sequences.

$ python 0.1.Kmer_Generate.py [OPTIONS] FAFILE WINSIZE OUTFILE

 Required arguments
 <FAFILE> Reference sequence in FASTA format
 <WINSIZE> The size of the window to use for CNV calling
 <OUTFILE> Output kmer file in FASTA format

Example: python 0.1.Kmer_Generate.py reference.fa 800 kmer.fa

Step 2: Align the kmer FASTA (from step 1) to reference genome using blasr sawriter and in the conda blasr environment.

Create the reference.fa.sa file:

Example: sawriter reference.fasta.sa reference.fasta

Align the kmer FASTA to reference genome:

Example: blasr kmer.fa reference.fa --sa reference.fa.sa --out kmer.aln -m 5 --noSplitSubreads --minMatch 15 --maxMatch 20 --advanceHalf --advanceExactMatches 10 --fastMaxInterval --fastSDP --aggressiveIntervalCut --bestn 10

Step 3: Generate duplicated window record file.

$ python 0.2.Kmer_Link.py [OPTIONS] BLASR WINSIZE OUTFILE

 Required arguments
 <BLASR> blasr results (-m 5 format)
 <WINSIZE> The size of the window to use for CNV calling
 <OUTFILE> Output genome duplicated window record file

Example: python 0.2.Kmer_Link.py kmer.aln 800 window.link

Contact

Any questions, bug reports and suggestions can be posted to Email:
yu.jiang@nwafu.edu.cn

CNVcaller has a paper

https://academic.oup.com/gigascience/advance-article/doi/10.1093/gigascience/gix115/4689116

Citation

Xihong Wang, Zhuqing Zheng, Yudong Cai, Ting Chen, Chao Li, Weiwei Fu, Yu Jiang; CNVcaller: Highly Efficient and Widely Applicable Software for Detecting Copy Number Variations in Large Populations, GigaScience, , gix115, https://doi.org/10.1093/gigascience/gix115