diazlab / CONICS

CONICS: COpy-Number analysis In single-Cell RNA-Sequencing
73 stars 28 forks source link

CONICS

CONICS: COpy-Number analysis In single-Cell RNA-Sequencing

CONICS works with either full transcript (e.g. Fluidigm C1) or 5'/3' tagged (e.g. 10X Genomics) data!

The CONICS paper has been accepted for publication in Bioinformatics. Check it out here !

Table of contents

CONICSmat - Identifying CNVs from scRNA-seq using a count table

CONICSmat is an R package that can be used to identify CNVs in single cell RNA-seq data from a gene expression table, without the need of an explicit normal control dataset. CONICSmat works with either full transcript (e.g. Fluidigm C1) or 5'/3' tagged (e.g. 10X Genomics) data. A tutorial on how to use CONICSmat, and a Smart-Seq2 dataset, can be found on the CONICSmat Wiki page [CLICK here].

overview Visualizations of scRNA-seq data from Oligodendroglioma (Tirosh et al., 2016) generated with CONICSmat.



CONICS - Identifying CNVs from scRNA-seq with alignment files

Requirements

Config file

Adjust CONICS.cfg to customize the following:

Test data

A set of tumor cells from three glioblastoma patients and a normal brain control, as well as a file with genomic coordinate of large-scale CNVs are available here.

Running

  bash run_CONICS.sh [directory for tumor] [directory for normal] [.bed file for CNV segments] [base name]

Output

All output files will be located in the directory _output[base name]__.

  1. incidenceMatrix.csv: matrix of presence/absence for all CNVs, in individual cells
  2. Read-count distribution in CNV segments. (violin plot)
  3. Hierarchical clustering of the single cells by CNV status.

violin dendrogram

Integrating estimates of point-mutation minor-allele frequencies

Regions of copy-number alteration will show a drop in the frequency of reads quantifying the minor allele. Averaged over large regions of copy-number alteration, this provides an additional metric to increase confidence in single-cell CNV-calls.

Requirements

Config file

Adjust CONICS.cfg to customize the following:

Running

  bash run_BAf_analysis.sh [directory for tumor] [VCF file for normal exome-seq] [VCF file for tumor exome-seq] [BED file for CNV segments] [base name]

Output

All output files will be located in the directory _output[base name]__

  1. _germline-snvs.bed: BED file containing position and BAFs from Exome-seq, generated in step 1.
  2. _af.txt: TAB separated table containing the counts for the A allele at each locus in each cell, generated in step 2
  3. _bf.txt: TAB separated table containing the counts for the B allele at each locus in each cell, generated in step 2
  4. baf_hist.pdf Hierarchical clustering of the average B allele frequency in each of the loci altered by copy number for each cell, generated in step 3

heatmap

Phylogenetic tree contruction

CONICS can generate a phylogenetic tree from the CNV incidence matrix, using the Fitch-Margoliash algorithm. Other phylogenetic reconstruction algorithms can be applied, using the incidence matrix as a starting point.

Requirements

Config file

Adjust Tree.cfg to change the following.

Running

Output

cluster.pdf (phylogenetic trees) and cluster.txt will be generated in the output directory. Each leaf corresponds to a clusters of cells with a common genotype. Cluster assignments for each cell will be in cluster.txt.

tree

cluster_1  D12,E10,F9,G3,A12,C8,C9,A3,A5,A6,C3,C2,C1,C7,H12,C4,D8,D9,A9,E4,E7,E3,F1,E1,B5,B7,E9,B3,D7,D1
cluster_2  E8,G7,G9,A7,G2,B6,E2
cluster_3  H3,A2,A4,H8,G11,F2,F3,H1,H7
cluster_4  A10,B2
cluster_5  C5
cluster_6  F8,B1

Intra-clone co-expression networks

CONICS can construct the local co-expression network of a given gene, based on correlations across single cells.

Requirements

Config file

Adjust CorrelationNetwork.cfg to configure the following:

Running

  bash run_CorrelationNetwork.sh [input matrix] [centered gene] [base name]

Output

All the output files will be located in output.

  1. __[correlstionthreshold][gene_name].txt__ : co-expression network
  2. __[gene_name]corMat.rd__: Rdata containing the adjusted correlation matrix
  3. topCorrelations.pdf: bar graph of top correlations.

CXnet

Assessing the correlation of CNV status with single-cell gene-expression

Requirements

Config file

Adjust CompareExomeSeq_vs_ScRNAseq.cfg to set the following:

Running

  bash run_compareExomeSeq_vs_ScRNAseq.sh [matrix for read counts] [base name for output file]

DDX11L1 1 11874 538 199 5 0 WASH7P 1 14362 4263 6541 223 45

  * __[base name]__ : base name for output directory

### Output
__Compare_[window_size].pdf__ (Box plot) will be generated in the output directory. 
![compare](images/Compare_200.jpg?raw=true "compare" )

## <a id="10x"></a> False discovery rate estimation: Cross validation
CONICS can estimate false discovery rate via 10-fold cross-validation, using the user-supplied control scRNA-seq dataset. For example, in the manuscript cross validation was performed using [normal brain controls](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67835).

### Requirements
  * [beanplot](https://www.jstatsoft.org/article/view/v028c01)(R package)
  * [samtools](http://www.htslib.org)
  * [bedtools](http://bedtools.readthedocs.io/en/latest)  

### Config file
Adjust __10X_cross_validation.cfg__ to set the following:
  * Paths to python/samtools/bedtools/Rscript
  * Thresholds for mapping-quality and read-count.
  * FDR for CNV calling

### Running

bash run_10X_cross_validation.sh [directory for control scRNA-seq] [.bed file containing CNV segments] [base name]

  * __[directory for test]__: path to directory containing the aligned BAM files of the scRNA-seq control data.

  * __[.bed file for CNV segments], [base name]__ : same as described in run_CONICS.sh
;
### Output
Box plot of 10 FDRs resulting from each pooled sample would be generated (__boxplot.pdf__) in the output directory.
![10X](images/10X_boxplot.jpg?raw=true "10Xval_Test" )

## <a id="Empirical"></a> False discovery rate estimation: Empirical testing
FDRs can also be estimated by empirical testing. In the manuscript, the number of false positive CNV calls was calculated using a non-malignant [fetal brain dataset](http://dx.doi.org/10.1016/j.cell.2015.09.004). These data are independent from the [training set](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67835)

### Requirements
  * [beanplot](https://www.jstatsoft.org/article/view/v028c01)
  * [samtools](http://www.htslib.org)
  * [bedtools](http://bedtools.readthedocs.io/en/latest)

### Config file
Adjust __Empirical_validation.cfg__ to change the following:
  * Paths to python/samtools/bedtools/Rscript
  * Thresholds for mapping-quality and read count
  * FDR for CNV calling

### Running

bash run_empirical_validation.sh [directory for train] [directory for test] [.bed files for CNV segments] [base name]



  * __[directory for train]__: path to directory containing aligned bam files of scRNA-seq data used as a control to call CNVs

  * __[directory for test]__: path to directory containing aligned bam files of scRNA-seq data known not to have CNVs, used as a gold standard.

  * __[BED file for CNV segments]__ : same as described in run_CONICS.sh

### Output
Box plot of FDRs will be generated (__boxplot.pdf__) in the output directory.
![empirical](images/Empirical_boxplot.jpg?raw=true "empirical_Test" )