A statistical learning method for simultaneous copy number estimation and subclone clustering with single cell sequencing data.
Fei Qin, Guoshuai Cai, Feifei Xiao
Most CNA detection methods for scDNA-seq were designed to detect CNAs and identify subclones in separate ways, which may generate spurious information (e.g., false positive CNAs) during the procedure of CNA detection and thereafter diminish the accuracy of identifying subpopulations from large complex cell group. To overcome this limitation, we developed a fused lasso mode-based framework, FLCNA, for CNA detection and simultaneous subclone identification in scDNA-seq data. First, procedures including quality control (QC), normalization, logarithm transformation were used for pre-processing of the datasets. Subclone clustering was achieved based on a Gaussian Mixture Model (GMM), and breakpoints detection was conducted by adding a fused lasso penalty term to the typical GMM model. Finally, based on these shared breakpoints in each cluster, candidate CNA segments for each cell were clustered into different CNA states using a GMM-based clustering strategy.
install.packages("devtools")
library(devtools)
install_github("FeifeiXiao-lab/FLCNA")
# The example data have 3,000 markers and 93 cells.
library(FLCNA)
data(KTN126_data_3000)
data(KTN126_ref_3000)
RD <- KTN126_data_3000
dim(RD)
[1] 3000 93
ref <- KTN126_ref_3000
head(ref)
GRanges object with 6 ranges and 2 metadata columns:
seqnames ranges strand | gc mapp
<Rle> <IRanges> <Rle> | <numeric> <numeric>
[1] chr1 800001-900000 * | 56.65 0.920803
[2] chr1 900001-1000000 * | 62.13 0.973863
[3] chr1 1000001-1100000 * | 60.35 0.946335
[4] chr1 1100001-1200000 * | 61.37 0.976173
[5] chr1 1200001-1300000 * | 63.52 0.969312
[6] chr1 1300001-1400000 * | 56.19 0.946105
-------
seqinfo: 24 sequences from hg19 genome
# Quality Control
QCobject <- FLCNA_QC(Y_raw=RD, ref_raw=ref,
mapp_thresh = 0.9,
gc_thresh = c(20, 80))
# Normalization
log2Rdata <- FLCNA_normalization(Y=QCobject$Y, gc=QCobject$ref$gc, map=QCobject$ref$mapp)
# Simultaneous CNA detection and subclone clustering
output_FLCNA <- FLCNA(K=c(3,4,5), lambda=1.5, Y=data.matrix(log2Rdata))
# CNA clustering
CNA.output <- CNA.out(mean.matrix = output_FLCNA$mu.hat.best, LRR=log2Rdata, Clusters=output_FLCNA$s.hat.best, ref=ref, cutoff=0.8, L=100)