The goal of GAC is to deliver a formal end-to-end analysis by integrating proven methods of quantitative genetics, statistics, and evolutionary biology for the genetic analysis of single-cell DNA copy number. GAC implements a simple, lightweight, and open-source R framework (Figure 1). Inspired by, but unlike, Seurat and Scanpy, GAC adapts the logic of ExpressionSet/AnnData into relational matrices in native R. This facilitates the integration of algorithms for the downstream analysis of single-cell DNA data, which is much needed.
GAC facilitates the downstream analysis of segmented data with common segments by concurrently managing the X and Y across all cells or samples, e.g. the output of Varbin/Ginkgo, FACETS, MUMdex, HMMcopy, or SCOPE. Unsegmented bin read counts are not valid input. GAC uses ComplexHeatmap, a powerful heatmap package, to help visualize the data.
To implement GAC we require five easy-to-generate inputs:
Install the prerequisite packages first:
install.packages("devtools")
devtools::install_github("KrasnitzLab/SCclust")
install.packages("BiocManager")
BiocManager::install(c("ComplexHeatmap", "ConsensusClusterPlus"))
You can install the development version from GitHub with:
devtools::install_github("SingerLab/gac")
This is a basic example for drawing a copy number heatmap. For a
comprehensive overview of the package, please follow
getting_started.Rmd in the vignettes/ directory.
library(gac)
## basic example code
data(cnr)
( excl.cells <- rownames(cnr$qc)[cnr$qc$qc.status == "FAIL"] )
#> [1] "cell5" "cell11"
cnr <- excludeCells(cnr, excl = excl.cells)
aH <- HeatmapCNR(cnr, what = 'X', col = segCol, show_heatmap_legend = FALSE)
draw(aH, annotation_legend_list = list(legSeg))
bH <- HeatmapCNR(cnr, what = "genes",
which.genes = c("CDK4", "MDM2"),
col = segCol, show_heatmap_legend = FALSE)
#> Warning: The input is a data frame, convert it to the matrix.
draw(bH, annotation_legend_list = list(legSeg))
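The point of excluding cells through the object, rather than by hand, is that every per-cell table is filtered together. The following is only a base-R sketch of that idea on a toy list. The layout (bins x cells in `X`, one row per cell in `Y` and `qc`) and the helper `drop_cells` are assumptions made for illustration, not the GAC implementation.

```r
# Toy stand-in for a CNR-like bundle; the layout is assumed for illustration.
set.seed(1)
cells <- paste0("cell", 1:6)
toy <- list(
  X  = matrix(sample(1:4, 5 * 6, replace = TRUE), nrow = 5,
              dimnames = list(paste0("bin", 1:5), cells)),  # bins x cells
  Y  = data.frame(cellID = cells, ploidy = 2, row.names = cells),
  qc = data.frame(cellID = cells,
                  qc.status = c("PASS", "PASS", "FAIL", "PASS", "FAIL", "PASS"),
                  row.names = cells)
)

# Hypothetical helper: drop the same cells from every table at once.
drop_cells <- function(obj, excl) {
  keep   <- setdiff(colnames(obj$X), excl)
  obj$X  <- obj$X[, keep, drop = FALSE]
  obj$Y  <- obj$Y[keep, , drop = FALSE]
  obj$qc <- obj$qc[keep, , drop = FALSE]
  obj
}

excl <- rownames(toy$qc)[toy$qc$qc.status == "FAIL"]
toy  <- drop_cells(toy, excl)

# All three tables still describe the same cells, in the same order.
stopifnot(identical(colnames(toy$X), rownames(toy$Y)),
          identical(colnames(toy$X), rownames(toy$qc)))
```

Filtering by name rather than by position keeps the tables aligned even if one of them was sorted differently beforehand.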
This package came out of the need to deliver some results on borrowed
time. During the 11th hour, I saw I was spending 85% of my time
keeping multiple tables synchronized (bins, annotations, genes, and
phenotypes), 10% rendering heatmaps, and 5% looking at results. I
began to think how lucky the people who only work with single-cell
RNAseq are to have tools like Seurat and Scanpy, how simple and
flexible those two tools are, and how nothing available for DNA copy
number manages the copy number matrix as powerfully. I eventually
realized that the main difference is the restriction imposed by the
genome coordinates.
While staring at the AnnData diagram I realized that for copy number
data, the unit is a bin, and the .X should be a matrix of common bins
for all cells. However, to make biological sense of the data,
gene-level resolution is required. Thus, building a synchronized
matrix with genes is of utmost importance. Having an internal
gene-to-bin index (gene.index) gave the flexibility to interpolate
the bin data to gene-level resolution and to integrate it with the
complete set of phenotypes and QC data.
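To make the gene-to-bin idea concrete, here is a minimal base-R sketch. It uses a plain lookup (each gene takes the value of the bin that contains it), which may be simpler than what GAC does internally. The column names `hgnc.symbol` and `bin.id`, the gene names, and the helper `expand_to_genes` are hypothetical.

```r
# Toy bin-level copy number: 4 bins (rows) x 3 cells (columns).
X <- matrix(c(2, 2, 4, 2,
              2, 3, 6, 2,
              1, 2, 2, 2),
            nrow = 4,
            dimnames = list(paste0("bin", 1:4), paste0("cell", 1:3)))

# Hypothetical gene-to-bin index: each gene points at the bin that contains it.
gene_index <- data.frame(
  hgnc.symbol = c("GENE_A", "GENE_B", "GENE_C"),
  bin.id      = c(1, 3, 3),
  stringsAsFactors = FALSE
)

# Look up each gene's bin and copy that bin's values.
# The result is a cells x genes matrix.
expand_to_genes <- function(X, gene_index) {
  G <- t(X[gene_index$bin.id, , drop = FALSE])
  colnames(G) <- gene_index$hgnc.symbol
  G
}

genes <- expand_to_genes(X, gene_index)
genes[, c("GENE_A", "GENE_B")]
```

Because GENE_B and GENE_C share a bin in this toy index, they receive identical copy number in every cell, which is the resolution limit a bin-based index implies.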
A simple tool to manage input and output was greatly needed: one that
would cut the 85% of time spent synchronizing the bin, gene,
phenotype, and QC matrices, and that could handle a large data set of
>24,000 cells. Because the data is growing by the week, functions to
handle the n+1 problem are integrated via addCells.
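As a rough illustration of the n+1 problem, the sketch below appends a new cell to a toy bin matrix and phenotype table while checking that the bins match. It is written in base R under assumed table layouts. The helper `append_cells` is hypothetical and does not reflect the signature or behavior of addCells.

```r
# Existing data: bins x cells matrix plus a per-cell phenotype table.
bins <- paste0("bin", 1:3)
X <- matrix(c(2, 2, 3,
              2, 4, 2), nrow = 3,
            dimnames = list(bins, c("cell1", "cell2")))
Y <- data.frame(cellID = c("cell1", "cell2"),
                tissue = c("tumor", "tumor"),
                row.names = c("cell1", "cell2"),
                stringsAsFactors = FALSE)

# Hypothetical helper: append new cells only if they share the same bins,
# every new cell has a phenotype row, and no cell name is duplicated.
append_cells <- function(X, Y, newX, newY) {
  stopifnot(identical(rownames(X), rownames(newX)),
            setequal(colnames(newX), rownames(newY)),
            !any(colnames(newX) %in% colnames(X)))
  list(X = cbind(X, newX),
       Y = rbind(Y, newY[colnames(newX), , drop = FALSE]))
}

newX <- matrix(c(1, 2, 2), nrow = 3, dimnames = list(bins, "cell3"))
newY <- data.frame(cellID = "cell3", tissue = "normal",
                   row.names = "cell3", stringsAsFactors = FALSE)

out <- append_cells(X, Y, newX, newY)

# The matrix columns and phenotype rows stay in the same order.
stopifnot(identical(colnames(out$X), rownames(out$Y)))
```

Failing early when the bins differ is the important part: new cells segmented against a different bin set cannot simply be bound onto the existing matrix.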
We hope you enjoy!
Integration with MLR for non-linear genetic models
Integration with CORE and GISTIC2 for finding focal and recurrent events
Integration with infSCITE for somatic alteration evolution
Integration with Pathview for KEGG pathway visualization
Support for .seg files
Cleaner code with tidyverse
CRAN testing
The GAC framework and code are distributed under a BSD-3 License.