G2G aims to guide downstream comparative analysis of single-cell reference and query systems along any axis of progression (e.g. pseudotime). It does this with a new dynamic programming (DP) based alignment algorithm that unifies dynamic time warping (DTW) and gap modelling to capture both matches and mismatches between time points. Our DP algorithm incorporates a Bayesian information-theoretic scoring scheme with a five-state probabilistic machine to generate an optimal sequential alignment between a reference trajectory (R) and a query trajectory (Q) of a given gene in terms of their scRNA-seq expression. In this way, the G2G framework infers a fully descriptive alignment for each gene of a specified gene set, gene clusters of different alignment patterns, an average (cell-level) alignment across all gene alignments, and further statistics to support downstream analysis (e.g. ranking of genes based on their alignment similarities).
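To make the idea of combining DTW-style warps with gaps concrete, here is a minimal sketch of a DP alignment between two short expression series. It is not the G2G algorithm: it assumes an absolute-difference cost and a fixed `gap_cost` in place of the Bayesian information-theoretic scoring, it does not track state transitions, and the move labels (`M`, `W`, `V`, `D`, `I`) are chosen for this sketch only and need not correspond to the states of the G2G five-state machine.

```python
from typing import List, Tuple


def align_with_gaps(ref: List[float], query: List[float],
                    gap_cost: float = 1.0) -> Tuple[float, str]:
    """Align two series with match, warp and gap moves (illustrative only).

    Move labels (hypothetical, for this sketch):
      M: ref point matched one-to-one with a query point
      W: extra ref point warped onto the current query point
      V: extra query point warped onto the current ref point
      D: ref point left unmatched (gap)
      I: query point left unmatched (gap)
    """
    n, m = len(ref), len(query)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    move = [[""] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                d = abs(ref[i - 1] - query[j - 1])  # assumed distance
                candidates.append((cost[i - 1][j - 1] + d, "M"))
                candidates.append((cost[i - 1][j] + d, "W"))
                candidates.append((cost[i][j - 1] + d, "V"))
            if i > 0:
                candidates.append((cost[i - 1][j] + gap_cost, "D"))
            if j > 0:
                candidates.append((cost[i][j - 1] + gap_cost, "I"))
            # Ties are broken by the order in which candidates were added.
            cost[i][j], move[i][j] = min(candidates, key=lambda c: c[0])

    # Trace back from the final cell to recover the sequence of moves.
    i, j, path = n, m, []
    while i > 0 or j > 0:
        mv = move[i][j]
        path.append(mv)
        if mv == "M":
            i, j = i - 1, j - 1
        elif mv in ("W", "D"):
            i -= 1
        else:  # "V" or "I"
            j -= 1
    return cost[n][m], "".join(reversed(path))


# A repeated reference value is absorbed by a warp at no cost.
warp_cost, warp_path = align_with_gaps([0, 0, 1, 2, 3], [0, 1, 2, 3])
# An outlying query value is cheaper to skip with a gap than to match.
gap_total, gap_path = align_with_gaps([0, 1, 2], [0, 10, 1, 2])
```

In the second call the query value 10 has no counterpart in the reference, so the cheapest alignment leaves it unmatched. This is the kind of mismatch that plain DTW, which must match every point, cannot express.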
The G2G framework can perform comparisons of gene expression dynamics across pseudotime such as:
G2G alignment enables us to pinpoint dynamic similarities and differences in gene expression between a reference and query, as well as to group genes with similar alignment patterns.
We recommend creating a new Conda environment before installing G2G from PyPI, to avoid version conflicts and dependency issues.
conda create --name g2g_env python=3.8
conda activate g2g_env
pip install genes2genes
Or optionally install the latest version directly from GitHub:
pip install git+https://github.com/Teichlab/Genes2Genes.git
(1) Reference anndata object (with adata_ref.X storing log1p transformed gene expression),
(2) Query anndata object (with adata_query.X storing log1p transformed gene expression), and
(3) Pseudotime estimates stored in adata_ref.obs['time'] and adata_query.obs['time'].
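Before running an alignment, it can help to confirm that both objects meet the input requirements listed above. The following is a rough sketch of such a check, not part of the G2G API: the function name `check_g2g_inputs` is hypothetical, it assumes a dense `X` matrix, and it works on any object exposing `.X` and `.obs` (a small stand-in object is used here so the example runs without anndata installed).

```python
from types import SimpleNamespace

import numpy as np


def check_g2g_inputs(adata, time_key="time"):
    """Return a list of problems found; an empty list means the checks passed.

    Hypothetical helper: assumes adata.X is dense and adata.obs supports
    `in` and key lookup (true for a pandas DataFrame and for a dict).
    """
    problems = []
    X = np.asarray(adata.X, dtype=float)
    if not np.all(np.isfinite(X)):
        problems.append("X contains NaN or infinite values")
    elif X.min() < 0:
        # log1p of non-negative expression values can never be negative.
        problems.append("X has negative values, so it is not log1p of counts")

    if time_key not in adata.obs:
        problems.append(f"obs has no '{time_key}' entry")
    else:
        t = np.asarray(adata.obs[time_key], dtype=float)
        if len(t) != X.shape[0]:
            problems.append("pseudotime length does not match number of cells")
        elif not np.all(np.isfinite(t)):
            problems.append("pseudotime contains NaN or infinite values")
    return problems


# Stand-ins for AnnData objects, for illustration only.
good = SimpleNamespace(X=np.log1p(np.array([[0.0, 3.0], [5.0, 1.0]])),
                       obs={"time": [0.1, 0.9]})
bad = SimpleNamespace(X=np.array([[-1.0, 2.0]]), obs={})

good_problems = check_g2g_inputs(good)
bad_problems = check_g2g_inputs(bad)
```

With real data, the same function could be called on `adata_ref` and `adata_query` after converting a sparse `X` to a dense array.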
Note: Please ensure that you have reasonable pseudotime estimates that fairly represent the trajectories, as the accuracy and reliability of trajectory alignment depend entirely on the accuracy and reliability of your pseudotime estimation. We recommend that users inspect whether the cell density distribution along the estimated pseudotime (in terms of meta attributes such as annotated cell types, sampling time points, etc., where applicable) represents each trajectory of interest well. Users can choose the best pseudotime estimates to compare after testing several different pseudotime estimation tools on their datasets.
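One simple way to inspect the cell density along pseudotime is to count cells per pseudotime bin for each annotated group. The sketch below uses only the standard library; the function name `density_by_group`, the equal-width binning, and the example labels are assumptions made for illustration and are not part of G2G.

```python
from collections import defaultdict


def density_by_group(times, groups, n_bins=4):
    """Count cells per equal-width pseudotime bin, separately for each group."""
    lo, hi = min(times), max(times)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all times are equal
    counts = defaultdict(lambda: [0] * n_bins)
    for t, g in zip(times, groups):
        # Clamp so that the maximum pseudotime falls into the last bin.
        b = min(int((t - lo) / width), n_bins - 1)
        counts[g][b] += 1
    return dict(counts)


# Toy pseudotime values and cell-type labels (made up for illustration).
times = [0.0, 0.05, 0.15, 0.45, 0.55, 0.9, 1.0]
groups = ["early", "early", "early", "mid", "mid", "late", "late"]
density = density_by_group(times, groups, n_bins=4)

for label, bins in density.items():
    print(label, bins)
```

With an AnnData object, the inputs could be taken from `adata_ref.obs['time']` and a cell-type annotation column of your own. Bins with very few or no cells across all groups suggest stretches of the trajectory that the pseudotime estimate covers poorly.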
notebooks/Tutorial.ipynb is an example analysis between a reference and query dataset from the literature.
Also refer to https://teichlab.github.io/Genes2Genes on how to read a trajectory alignment output generated by G2G.
The run time of G2G depends on the number of cells in the reference and query datasets, the number of interpolation time points, and the number of genes to align.
Below is a simple run-time analysis of G2G for 89 genes of the reference (NR = 179 cells) and query (NQ = 290 cells) datasets from the literature used in our tutorial.
Note: the number of interpolation points is 14 for the middle plot. (Reference: notebooks/Supplementary_notebook1.ipynb)