Closed dylanmmarshall closed 6 years ago
Hi @dyl4nm4rsh4ll ,
Thanks for the detailed description! To answer the technical meat of your question, "Can I draw conclusions based on distances between semi-distinct sub-clusters?": yes. PHATE is designed such that both local and global distances are meaningful. You could try running PHATE with gamma=0 (equivalent to the old potential_method='sqrt'), which emphasises local distances more strongly, to see if you can tease something more detailed out of the bifurcation.
Re: visualising this kind of data, I would recommend taking up to eight or so markers of interest, running MAGIC (install with pip install magic-impute) on them and plotting them on subplots, e.g.
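import phate
import magic
import matplotlib.pyplot as plt

# cells x genes expression matrix; a pandas DataFrame, so columns can be selected by gene name
data = ...

# 3D PHATE embedding; gamma=0 emphasises local distances
phate_op = phate.PHATE(n_components=3, gamma=0)
data_phate = phate_op.fit_transform(data)

# Smooth only the marker genes of interest with MAGIC
magic_op = magic.MAGIC()
genes = ['CTLA4', 'CD4', 'CD8A']
data_magic = magic_op.fit_transform(data, genes=genes)

# One 3D subplot per gene, coloured by its MAGIC-smoothed expression
fig, axes = plt.subplots(1, len(genes), subplot_kw={'projection': '3d'})
for i, gene in enumerate(genes):
    phate.plot.scatter3d(data_phate, c=data_magic[gene], title=gene, ax=axes[i], legend=False)
plt.tight_layout()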
This will perhaps be easier to look at than the RGB values, and MAGIC will smooth out a lot of the missing values you're seeing. I'm not super familiar with the system, but if you gate the cells based on the bifurcation and look for differentially expressed genes between the branches (perhaps using EMD after MAGIC with genes='all_genes'), you might be able to pick out which genes are driving the differences between the green cells at the ends of the two branches.
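To make the EMD suggestion a bit more concrete, here is a minimal sketch of scoring genes by the 1-D earth mover's distance between two gated branches. It is only an illustration, not a tested pipeline: the helper names emd_1d and rank_genes_by_emd are hypothetical (they are not part of PHATE or MAGIC), the input is assumed to be one dict per branch mapping each gene to the MAGIC-smoothed values of the gated cells, and the numbers in the example are made up.

```python
from bisect import bisect_right


def emd_1d(a, b):
    """1-D earth mover's distance: area between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    grid = sorted(set(a) | set(b))
    total = 0.0
    # Between consecutive grid points both CDFs are constant, so the area is
    # |CDF_a - CDF_b| times the width of the interval.
    for left, right in zip(grid, grid[1:]):
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


def rank_genes_by_emd(branch_a, branch_b):
    """Rank genes present in both branches by EMD, largest difference first.

    branch_a / branch_b: dict mapping gene name -> list of expression values
    for the cells gated into that branch (assumed input format).
    """
    scores = {g: emd_1d(branch_a[g], branch_b[g]) for g in branch_a if g in branch_b}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy values, purely illustrative
branch_a = {"CD4": [2.0, 2.1], "CD8A": [0.1, 0.2]}
branch_b = {"CD4": [0.1, 0.2], "CD8A": [0.1, 0.2]}
ranked = rank_genes_by_emd(branch_a, branch_b)
```

If SciPy is available, scipy.stats.wasserstein_distance computes the same 1-D quantity and could replace emd_1d. Note that EMD is unsigned, so you would still need to look at the means (or the plots) to see which branch has the higher expression.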
Thanks for the prompt and equally detailed answer!
I'll try out modulating gamma and see what I can get out. I'm not sure if the person I'm working with is a fan of data imputation but I'll give MAGIC a shot - if even just for the coloring aspect itself. I dig the CoLab tutorial for it by the way.
I like the idea of using EMD, or perhaps some other distance function, to determine differentially expressed genes within the sub-clusters. Perhaps it'll be more clear after I try out MAGIC but I can see myself having to make arbitrary distinctions between PHATE sub-clusters before applying that calculation - which could be problematic.
Is this an appropriate medium for these kinds of questions? GitHub issues seems more for code but I didn't come across a discussion board or similar.
No worries! Always a pleasure to help out. One note re: making distinctions between PHATE subclusters: we've found success in embedding the data with PHATE in many (10 or 20) dimensions, clustering with K-Means and then re-embedding in 2D to visualise. The clusters are then coherent with the PHATE embedding without having to be picked manually. It gives similar results to spectral clustering, but with some denoising added in.
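Roughly, that workflow could look like the sketch below. Treat it as an outline under assumptions rather than a recipe: cluster_on_phate is a hypothetical helper name, n_clusters=8 and the seed are arbitrary placeholders, K-Means comes from scikit-learn, and it assumes your installed PHATE version lets you change n_components with set_params and then call transform() with no arguments to re-embed the already fitted data.

```python
def cluster_on_phate(data, n_clusters=8, n_dims=10, seed=42):
    """Cluster cells in a many-dimensional PHATE space, then re-embed in 2D.

    Returns (data_2d, labels): 2D PHATE coordinates for plotting and one
    K-Means cluster label per cell.
    """
    # Imported here so that defining the helper does not require the
    # optional dependencies to be installed.
    import phate
    from sklearn.cluster import KMeans

    # Step 1: embed in many dimensions (10 here; 20 is the other value mentioned)
    phate_op = phate.PHATE(n_components=n_dims, random_state=seed)
    data_hd = phate_op.fit_transform(data)

    # Step 2: K-Means on the high-dimensional embedding
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(data_hd)

    # Step 3: re-embed the same fitted operator in 2D for visualisation
    phate_op.set_params(n_components=2)
    data_2d = phate_op.transform()
    return data_2d, labels
```

The labels can then be passed as c=labels to phate.plot.scatter2d (or any scatter plot) on data_2d, and used as the groups for a differential expression comparison instead of hand-drawn gates.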
We do have a Slack discussion group at www.krishnaswamylab.org/get-help; you're welcome to continue asking questions either here or there, as you prefer.
Gotcha, seems like a reasonable approach - will give that a shot as well!
Great, just joined. Thanks for the pointer.
Am looking for ideas on how best to describe the output of PHATE as applied to some CAR T-cell 10X Genomics scRNA-seq data I've got ahold of. Biological context: these are CAR T-cells after transfection with a construct that allows them to target cancer cells. The hope is that a dimensionality reduction technique could reveal something about the relationship between sub-populations in this data. E.g.: can I draw conclusions based on distances between semi-distinct sub-clusters? How should I interpret the apparent bifurcation in the plot below?
Had to come up with ad-hoc coloring scheme because there aren't time points of this data. Maybe a different dim. reduction technique would be more appropriate but plots below, by eye, suggest something?
Coloring scheme is based on three genes of interest and employs RGB combinations to reflect expression of these three genes - low (L), medium (M), or high (H) (categories are arbitrary percentiles). Radius of dots reflects relative expression of all three of the genes - the more expression, the bigger the radius. CTLA4 exists on the "B" axis (z-axis here), CD4 exists on the "R" axis, and CD8A exists on the "G" axis.
^ key
My interpretation is there is a definite split in identity of cells that are either expressing high CTLA4 and high CD4 or are expressing high CD8A. This coincidentally splits the color-scheme in two. Any opinions? The horseshoe shape and bifurcation point... it's quite mysterious.
^ view 1
^ same plot, different angle
^ same plot, rotated again