KlugerLab / GeneTrajectory

R implementation of GeneTrajectory
https://www.nature.com/articles/s41587-024-02186-3

About the selection of the terminus in gene trajectory #11

Closed Tom-900 closed 3 weeks ago

Tom-900 commented 3 weeks ago

Hi!

Thank you very much for developing this method. When reading the paper, I was confused by the following paragraph from step 3 of the Methods section (Construct gene trajectories):

[image: paragraph quoted from the paper]

Why can we make such an assumption? Is there any further explanation or reference for this point? Thank you very much, and I hope to hear from you.

Best, Tom

RihaoQu commented 3 weeks ago

Thank you for your question. Below I copied the response from Xiuyuan, the co-first author of the GeneTrajectory paper.


I think the question is about the rationale for using the "spectral norm" ||S(x_i)|| to select the endpoint of the trajectory. The theoretical interpretation of data samples with a large "spectral norm" can be found in the following papers:

[1] X. Cheng and G. Mishne. "Spectral embedding norm: looking deep into the spectrum of the graph Laplacian". SIAM Journal on Imaging Sciences (2020). arXiv:1810.10695

[2] X. Cheng, G. Mishne, and S. Steinerberger. "The geometry of nodal sets and outlier detection". Journal of Number Theory (2017). arXiv:1706.01362

The basic idea is that the sample points with the largest spectral norm are those that are most "representative" within a cluster, in the sense of most uniquely belonging to that cluster and being dissimilar to data points outside it. This is reflected in the analysis in the above two papers under the setting of outlier points/clusters, and the experimental examples go beyond clustering - see the manifold+outlier case in [1].

In the case of gene trajectories, the situation is not exactly clustering; rather, several trajectories stem from a "middle cohort". It still holds that the endpoint of a trajectory is "connected/similar to" points along that trajectory but very dissimilar to genes in any other trajectory or in the middle cohort. Thus it can be interpreted as a "representative point" or "outlier," and we expect that a large spectral norm can identify it.
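To make the intuition concrete, here is a minimal Python sketch on toy data. It is not the GeneTrajectory implementation. The function name `spectral_embedding_norm`, the Gaussian kernel, the parameters `sigma`, `k`, and `t`, and the three-arm toy layout are all assumptions chosen for illustration. The sketch builds a diffusion-map-style embedding from the leading nontrivial eigenvectors and picks the point with the largest embedding norm, which on this layout should fall at or near the tip of one arm rather than in the shared center.

```python
import numpy as np

def spectral_embedding_norm(points, sigma=0.1, k=3, t=1):
    """Norm of each point in a k-dimensional diffusion-map-style embedding.

    Illustrative only: the kernel, sigma, k, and t are assumed values.
    """
    # Gaussian affinities between all pairs of points.
    sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    d = W.sum(axis=1)

    # The symmetric normalization D^-1/2 W D^-1/2 shares its eigenvalues
    # with the random-walk matrix D^-1 W and is safe to pass to eigh.
    S = W / np.sqrt(np.outer(d, d))
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Convert to right eigenvectors of D^-1 W, then drop the trivial
    # constant eigenvector (index 0) and keep the next k.
    psi = eigvecs / np.sqrt(d)[:, None]
    embedding = psi[:, 1:k + 1] * eigvals[1:k + 1] ** t
    return np.linalg.norm(embedding, axis=1)

# Toy data: three arms of 30 points each, radiating from a shared center
# at the origin (a stand-in for trajectories leaving a "middle cohort").
angles = np.deg2rad([0.0, 120.0, 240.0])
radii = np.linspace(1.0 / 30, 1.0, 30)
arms = [np.outer(radii, [np.cos(a), np.sin(a)]) for a in angles]
points = np.vstack([np.zeros((1, 2))] + arms)

norms = spectral_embedding_norm(points)
terminus = int(np.argmax(norms))               # candidate trajectory endpoint
print("selected point:", points[terminus])
print("distance from center:", np.linalg.norm(points[terminus]))
```

The low-frequency eigenvectors behave roughly like cosines along each arm, so they peak in magnitude at the free ends and are small or cancel near the junction. That is why the norm singles out a tip in this toy case.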

The procedure was also used in an earlier paper by Gal and Raphy [3] (https://www.biorxiv.org/content/10.1101/313981v1.abstract), where it is used to extract interesting clusters.

Tom-900 commented 3 weeks ago

Thank you so much for your kind reply! This is a very clear and thorough explanation which helps me a lot. Thanks again for your help.