dpeerlab / Palantir

Single cell trajectory detection
https://palantir.readthedocs.io
GNU General Public License v2.0

Gene expression trends #141

Status: Closed (DAOl44732 closed 6 months ago)

DAOl44732 commented 6 months ago

Dear Palantir developers, Palantir is a powerful tool for working with pseudotime, but I'm having a problem with it. This is the code I used:

palantir.plot.plot_palantir_results(ad)
masks = palantir.presults.select_branch_cells(ad, eps=0)
palantir.plot.plot_trajectory(ad, "c3_DNT_APOE")

[2024-05-14 14:26:31,993] [INFO ] Using non-sparse Gaussian Process since n_landmarks (50) >= n_samples (0) and rank = 1.0.
[2024-05-14 14:26:31,993] [INFO ] Using covariance function Matern52(ls=1.0009662628173828).
[2024-05-14 14:26:31,994] [INFO ] Recomputing covariance decomposition for predictive function.
<Axes: title={'center': 'Branch: c3_DNT_APOE'}>

The results are as follows:

[Image: WeChat image_20240514162702]
[Image: WeChat image_20240514162708]

It looks as if there are no cells on this branch. Is this a bug in the code, or does the branch really not exist biologically? Hoping for your reply!

katosh commented 6 months ago

Hi @DAOl44732, thank you for reporting! Could you provide us with the output of palantir.plot.plot_branch_selection(ad)? This may help us understand why there are no cells selected for the branch, and to consider changing some parameters of the branch selection step.

DAOl44732 commented 6 months ago

I'm sorry. That was my mistake. The output of palantir.plot.plot_branch_selection(ad) is as follows:

palantir.plot.plot_branch_selection(ad)
<Figure size 1500x1500 with 6 Axes>

[Image: Capture]

And this is the result of masks = palantir.presults.select_branch_cells(ad, eps=0):

masks = palantir.presults.select_branch_cells(ad, eps=0)
print(masks)
[[False False False]
 [False False False]
 [False False False]
 ...
 [False False False]
 [False False False]
 [False False False]]
print("Selected cells:", masks.sum())
Selected cells: 0

Hoping for your reply!

DAOl44732 commented 6 months ago

[Image: WeChat image_20240515013250]

katosh commented 6 months ago

Thank you! It seems no cells are being selected.

This could be due to two different reasons.

1. NaNs in the fate probabilities

Depending on which version of Palantir you are using, there might be NaNs in the fate probability values that prevent a good branch selection. You can see the number of NaNs in the fate probabilities by running ad.obsm["palantir_fate_probabilities"].isna().sum(). What does your output look like?

2. Too stringent parameters

It might also be due to a stringent choice of parameters such as eps=0. While this is the value used for the tutorial dataset, higher values often work better: either leave eps at its default or set it to something higher, between 0 and 1. The same is true for the parameter q, which can also be increased to be more tolerant and to include more cells in the branch selection.

DAOl44732 commented 6 months ago

Thank you! The output of ad.obsm["palantir_fate_probabilities"].isna().sum() is as follows:

ad.obsm["palantir_fate_probabilities"].isna().sum()
HC_HRA001261_HRS280155_PT_TGTTTGTCAGCCTATA    3
LC_GSE200972_GSM6047625_PT_CTGCATCTCTTCCTAA    3
OC_GSE184880_GSM5599227_PT_GCACGTGCAATAGAGT    3

In addition, we adjusted eps, but the result still showed no cells:

masks = palantir.presults.select_branch_cells(ad, eps=0.99)
print("Selected cells:", masks.sum())
Selected cells: 0
masks = palantir.presults.select_branch_cells(ad, eps=1)
print("Selected cells:", masks.sum())
Selected cells: 0
masks = palantir.presults.select_branch_cells(ad, eps=0)
print("Selected cells:", masks.sum())
Selected cells: 0

katosh commented 6 months ago

Thanks! It seems like there are a total of 9 NaN values in the fate probabilities. This should fix the problem:

ad.obsm["palantir_fate_probabilities"] = ad.obsm["palantir_fate_probabilities"].fillna(0)

You can still play around with eps and q. Usually values between 0 and 0.2 work well.

Thank you for reporting. I will try to make this more robust in a future patch.
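For readers who want to see what the NaN check and the fillna(0) workaround do, here is a minimal sketch on a toy DataFrame that stands in for ad.obsm["palantir_fate_probabilities"]. The cell and branch names are made up for illustration, and the sketch only assumes that the fate probabilities are stored as a pandas DataFrame, as the isna() call in this thread suggests.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ad.obsm["palantir_fate_probabilities"]:
# rows are cells, columns are terminal states (all names are hypothetical).
fate_probs = pd.DataFrame(
    {
        "branch_A": [0.7, np.nan, 0.2, 0.1],
        "branch_B": [0.2, np.nan, 0.5, 0.3],
        "branch_C": [0.1, np.nan, 0.3, 0.6],
    },
    index=["cell_1", "cell_2", "cell_3", "cell_4"],
)

# NaN count per terminal state, and the total over all columns.
nan_per_branch = fate_probs.isna().sum()
total_nans = int(nan_per_branch.sum())

# Cells that carry at least one NaN fate probability.
affected_cells = fate_probs.index[fate_probs.isna().any(axis=1)].tolist()

# The workaround from this thread: replace NaNs with 0.
fixed = fate_probs.fillna(0)

print(nan_per_branch)
print("Total NaNs:", total_nans, "in cells:", affected_cells)
```

Listing the affected cells before filling can be worthwhile, since a cell whose fate probabilities are all NaN will end up with probability 0 for every branch after the fill.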

katosh commented 6 months ago

The latest version on GitHub should now also be able to do branch selection if there are NaNs in the fate probabilities. You can try it out by installing it with

pip install 'git+https://github.com/dpeerlab/Palantir'

Please let me know if you need any further help with this issue!

DAOl44732 commented 6 months ago

Thank you very much! I solved the problem using ad.obsm["palantir_fate_probabilities"] = ad.obsm["palantir_fate_probabilities"].fillna(0). The result is as follows:

[Image: WeChat image_20240517000546]

I would like to know exactly which cell subpopulations this trajectory passes through. Is this achievable? Thanks again for your answers!

katosh commented 6 months ago

I think the boolean masks generated for the branch selection might help you. You can, e.g., use the masks to subset your anndata, then count and plot the cells of the specific branches. E.g.:

import scanpy as sc
branch_name = "c3_DNT_APOE"
# Boolean mask of the cells selected for this branch
mask = ad.obsm["branch_masks"][branch_name]
sc.pl.embedding(ad[mask, :], "umap", color="celltype")

The code above assumes that you have a column ad.obs["celltype"] in your anndata.
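To get counts rather than a plot, the same mask can be applied to the cell annotations. The sketch below uses a toy table in place of ad.obs; the cell type labels and pseudotime values are invented, and the column name "palantir_pseudotime" is assumed to be where the pseudotime is stored. It counts the cells per cell type within a branch and orders the cell types by their median pseudotime, which gives a rough idea of the order in which the trajectory passes through them.

```python
import pandas as pd

# Toy stand-in for ad.obs (cell type labels and pseudotime values are made up).
obs = pd.DataFrame(
    {
        "celltype": ["HSC", "HSC", "Mono", "Ery", "Mono", "Prog"],
        "palantir_pseudotime": [0.05, 0.10, 0.80, 0.60, 0.90, 0.40],
    },
    index=[f"cell_{i}" for i in range(6)],
)

# Toy stand-in for ad.obsm["branch_masks"][branch_name].
branch_mask = pd.Series([True, True, True, False, True, True], index=obs.index)

# Restrict the annotations to the cells of this branch.
branch_obs = obs.loc[branch_mask]

# Number of branch cells per cell type.
counts = branch_obs["celltype"].value_counts()

# Cell types ordered by median pseudotime along the branch.
order = (
    branch_obs.groupby("celltype")["palantir_pseudotime"]
    .median()
    .sort_values()
    .index.tolist()
)

print(counts)
print("Order along the branch:", order)
```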

yitengfei120011 commented 3 months ago

Hello, how is the gene expression trend calculated by Mellon? I have read the Mellon article, but I still have questions about how the gene expression trend is calculated. Could you please describe the calculation in detail?

katosh commented 3 months ago

Hello @yitengfei120011,

Thank you for your question! I'd be happy to clarify how Mellon calculates the gene expression trend.

Overview of the Process:

Mellon models the gene expression trend using a Gaussian Process (GP). A GP is a probabilistic model where any collection of random variables (in this case, gene expression values at different pseudotime points) has a joint Gaussian distribution. The trend function we estimate is considered as a sample from this GP.

Covariance Structure:

In Mellon, the covariance between function values is defined by the Matern52 kernel, which is a common choice for modeling smooth, yet flexible functions. The key parameters for this kernel are:

The covariance function essentially encodes our belief that points close in pseudotime should have more similar expression values, while points further apart might have less similar values.
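As an illustration of that idea, here is a small NumPy sketch of the standard Matern 5/2 covariance on one-dimensional pseudotime values. It is the textbook formula with a single length scale ls, written for this thread; it is not taken from Mellon's source code, which may scale or parameterize the kernel differently.

```python
import numpy as np


def matern52(x1, x2, ls=1.0):
    """Textbook Matern 5/2 covariance between two 1-D arrays of pseudotime values."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # Pairwise distances, scaled by the length scale.
    r = np.abs(x1[:, None] - x2[None, :]) / ls
    s = np.sqrt(5.0) * r
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)


# Three pseudotime points: two close together, one far away.
pseudotime = np.array([0.0, 0.1, 1.0])
K = matern52(pseudotime, pseudotime, ls=0.5)
print(np.round(K, 3))
```

The covariance between the two nearby points is close to 1, while the covariance with the distant point is much smaller; a larger ls makes the covariance decay more slowly and the resulting trend smoother.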

Conditioning on Observed Data:

Once the GP is defined, we condition the trend on the observed gene expression data across cells. This is done by leveraging the properties of the Multivariate Normal distribution, allowing us to update our prior belief (the GP) with the actual data to get a posterior distribution. The mean of this posterior distribution is the gene expression trend that Mellon estimates.
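The conditioning step can be sketched in a few lines of NumPy. This is generic zero-mean GP regression with Gaussian observation noise on simulated data, meant only to show where the posterior mean comes from; the noise level, the length scale, and the zero prior mean are assumptions of this sketch, not Mellon's defaults or its actual implementation.

```python
import numpy as np


def matern52(x1, x2, ls=1.0):
    """Textbook Matern 5/2 covariance for 1-D inputs."""
    r = np.abs(np.asarray(x1, float)[:, None] - np.asarray(x2, float)[None, :]) / ls
    s = np.sqrt(5.0) * r
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)


def gp_posterior_mean(x_train, y_train, x_eval, ls=1.0, noise=0.1):
    """Posterior mean of a zero-mean GP: K_*x (K_xx + noise^2 I)^-1 y."""
    K = matern52(x_train, x_train, ls) + noise**2 * np.eye(len(x_train))
    K_star = matern52(x_eval, x_train, ls)
    return K_star @ np.linalg.solve(K, y_train)


# Simulated data: noisy expression of one gene along pseudotime.
rng = np.random.default_rng(0)
pseudotime = np.sort(rng.uniform(0.0, 1.0, 200))
expression = np.sin(2 * np.pi * pseudotime) + rng.normal(0.0, 0.1, 200)

# Evaluate the trend on a regular pseudotime grid.
grid = np.linspace(0.0, 1.0, 50)
trend = gp_posterior_mean(pseudotime, expression, grid, ls=0.3, noise=0.1)
```

The array trend is a smooth curve through the noisy observations, and it is the analogue of the gene expression trend described above.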

Inducing Points for Efficiency:

For large datasets, directly computing the trend can be computationally expensive due to the size of the covariance matrix. Mellon addresses this by using inducing "landmark" points. These points act as a sparse approximation to the full dataset, enabling efficient computation without significantly compromising accuracy. The number of inducing points is typically set to match the number of grid points where the trend is evaluated.
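To give a feel for why landmarks help, the sketch below uses the subset-of-regressors approximation, one common sparse GP technique. It is swapped in here purely for illustration and is not claimed to be the approximation Mellon implements. The linear system that has to be solved is only as large as the number of landmarks, not the number of cells. All parameter values are assumptions of the sketch.

```python
import numpy as np


def matern52(x1, x2, ls=1.0):
    """Textbook Matern 5/2 covariance for 1-D inputs."""
    r = np.abs(np.asarray(x1, float)[:, None] - np.asarray(x2, float)[None, :]) / ls
    s = np.sqrt(5.0) * r
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)


def sparse_gp_mean(x_train, y_train, x_eval, landmarks, ls=1.0, noise=0.1):
    """Subset-of-regressors mean: K_*u (noise^2 K_uu + K_ux K_xu)^-1 K_ux y."""
    K_uu = matern52(landmarks, landmarks, ls)
    K_ux = matern52(landmarks, x_train, ls)
    K_su = matern52(x_eval, landmarks, ls)
    A = noise**2 * K_uu + K_ux @ K_ux.T
    A += 1e-6 * np.eye(len(landmarks))  # small jitter for numerical stability
    return K_su @ np.linalg.solve(A, K_ux @ y_train)


# Simulated data: 500 cells, but only 20 landmarks.
rng = np.random.default_rng(0)
pseudotime = np.sort(rng.uniform(0.0, 1.0, 500))
expression = np.sin(2 * np.pi * pseudotime) + rng.normal(0.0, 0.1, 500)

landmarks = np.linspace(0.0, 1.0, 20)
grid = np.linspace(0.0, 1.0, 50)
trend = sparse_gp_mean(pseudotime, expression, grid, landmarks, ls=0.3, noise=0.1)
```

Here the solve involves a 20 x 20 matrix instead of a 500 x 500 one, and the cost of building it grows only linearly with the number of cells.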

Further Reading:

For a deeper understanding of these concepts, I recommend exploring resources on Gaussian Processes. Some starting points include:

I hope this helps clarify the process. Let me know if you have any further questions!