KlugerLab / GeneTrajectory-python

Python implementation of Gene Trajectory
MIT License
18 stars 2 forks source link

Gene_dist_mat #9

Closed DAOl44732 closed 4 months ago

DAOl44732 commented 4 months ago

Dear Authors: Gene Trajectory is a great tool , but I've run into a couple of problems with it. The code I used as follows: run_dm(adata) cell_graph_dist = get_graph_distance(adata, k=10) gene_expression_updated, graph_dist_updated = coarse_grain_adata(adata, graph_dist=cell_graph_dist, features=genes, n=500) gene_dist_mat = cal_ot_mat(gene_expr=gene_expression_updated, ot_cost=graph_dist_updated, show_progress_bar=True)

geneembedding, = get_gene_embedding(gene_dist_mat, k = 5) /home/ps/miniconda3/envs/scanpy_env/lib/python3.9/site-packages/gene_trajectory/diffusion_map.py:62: RuntimeWarning: invalid value encountered in divide affinity_matrix = np.exp(-dists 2 / (sigma 2)[:, None]) gene_trajectory = extract_gene_trajectory(gene_embedding, gene_dist_mat, t_list = [4, 8, 7], gene_names=genes, k=5) /home/ps/miniconda3/envs/scanpy_env/lib/python3.9/site-packages/gene_trajectory/diffusion_map.py:62: RuntimeWarning: invalid value encountered in divide affinity_matrix = np.exp(-dists 2 / (sigma 2)[:, None]) Traceback (most recent call last): File "", line 1, in File "/home/ps/miniconda3/envs/scanpy_env/lib/python3.9/site-packages/gene_trajectory/extract_gene_trajectory.py", line 148, in extract_gene_trajectory df[f'Pseudoorder-{i + 1}'] = get_gene_pseudoorder(dist_mat=dist_mat, File "/home/ps/miniconda3/envs/scanpy_env/lib/python3.9/site-packages/gene_trajectory/extract_gene_trajectory.py", line 73, in get_gene_pseudoorder dmemb, = diffusion_map(emd) File "/home/ps/miniconda3/envs/scanpy_env/lib/python3.9/site-packages/gene_trajectory/diffusion_map.py", line 22, in diffusion_map affinity_matrix_symm = get_symmetrized_affinity_matrix(dist_mat=dist_mat, k=k, sigma=sigma) File "/home/ps/miniconda3/envs/scanpy_env/lib/python3.9/site-packages/gene_trajectory/diffusion_map.py", line 58, in get_symmetrized_affinity_matrix sigma = np.apply_along_axis(func1d=sorted, axis=1, arr=dists)[:, k - 1] # noqa File "/home/ps/miniconda3/envs/scanpy_env/lib/python3.9/site-packages/numpy/lib/shape_base.py", line 376, in apply_along_axis raise ValueError( ValueError: Cannot apply_along_axis when any iteration dimensions are 0 and I checked the 'gene_dist_mat' and 'X_dm', print(adata.obsm['X_dm']) [[ 6.1366682e-06 -8.7046203e-05 6.8969367e-04 ... 3.3708107e-05 -1.7335853e-05 2.1205349e-05] [ 3.4746859e-06 -1.7804912e-05 -6.5107843e-06 ... -1.0492504e-04 5.4499826e-05 -6.2288455e-05] [ 5.5011401e-06 -7.2641895e-05 5.3631212e-04 ... 3.1167172e-05 -1.8761852e-05 2.2736163e-05] ... [ 6.1141473e-06 -7.6604767e-05 5.5249571e-04 ... 2.9277440e-05 1.0723315e-06 -8.2528604e-06] [ 5.8460914e-06 -7.1613511e-05 5.2219094e-04 ... -6.3602209e-05 3.6228128e-05 -4.1688359e-05] [ 5.7346579e-06 -7.1003786e-05 5.1055290e-04 ... 1.3364493e-06 9.8475393e-06 -1.5540380e-05]] gene_dist_mat array([[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]]) What's wrong with this? Looking forward to your reply!

fra-pcmgf commented 4 months ago

Hi @DAOl44732 ,

I can see that your gene_dist_mat is all zeroes.

Can you try to rerun gene_dist_mat = cal_ot_mat(gene_expr=gene_expression_updated, ot_cost=graph_dist_updated, show_progress_bar=True) and show me what the input and outputs are?

DAOl44732 commented 4 months ago

ok,this is the output

gene_dist_mat = cal_ot_mat(gene_expr=gene_expression_updated, ot_cost=graph_dist_updated, show_progress_bar=True) /home/ps/miniconda3/envs/scanpy_env/lib/python3.9/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock. self.pid = os.fork() 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 428275/428275 [00:05<00:00, 81113.79it/s] what is the input?

fra-pcmgf commented 4 months ago

OK, it looks like there may be an issue with multiprocessing. You can try to fix it by

1) do multiprocessing.set_start_method spawn, which is typically the default. To do this, you can try the following command. If it doesn't work you may need to restart the python interpreter and do it before running gene-trajectory

import multiprocessing as mp
mp.set_start_method('spawn', force=True)

2) It may help to install a more recent Python version (e.g. 3.11). Newer versions of multiprocessing typically have a more robust default

fra-pcmgf commented 4 months ago

Can you also check that the gene_expression_updated, graph_dist_updated matrices that you use as input of cal_ot_mat don't have obvious issues (e.g. wrong dimensions, Nans, all entries being 0, ...)?

DAOl44732 commented 4 months ago

It seems that the cal_ot_mat have issues

gene_expression_updated array([[ 9.86707022e+01, 9.59792075e+01, -1.30310043e+01, ..., 1.23937154e+02, 1.01220958e+02, 7.18060511e+01], [ 3.89007300e-01, 5.48265398e-01, -8.57599825e-02, ..., 6.01899922e-01, 5.99583149e-01, 2.50267923e-01], [-4.85233915e+00, 5.72954202e+00, -2.28470582e+00, ..., 1.36198275e+00, 8.39792871e+00, -5.84423506e+00], ..., [ 3.41036916e-01, 5.57549655e-01, -9.64087620e-02, ..., 5.82365632e-01, 6.23681784e-01, 2.02495679e-01], [ 5.82811045e+01, -2.43654607e+01, 1.15525911e+01, ..., 1.74381011e+01, -3.95450627e+01, 5.91068948e+01], [-1.46138287e+01, 6.58543986e+00, -6.90161020e+00, ..., -1.69936148e+00, 1.54968016e+01, -1.63296460e+01]]) cal_ot_mat <function cal_ot_mat at 0x7f1b9257fa60> graph_dist_updated array([[ 4.45715851, 21.46815287, 50.06369427, ..., 17.46815287, 45.21132227, 23.83098472], [21.46815287, 0. , 54. , ..., 4. , 29.7431694 , 9.36283186], [50.06369427, 54. , 0.66666667, ..., 50. , 77.7431694 , 56.36283186], ..., [17.46815287, 4. , 50. , ..., 0. , 27.7431694 , 6.36283186], [45.21132227, 29.7431694 , 77.7431694 , ..., 27.7431694 , 6.6715638 , 27.82049422], [23.83098472, 9.36283186, 56.36283186, ..., 6.36283186, 27.82049422, 3.74093508]])

DAOl44732 commented 4 months ago

Thank for your reply! I used the codes, but it doesn't work . import multiprocessing as mp mp.set_start_method('spawn', force=True)

geneembedding, = get_gene_embedding(gene_dist_mat, k = 5) /home/ps/miniconda3/envs/scanpy_env/lib/python3.9/site-packages/gene_trajectory/diffusion_map.py:62: RuntimeWarning: invalid value encountered in divide affinity_matrix = np.exp(-dists 2 / (sigma 2)[:, None]) gene_trajectory = extract_gene_trajectory(gene_embedding, gene_dist_mat, t_list = [4, 8, 7], gene_names=genes, k=5) /home/ps/miniconda3/envs/scanpy_env/lib/python3.9/site-packages/gene_trajectory/diffusion_map.py:62: RuntimeWarning: invalid value encountered in divide affinity_matrix = np.exp(-dists 2 / (sigma 2)[:, None]) Traceback (most recent call last): File "", line 1, in File "/home/ps/miniconda3/envs/scanpy_env/lib/python3.9/site-packages/gene_trajectory/extract_gene_trajectory.py", line 148, in extract_gene_trajectory df[f'Pseudoorder-{i + 1}'] = get_gene_pseudoorder(dist_mat=dist_mat, File "/home/ps/miniconda3/envs/scanpy_env/lib/python3.9/site-packages/gene_trajectory/extract_gene_trajectory.py", line 73, in get_gene_pseudoorder dmemb, = diffusion_map(emd) File "/home/ps/miniconda3/envs/scanpy_env/lib/python3.9/site-packages/gene_trajectory/diffusion_map.py", line 22, in diffusion_map affinity_matrix_symm = get_symmetrized_affinity_matrix(dist_mat=dist_mat, k=k, sigma=sigma) File "/home/ps/miniconda3/envs/scanpy_env/lib/python3.9/site-packages/gene_trajectory/diffusion_map.py", line 58, in get_symmetrized_affinity_matrix sigma = np.apply_along_axis(func1d=sorted, axis=1, arr=dists)[:, k - 1] # noqa File "/home/ps/miniconda3/envs/scanpy_env/lib/python3.9/site-packages/numpy/lib/shape_base.py", line 376, in apply_along_axis raise ValueError( ValueError: Cannot apply_along_axis when any iteration dimensions are 0 Is there any other solution?

fs-ravenbiosciences commented 4 months ago

I can't see anything in your log.

Can you try running this code to check that multiprocessing and cal_ot_mat work on your system?

import numpy as np 
from gene_trajectory.gene_distance_shared import cal_ot_mat

gdm = np.array([
        [0, 1, 2],
        [1, 0, 2],
        [2, 2, 1]])

gem = np.array([
        [.3, .2, .6],
        [.1, .3, .2],
        [.6, .5, .2]])

expected_emd = np.array([
    [0.0, 0.8, 1.0],
    [0.8, 0.0, 0.9],
    [1.0, 0.9, 0.0]])

mt = cal_ot_mat(gdm, gem)

np.testing.assert_almost_equal(expected_emd, mt, 6)
DAOl44732 commented 4 months ago

ok,the results as follows

import numpy as np from gene_trajectory.gene_distance_shared import cal_ot_mat

gdm = np.array([ ... [0, 1, 2], ... [1, 0, 2], ... [2, 2, 1]])

gem = np.array([ ... [.3, .2, .6], ... [.1, .3, .2], ... [.6, .5, .2]])

expected_emd = np.array([ ... [0.0, 0.8, 1.0], ... [0.8, 0.0, 0.9], ... [1.0, 0.9, 0.0]])

mt = cal_ot_mat(gdm, gem) 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 21546.08it/s]

np.testing.assert_almost_equal(expected_emd, mt, 6)

fs-ravenbiosciences commented 4 months ago

ok, that seems to work.

I am wondering why in the other logs I don't see any of the output that looks like

█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 21546.08it/s]

Can you check that you actually running cal_ot_mat in your input/script? Also, what is the input (gene_expression_updated, graph_dist_updated) and output (gene_dist_mat) when you do?

DAOl44732 commented 4 months ago

you mean this? gene_dist_mat = cal_ot_mat(gene_expr=gene_expression_updated, ot_cost=graph_dist_updated, show_progress_bar=True) 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 428275/428275 [00:05<00:00, 72053.15it/s]

fra-pcmgf commented 4 months ago

yes. Can you print gene_expression_updated , graph_dist_updated and gene_dist_mat?

DAOl44732 commented 4 months ago

Of course,

gene_expression_updated array([[ 9.93878015e+01, 9.65442272e+01, -1.30874277e+01, ..., 1.24742602e+02, 1.01785449e+02, 7.23576699e+01], [ 3.89007300e-01, 5.48265398e-01, -8.57599825e-02, ..., 6.01899922e-01, 5.99583149e-01, 2.50267923e-01], [-4.85233915e+00, 5.72954202e+00, -2.28470582e+00, ..., 1.36198275e+00, 8.39792871e+00, -5.84423506e+00], ..., [ 3.41036916e-01, 5.57549655e-01, -9.64087620e-02, ..., 5.82365632e-01, 6.23681784e-01, 2.02495679e-01], [ 5.82811045e+01, -2.43654607e+01, 1.15525911e+01, ..., 1.74381011e+01, -3.95450627e+01, 5.91068948e+01], [-1.46138287e+01, 6.58543986e+00, -6.90161020e+00, ..., -1.69936148e+00, 1.54968016e+01, -1.63296460e+01]]) graph_dist_updated array([[ 4.47460343, 21.47468354, 50.07278481, ..., 17.47468354, 45.21785294, 23.8375154 ], [21.47468354, 0. , 54. , ..., 4. , 29.7431694 , 9.36283186], [50.07278481, 54. , 0.66666667, ..., 50. , 77.7431694 , 56.36283186], ..., [17.47468354, 4. , 50. , ..., 0. , 27.7431694 , 6.36283186], [45.21785294, 29.7431694 , 77.7431694 , ..., 27.7431694 , 6.6715638 , 27.82049422], [23.8375154 , 9.36283186, 56.36283186, ..., 6.36283186, 27.82049422, 3.74093508]]) gene_dist_mat array([[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]])

fra-pcmgf commented 4 months ago

I can see some elements in gene_expression_updated are negative (e.g. -1.30874277e+01).

The issue is that cal_ot_mat treats the gene expression as a distribution when calculating the Earth Mover's distance between genes and, when it sees a negative value, it gives up and return 0. This is probably why gene_dist_mat has only zero elements and cal_ot_mat runs so quickly. I'll add a check for this in the next version (#10).

What values are you using as input for the gene_expression? Ideally you should use counts as the method used in select_top_genes is based on count data. See e.g. the tutorial on how to set up a layer with count data when counts are stored as raw data in a scanpy object .

DAOl44732 commented 4 months ago

Thank you very much for your help. I have solved the problem.

fra-pcmgf commented 4 months ago

I'm happy to see the problem is solved!