KlugerLab / GeneTrajectory

R implementation of GeneTrajectory
https://www.nature.com/articles/s41587-024-02186-3

Issue with compute gene-gene distances #4

Open chen-peng-1874 opened 2 months ago

chen-peng-1874 commented 2 months ago

I tried to set up a virtualenv using reticulate, but the module cannot be found. Here is the output:

    > cal_ot_mat_from_numpy <- reticulate::import('gene_trajectory.compute_gene_distance_cmd')$cal_ot_mat_from_numpy
    Error: C:/Users/Public/miniconda3/python310.dll - The specified module could not be found.

Should I use Python instead of R?

Fufu-Hu commented 2 months ago

Did you install the gene-trajectory module?

You can try the code below in R:

    reticulate::py_install("gene-trajectory")

chen-peng-1874 commented 2 months ago

Yes, I installed it, but the error is still present. I am wondering if there is something wrong with the Python DLL, although I do have python310.dll.

fra-pcmgf commented 2 months ago

Hi, I'm not sure what the issue is, but can you try to run reticulate::py_list_packages() and check the output? You should have a line like

14     gene-trajectory    1.0.0     gene-trajectory=1.0.0        pypi

If gene-trajectory is not listed, can you try installing it with reticulate::py_install("gene-trajectory", pip = TRUE)? The pip = TRUE option may be needed since we do not have a conda package for gene-trajectory.
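The same check can also be made from the Python side, in the interpreter that reticulate is configured to use. This is only a minimal sketch using the standard library; the helper name `installed_version` is made up for illustration and is not part of gene-trajectory or reticulate.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "gene-trajectory" is the PyPI name used in the py_install call above.
version = installed_version("gene-trajectory")
if version is None:
    print("gene-trajectory is not installed in this Python environment")
else:
    print(f"gene-trajectory {version} is installed")
```

If this reports that the package is missing while py_list_packages() shows it, the two are probably looking at different Python environments.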

DAOl44732 commented 2 months ago

    data_S <- GeneTrajectory::RunDM(data_S)
    cell.graph.dist <- GetGraphDistance(data_S, K = 10)
    cg_output <- CoarseGrain(data_S, cell.graph.dist, genes, N = 1000)

Hello, because the data is too big to run in R, can the gene distance computation above be run in Python instead?

fra-pcmgf commented 2 months ago

Yes, it's possible to export the data to a folder and run the computation in Python, as described in https://github.com/KlugerLab/GeneTrajectory/issues/3#issuecomment-2070566770

It may also be worth reducing the data size as explained in https://klugerlab.github.io/GeneTrajectory/articles/fast_computation.html

DAOl44732 commented 2 months ago

    data_S <- GeneTrajectory::RunDM(data_S)

Thank you for your reply. But the problem occurs at this step: the error says that the data is larger than 1000 GiB. Is there a good solution?

fra-pcmgf commented 2 months ago

I see, it's possible to do the whole analysis in Python (see e.g. https://github.com/KlugerLab/GeneTrajectory-python and https://genetrajectory-python.readthedocs.io/latest/notebooks/tutorial_mouse_dermal.html for a tutorial).

However, I am afraid you will encounter similar issues. Computing the diffusion map in RunDM creates a cell-cell distance matrix, whose size is quadratic in the number of cells and which requires a lot of memory and time to compute. How many cells do you have?
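To get a feel for the quadratic growth, here is a rough back-of-the-envelope sketch. It assumes a dense matrix of 8-byte floating-point entries; the actual memory used by RunDM may differ, and the function name `dense_matrix_gib` is just for illustration.

```python
def dense_matrix_gib(n_cells, bytes_per_entry=8):
    """Size in GiB of a dense n_cells x n_cells matrix (assumed 8 bytes per entry)."""
    return n_cells * n_cells * bytes_per_entry / 2**30


# Doubling the number of cells quadruples the memory needed.
for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} cells -> {dense_matrix_gib(n):10.1f} GiB")
```

Under these assumptions 10,000 cells need roughly 0.75 GiB for one such matrix, and every tenfold increase in cells multiplies that by 100.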

DAOl44732 commented 2 months ago

I see, I will try Python first. However, we have about 340,000 cells. Do you have any more suggestions?

fra-pcmgf commented 2 months ago

I would try randomly subsampling cells to a smaller number (~10k should be manageable, but you can probably do more), or partitioning the data if you have some meaningful metadata. You can then run RunDM and follow the rest of the pipeline (which will use CoarseGrain with N around 1000-2000, or a procedure like https://klugerlab.github.io/GeneTrajectory/articles/fast_computation.html to coarse-grain further).

Python and R should have similar performance, so use whichever makes the most sense for you.

It should be possible to subsample large datasets in a better way than at random, but we haven't investigated that yet. The method we use to coarse-grain cells, CoarseGrain, relies on having a cell-cell distance matrix. One could probably try a similar kNN-based approach on a simpler gene embedding that could handle data of your size, but we haven't tested it and it's hard to predict whether it would behave correctly.
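As a minimal sketch of the two options (random subsampling and partitioning by metadata), the standard-library code below only produces cell indices, which you would then use to subset your own data object. The function names `subsample_indices` and `partition_by_label` and the example labels are made up for illustration; they are not part of GeneTrajectory.

```python
import random
from collections import defaultdict


def subsample_indices(n_cells, n_keep, seed=0):
    """Pick n_keep distinct cell indices uniformly at random, returned sorted."""
    if n_keep >= n_cells:
        return list(range(n_cells))
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return sorted(rng.sample(range(n_cells), n_keep))


def partition_by_label(labels):
    """Group cell indices by a metadata label (e.g. sample or cell type)."""
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label].append(i)
    return dict(groups)


# Random subsample of ~10k cells out of a large dataset.
idx = subsample_indices(340_000, 10_000)
print(len(idx), "cells kept")

# Partition by a hypothetical per-cell metadata column.
parts = partition_by_label(["batch1", "batch2", "batch1", "batch2", "batch1"])
print({label: len(cells) for label, cells in parts.items()})
```

Each partition (or the random subsample) can then be run through the pipeline separately, as long as it is small enough for the cell-cell distance matrix to fit in memory.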

DAOl44732 commented 2 months ago

Thank you, I'll try your advice.

DAOl44732 commented 2 months ago

Can I use this code:

    dm_res = palantir.utils.run_diffusion_maps(ad, n_components=5)

instead of:

    run_dm(adata)

to calculate the intercellular distance?

fs-ravenbiosciences commented 2 months ago

I don't have experience with that package, but the implementation looks similar. I think you can try it as an alternative; just make sure to refer to the layer where the result is stored (our package uses "X_dm", so change it accordingly).