KlugerLab / GeneTrajectory

R implementation of GeneTrajectory
https://www.nature.com/articles/s41587-024-02186-3

There is no output, not even a warning, when I compute gene-gene distances with the function cal_ot_mat_from_numpy. #5

Open Fufu-Hu opened 7 months ago

Fufu-Hu commented 7 months ago

Hi!

I installed the gene_trajectory module with pip in a conda env. I can compute the gene-gene distances with the Seurat data from the GeneTrajectory tutorial, and the progress bar is shown on screen. But when I run it on my own Seurat data (36077 features across 482 samples), nothing appears on screen. The number of genes used to compute gene-gene distances is 481 and the number of meta-cells is 50. I ran "gene.dist.mat <- cal_ot_mat_from_numpy(ot_cost = cg_output[["graph.dist"]], gene_expr = cg_output[["gene.expression"]], num_iter_max = 50000, show_progress_bar = TRUE)" in R for at least 8 hours with no output, not even a progress bar. Is there something I missed?

Hoping for a reply~

fra-pcmgf commented 7 months ago

Hi @Fufu-Hu,

I am not sure what it could be. 1) Can you check whether anything is still running (e.g. using top or the Task Manager)? 2) Can you let me know the size of the objects (e.g. dim(cg_output[["graph.dist"]]), dim(cg_output[["gene.expression"]]))? I don't think it should be that slow if the size is 481x50, but it may be if you are using the full matrix. 3) Do you get any errors or notifications when you start the cal_ot_mat_from_numpy function?

panyuwen commented 3 months ago

I encountered a similar problem.

It has used >4000 CPU hours without showing a progress bar, in either Python or R. Machine info: Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz

The program seems to work on another machine with the same data: the progress bar appeared after ~30 CPU hours. Machine info: AMD Opteron(tm) Processor 6344

fra-pcmgf commented 3 months ago

Hi @panyuwen,

It's hard to know what is going wrong on one machine when it works on another.

panyuwen commented 3 months ago

Using a subset of my original data (17k cells x 10k genes) with default parameters, the gene.dist.mat step takes about 2500 CPU hours from start to finish. The progress bar appeared only during the final 6 minutes (so only 6 min were recorded on the bar).

Machine info: Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz; CentOS 7

fra-pcmgf commented 3 months ago

@panyuwen

Did you also select the top genes and coarse-grain the cells? The reference steps in the tutorial are:

genes = select_top_genes(adata, layer='counts')
gene_expression_updated, graph_dist_updated = coarse_grain_adata(adata, graph_dist=cell_graph_dist, features=genes, dims=10)

If so, what are the dimensions of gene_expression_updated and graph_dist_updated?
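To make the dimensions easy to report, here is a minimal sketch of a shape check. The helper name check_ot_input_shapes is hypothetical (it is not part of gene_trajectory), and it assumes the layout suggested by the shapes reported in this thread: gene_expression_updated is (cells, genes) and graph_dist_updated is a square (cells, cells) matrix.

```python
def check_ot_input_shapes(gene_expr_shape, graph_dist_shape):
    """Hypothetical helper: validate the shapes of the two OT inputs.

    Assumes gene_expr is laid out as (n_cells, n_genes) and graph_dist
    as (n_cells, n_cells). Returns (n_cells, n_genes).
    """
    n_cells, n_genes = gene_expr_shape
    # The cost matrix should have one row and one column per (meta-)cell.
    if tuple(graph_dist_shape) != (n_cells, n_cells):
        raise ValueError(
            f"graph_dist has shape {tuple(graph_dist_shape)}, "
            f"expected {(n_cells, n_cells)}"
        )
    return n_cells, n_genes


# Example with the shapes reported in this thread; in practice pass
# gene_expression_updated.shape and graph_dist_updated.shape.
n_cells, n_genes = check_ot_input_shapes((1000, 11352), (1000, 1000))
print(f"{n_cells} cells, {n_genes} genes")
```

A mismatch here would suggest that the coarse-graining step was skipped or that the full matrix is being passed instead of the coarse-grained one.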

panyuwen commented 3 months ago

Yes, I manually selected genes.

gene_expression_updated: (1000, 11352)
graph_dist_updated: (1000, 1000)

fra-pcmgf commented 3 months ago

11352 genes is a large number, and calculating the earth mover's distance is going to be very slow. Try using ~2000 genes, selected with select_top_genes or a similar approach.
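As a rough illustration of why the gene count matters, the sketch below counts unordered gene pairs for the gene numbers mentioned in this thread. It assumes one distance computation per gene pair, which is an assumption about the workload rather than a statement about the library's internals; n_gene_pairs is a hypothetical helper.

```python
def n_gene_pairs(n_genes: int) -> int:
    """Number of unordered gene pairs, i.e. n choose 2."""
    return n_genes * (n_genes - 1) // 2


# Gene counts taken from this thread: 481 (original report),
# ~2000 (suggested), 11352 (reported by panyuwen).
for n in (481, 2000, 11352):
    print(f"{n:>6} genes -> {n_gene_pairs(n):>12,} gene pairs")

# Relative workload of 11352 genes versus 2000 genes under this assumption.
ratio = n_gene_pairs(11352) / n_gene_pairs(2000)
print(f"11352 vs 2000 genes: ~{ratio:.0f}x more pairs")
```

Under this assumption the pair count grows quadratically, so going from 2000 to 11352 genes means roughly 32 times as many pairs. Actual run time also depends on the number of cells and on how quickly each problem converges.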