Open Fantasque68 opened 1 month ago
Unfortunately the computation of the earth-mover distance is time consuming, so there are two possible approaches
This is the way used in the tutorial and the relevant code looks like
genes = select_top_genes(adata, layer='counts')
gene_expression_updated, graph_dist_updated = coarse_grain_adata(adata, graph_dist=cell_graph_dist, features=genes, dims=10)
The function cal_ot_mat
accepts an optional parameter gene_pairs
which allows to limit the number of entries of the cell-cell distance matrix that get computed.
For an in-depth discussion, please have a look at this article for the R version of the tool https://klugerlab.github.io/GeneTrajectory/articles/fast_computation.html.
Thank @fra-pcmgf.
However I have already reduce to top genes and coarse grain cells, as the tutorial did. The only difference is that I set n_variable_genes
param to 3000, which is 500 originally.
genes = select_top_genes(..., n_variable_genes=3000)
# .......
gene_expression_updated, graph_dist_updated = coarse_grain_adata(..., features=genes, n=500)
In "Improve the computation efficiency of gene-gene distance matrix" documentation, it said that,
When the cell graph is large, the time cost for finding the optimal transport solution increases exponentially.
But currently this anndata
object only has 1570 cells, and the number of gene is limited to 3000. Is this still a "large cell graph" as the doc indicated?
I'm just wondering that, is such a long calculation period is normal for dataset of this size, or some kinds of bug?
I'm not sure and have no idea about these questions.
Can you try to run the mouse dermal tutorial?
It uses the defaults for select_top_genes
(n_variable_genes=2000) and coarse_grain
(n = 1000). In my experience the computation takes around 45 minutes using a Linux machine with 24 cpus, but could be faster/slower depending on the number of cores.
Sorry for the late response.
I have just ran the mouse tutorial, and all the parameters were kept same with the tutorial.
My cpus are intel Xeon E5 2680v4 × 2, all 28 cores are used in calculation process.
I manually stopped the cal_ot_mat process
at
21%|██ | 145875/692076 [12:53<48:17, 188.49it/s]
.
It cost me 13m 37.9s to reach 21%, thus theoretically, the entire process should be around 60m. That is, for a dataset of shape 10328 × 32285, already performed coarse_grain_adata
, selected 2000 top genes by select_top_genes
, cal_ot_mat process
will cost 1h to calculate.
So it's even more weird to spend more time on a smaller dataset, which shape is 1570 × 19241, with coarse_grain_adata
and select_top_genes
performed. Also, during the calculation of my own dataset, the progress bar never show. I doubt whether there are some bugs? 🤔
Any idea?
The speed on the mouse tutorial (you can see the ETA of 48:17 in the progress bar) looks similar to mine, so I don't think there is any issue with package versions or your installation.
I'm not sure why the progress bar doesn't show. It could be because the parallel computation never starts (e.g. if the matrix is large and the machine runs out of memory or there are issues with Python's multiprocessing), but I can't think of why it should happen if you can run the tutorial which has a similar size.
Can you check that you are running cal_ot_mat
with gene_expression_updated
and graph_dist_updated
as input and let me know the size of your input matrices?
For example running
print('gene_expression_updated:', gene_expression_updated.shape)
print('graph_dist_updated:', graph_dist_updated.shape)
gene_dist_mat = cal_ot_mat(gene_expr=gene_expression_updated, ot_cost=graph_dist_updated, show_progress_bar=True)
returns for the mouse tutorial
gene_expression_updated: (1000, 1177)
graph_dist_updated: (1000, 1000)
Dear developers,
I have a tiny anndata object which shape is 1570 × 19241. But when I ran
cal_ot_mat
as tutorial "Gene Trajectory Python tutorial: Human myeloid", the progress is extremelty slow and no progress bar is shown.It cost me more than 90 minute without any result, and it still running. System monitor suggests all cores are fully occupied.
It's very weird for such a tiny dataset. Any help are greatly appreciated.
Here is my code.
My
anndata
object,My packages,
Python version
3.9.19