Extremely Slow in `cal_ot_mat`

Fantasque68 commented 1 month ago

Dear developers,

I have a tiny anndata object which shape is 1570 × 19241. But when I ran cal_ot_mat as tutorial "Gene Trajectory Python tutorial: Human myeloid", the progress is extremelty slow and no progress bar is shown.

It cost me more than 90 minute without any result, and it still running. System monitor suggests all cores are fully occupied.

It's very weird for such a tiny dataset. Any help are greatly appreciated.

Here is my code.

from gene_trajectory.add_gene_bin_score import add_gene_bin_score
from gene_trajectory.coarse_grain import select_top_genes, coarse_grain_adata
from gene_trajectory.extract_gene_trajectory import get_gene_embedding, extract_gene_trajectory
from gene_trajectory.get_graph_distance import get_graph_distance
from gene_trajectory.gene_distance_shared import cal_ot_mat
from gene_trajectory.run_dm import run_dm
from gene_trajectory.plot.gene_trajectory_plots import plot_gene_trajectory_3d, plot_gene_trajectory_umap
from gene_trajectory.util.download_file import download_file_if_missing

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

# Prepare the input for gene-gene Wasserstein distance computation
genes = select_top_genes(NK, 
                         layer='counts', 
                         n_variable_genes=3000)
run_dm(NK)
cell_graph_dist = get_graph_distance(NK, k=10)
gene_expression_updated, graph_dist_updated = coarse_grain_adata(NK, 
                                                                 graph_dist=cell_graph_dist, 
                                                                 features=genes, 
                                                                 n=500)

gene_dist_mat = cal_ot_mat(gene_expr=gene_expression_updated, 
                           ot_cost=graph_dist_updated, 
                           show_progress_bar=True,
                           processes=26) # I tried 26 and default param both.

My anndata object,

AnnData object with n_obs × n_vars = 1570 × 19241
    obs: 'sample', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'n_genes', 'doublet_score', 'predicted_doublet', 'n_counts', 'leiden', 'annot', 'NK_anno', 'NK_anno_L3'
    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'NK_anno_L3_colors', 'NK_anno_colors', 'annot_colors', 'hvg', 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'pca', 'rank_genes_groups', 'sample_colors', 'scrublet', 'umap'
    obsm: 'X_pca', 'X_pca_harmony', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'

My packages,

gene-trajectory               1.0.4

Python version 3.9.19

fra-pcmgf commented 1 month ago

Unfortunately the computation of the earth-mover distance is time consuming, so there are two possible approaches

1) Reduce to top genes and coarse grain cells

This is the way used in the tutorial and the relevant code looks like

genes = select_top_genes(adata, layer='counts')
gene_expression_updated, graph_dist_updated = coarse_grain_adata(adata, graph_dist=cell_graph_dist, features=genes, dims=10)

2) Limit the cell-cell pairs for which the distance is calculated

The function cal_ot_mat accepts an optional parameter gene_pairs which allows to limit the number of entries of the cell-cell distance matrix that get computed.

For an in-depth discussion, please have a look at this article for the R version of the tool https://klugerlab.github.io/GeneTrajectory/articles/fast_computation.html.

Fantasque68 commented 1 month ago

Thank @fra-pcmgf.

However I have already reduce to top genes and coarse grain cells, as the tutorial did. The only difference is that I set n_variable_genes param to 3000, which is 500 originally.

genes = select_top_genes(..., n_variable_genes=3000)
# .......
gene_expression_updated, graph_dist_updated = coarse_grain_adata(..., features=genes, n=500)

In "Improve the computation efficiency of gene-gene distance matrix" documentation, it said that,

When the cell graph is large, the time cost for finding the optimal transport solution increases exponentially.

But currently this anndata object only has 1570 cells, and the number of gene is limited to 3000. Is this still a "large cell graph" as the doc indicated?

I'm just wondering that, is such a long calculation period is normal for dataset of this size, or some kinds of bug?

I'm not sure and have no idea about these questions.

fra-pcmgf commented 1 month ago

Can you try to run the mouse dermal tutorial?

It uses the defaults for select_top_genes (n_variable_genes=2000) and coarse_grain (n = 1000). In my experience the computation takes around 45 minutes using a Linux machine with 24 cpus, but could be faster/slower depending on the number of cores.

Fantasque68 commented 1 month ago

Sorry for the late response.

I have just ran the mouse tutorial, and all the parameters were kept same with the tutorial.

My cpus are intel Xeon E5 2680v4 × 2, all 28 cores are used in calculation process.

I manually stopped the cal_ot_mat process at 21%|██ | 145875/692076 [12:53<48:17, 188.49it/s]. It cost me 13m 37.9s to reach 21%, thus theoretically, the entire process should be around 60m. That is, for a dataset of shape 10328 × 32285, already performed coarse_grain_adata, selected 2000 top genes by select_top_genes, cal_ot_mat process will cost 1h to calculate.

So it's even more weird to spend more time on a smaller dataset, which shape is 1570 × 19241, with coarse_grain_adata and select_top_genes performed. Also, during the calculation of my own dataset, the progress bar never show. I doubt whether there are some bugs? 🤔

Any idea?

fra-pcmgf commented 1 month ago

The speed on the mouse tutorial (you can see the ETA of 48:17 in the progress bar) looks similar to mine, so I don't think there is any issue with package versions or your installation.

I'm not sure why the progress bar doesn't show. It could be because the parallel computation never starts (e.g. if the matrix is large and the machine runs out of memory or there are issues with Python's multiprocessing), but I can't think of why it should happen if you can run the tutorial which has a similar size.

Can you check that you are running cal_ot_mat with gene_expression_updated and graph_dist_updated as input and let me know the size of your input matrices?

For example running

print('gene_expression_updated:', gene_expression_updated.shape)
print('graph_dist_updated:', graph_dist_updated.shape)

gene_dist_mat = cal_ot_mat(gene_expr=gene_expression_updated, ot_cost=graph_dist_updated, show_progress_bar=True)

returns for the mouse tutorial

gene_expression_updated: (1000, 1177)
graph_dist_updated: (1000, 1000)

KlugerLab / GeneTrajectory-python

Extremely Slow in `cal_ot_mat` #15

1) Reduce to top genes and coarse grain cells

2) Limit the cell-cell pairs for which the distance is calculated