merge_clusters error: max() arg is an empty sequence

omarOMF commented 2 months ago

Hi, Thank you for developing this helpful package! I am currently trying to optimize the number of clusters for a cell population using your package, but I've encountered an error that I'm having trouble interpreting. Here's the sequence of functions I ran:

cc.tl.get_markers(adata, 'leiden_05')
cc.tl.code_enrich(adata, 'leiden_05')
cc.pl.enrich_heatmap(adata, 'leiden_05')

However, I encounter the following error when I attempt to merge clusters. Could you please help me understand what might be going wrong? Here's the error message I received:

cc.tl.merge_clusters(adata, 'leiden_05')

Initial merge. WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var' /home/FCAM/xxx/miniconda3/envs/scv/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:420: RuntimeWarning: invalid value encountered in log2 self.stats[group_name, 'logfoldchanges'] = np.log2( /home/FCAM/xxx/miniconda3/envs/scv/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:420: RuntimeWarning: invalid value encountered in log2 self.stats[group_name, 'logfoldchanges'] = np.log2( /home/FCAM/xxx/miniconda3/envs/scv/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:420: RuntimeWarning: invalid value encountered in log2 self.stats[group_name, 'logfoldchanges'] = np.log2( /home/FCAM/xxx/miniconda3/envs/scv/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:420: RuntimeWarning: invalid value encountered in log2 self.stats[group_name, 'logfoldchanges'] = np.log2( /home/FCAM/xxx/miniconda3/envs/scv/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:420: RuntimeWarning: invalid value encountered in log2 self.stats[group_name, 'logfoldchanges'] = np.log2( /home/FCAM/xxx/miniconda3/envs/scv/lib/python3.9/site-packages/scanpy/tools/_rank_genes_groups.py:420: RuntimeWarning: invalid value encountered in log2 self.stats[group_name, 'logfoldchanges'] = np.log2( WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var' /home/FCAM/xxx/miniconda3/envs/scv/lib/python3.9/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice. return _methods._mean(a, axis=axis, dtype=dtype, /home/FCAM/xxx/miniconda3/envs/scv/lib/python3.9/site-packages/numpy/core/_methods.py:181: RuntimeWarning: invalid value encountered in true_divide ret = um.true_divide( /home/FCAM/xxxx/miniconda3/envs/scv/lib/python3.9/site-packages/scanpy/preprocessing/_utils.py:11: RuntimeWarning: Mean of empty slice. mean_sq = np.multiply(X, X).mean(axis=axis, dtype=np.float64)

ValueError Traceback (most recent call last) Cell In[15], line 1 ----> 1 cc.tl.merge_clusters(adata, 'leiden_05')

File ~/miniconda3/envs/scv/lib/python3.9/site-packages/cytocipher/score_and_merge/cluster_merge.py:580, in merge_clusters(data, groupby, var_groups, n_top_genes, t_cutoff, marker_padj_cutoff, gene_order, min_de, enrich_method, p_cut, max_iter, mnn_frac_cutoff, k, random_state, n_cpus, score_group_method, p_adjust, p_adjust_method, squash_exception, verbose) 574 get_markers(data, f'{groupby}_merged', n_top=n_top_genes, verbose=False, 575 var_groups=var_groups, t_cutoff=t_cutoff, 576 padj_cutoff=marker_padj_cutoff, 577 gene_order=gene_order, min_de=min_de) 579 # Running the enrichment scoring # --> 580 run_enrich(data, f'{groupby}_merged', enrich_method, n_cpus, 581 squash_exception=squash_exception) 583 if verbose: 584 print(f"Added data.obs[f'{groupby}_merged']")

File ~/miniconda3/envs/scv/lib/python3.9/site-packages/cytocipher/score_and_merge/cluster_merge.py:419, in run_enrich(data, groupby, enrich_method, n_cpus, squash_exception) 415 raise Exception( 416 f"Got enrich_method={enrich_method}; expected one of : {enrich_options}") 418 if enrich_method == 'code': --> 419 code_enrich(data, groupby, n_cpus=n_cpus, verbose=False, 420 squash_exception=squash_exception) 421 elif enrich_method == 'coexpr': 422 coexpr_enrich(data, groupby, n_cpus=n_cpus, verbose=False)

File ~/miniconda3/envs/scv/lib/python3.9/site-packages/cytocipher/score_and_merge/cluster_score.py:593, in code_enrich(data, groupby, cluster_marker_key, n_cpus, min_counts, squash_exception, verbose) 590 [all_genes.extend(cluster_genes_dict[cluster]) 591 for cluster in cluster_genes_dict] 592 # Getting correct typing --> 593 str_dtype = f"<U{max([len(gene_name) for gene_name in all_genes])}" 594 all_genes = np.unique( all_genes ).astype(str_dtype) 596 str_dtype_clust = f"<U{max([len(clust) for clust in cluster_genes_dict])}"

ValueError: max() arg is an empty sequence

BradBalderson commented 2 months ago

Hey Omar,

If you check adata.uns['leiden_05_markers'], one or more of the cluster will have no marker genes, which I think is causing this issue.

Perhaps try recalling the markers, using: cc.tl.get_markers(adata, 'leiden_05', min_markers=1)

Hopefully there is this option, otherwise I may need to update the pypi package!

omarOMF commented 2 months ago

Thank you for your response. I tried cc.tl.get_markers(adata, 'leiden_05', min_de=1) and still running into the same error. I didn't find min_markers in the function unless I am using a different version. This is what get ` cc.tl.get_markers?

Signature:
cc.tl.get_markers(
    data: anndata._core.anndata.AnnData,
    groupby: str,
    var_groups: str = None,
    logfc_cutoff: float = 0,
    padj_cutoff: float = 0.05,
    t_cutoff: float = 3,
    n_top: int = 5,
    rerun_de: bool = True,
    gene_order=None,
    pts: bool = False,
    min_de: int = 0,
    verbose: bool = True,
)
Docstring:
Gets marker genes per cluster.

Parameters
    ----------
    data: sc.AnnData
        Single cell RNA-seq anndata, QC'd a preprocessed to log-cpm in
                                                                     data.X.
    groupby: str
        Specifies the clusters to perform one-versus-rest Welch's t-test
        comparison of genes for.
        Must specify defined column in data.obs[groupby].
        Must be categorical type.
    var_groups: str
        Specifies a column in data.var of type boolean, with True indicating
        the candidate genes to use when determining marker genes per cluster.
        Useful to, for example, remove ribosomal and mitochondrial genes.
        None indicates use all genes in data.var_names as candidates.
    logfc_cutoff: float
        Minimum logfc for a gene to be a considered a marker gene for a
        given cluster.
    marker_padj_cutoff: float
        Adjusted p-value (Benjamini-Hochberg correction) below which a gene
        can be considered a marker gene.
    t_cutoff: float
        The minimum t-value a gene must have to be considered a marker gene
        (Welch's t-statistic with one-versus-rest comparison).
    n_top: int
        The maximimum no. of marker genes per cluster.
    rerun_de: bool
        Whether to rerun the DE analysis, or using existing results in
        data.uns['rank_genes_groups']. Useful if have ran get_markers()
        with the same 'groupby' as input, but want to adjust the other
        parameters to determine marker genes.
    gene_order: str
        By default, gets n_top qualifying genes ranked by t-value.
        Specifying logfc here will rank by log-FC, instead.
    pts: bool
        Whether to calculate percentage cells expressing gene within/without
        of each cluster. Only relevant if rerun_de=True.
    min_de: int
        Minimum number of genes to use as markers, if not criteria met.
    verbose: bool
        Print statements during computation (True) or silent run (False).
    Returns
    --------
        data.uns[f'{groupby}_markers']
            Dictionary with cluster names as keys, and list of marker
            genes as values.
File:      ~/miniconda3/envs/scv/lib/python3.9/site-packages/cytocipher/score_and_merge/cluster_score.py
Type:      function

omarOMF commented 2 months ago

Also I can see there are markers selected per cluster.

caps.uns['leiden_05_markers']
{'0': array(['MT2A', 'TMSB10', 'CLDN5', 'NEAT1', 'IFITM3', 'MT1E'], dtype=object),
 '1': array(['AC083867.2', 'AC138647.2', 'AL359380.1', 'KCND3-AS1',
        'AC012555.2', 'GNG14'], dtype=object),
 '2': array(['GPCPD1', 'SPOCK3', 'SLCO1A2', 'SLC39A10', 'ABCG2', 'CA4'],
       dtype=object),
 '3': array(['OMD', 'TMEM45B', 'ATP10A', 'SLC39A10', 'ABCB1', 'THSD4'],
       dtype=object),
 '4': array(['ANO2', 'GALNT18', 'VWF', 'SCARB1', 'TNS1', 'HIF1A-AS3'],
       dtype=object),
 '5': array(['FAM43A', 'AC005083.1', 'AL928596.1', 'AC091078.2', 'BX664730.1',
        'AC083867.2'], dtype=object)}

BradBalderson commented 1 month ago

Hmm cluster markers dict looks correct, and I am looking at the source code and cannot see how the code would be throwing that error, given that there are genes in that dictionary.

Looks like the anndata object is called 'caps' in the example above? Only way I can see it would throw that error is if the marker dictionary was empty when running cc.tl.code_enrich(adata, 'leiden_05')

Are you running:

cc.tl.code_enrich(caps, 'leiden_05')

If you still get the same error, could you try running this snippet and let me know what it looks like?

    cluster_genes_dict = caps.uns['leiden_05_markers']

    # Putting all genes into array for speed.
    all_genes = []
    [all_genes.extend(cluster_genes_dict[cluster])
                                              for cluster in cluster_genes_dict]
    # Getting correct typing
    str_dtype = f"<U{max([len(gene_name) for gene_name in all_genes])}"
    all_genes = np.unique( all_genes ).astype(str_dtype)

    str_dtype_clust = f"<U{max([len(clust) for clust in cluster_genes_dict])}"

In particular, could you send the result of all_genes?

BradBalderson / Cytocipher

merge_clusters error: max() arg is an empty sequence #4