Full overlap of + and - gene sets detected and suggested to increase number of marker genes for scoring

YiweiNiu commented 1 month ago

Hi,

Thanks for this useful tool! I tried to apply Cytocipher to my data but got an error in step cc.tl.merge_clusters

Cell In[10], line 1
----> 1 cc.tl.merge_clusters(rna, new_anno, n_cpus=4)

File [/work/home/software/anaconda3/envs/scanpy/lib/python3.10/site-packages/cytocipher/score_and_merge/cluster_merge.py:526](https://app-5085186-0.cloud.sdu.dk/lab/tree/DevM_analysis/01.annotation/11.subclustering/HSC/home/software/anaconda3/envs/scanpy/lib/python3.10/site-packages/cytocipher/score_and_merge/cluster_merge.py#line=525), in merge_clusters(data, groupby, var_groups, n_top_genes, t_cutoff, marker_padj_cutoff, gene_order, min_de, enrich_method, p_cut, max_iter, mnn_frac_cutoff, k, random_state, n_cpus, score_group_method, p_adjust, p_adjust_method, squash_exception, verbose)
    520     print( "Initial merge." )
    522 get_markers(data, groupby, n_top=n_top_genes, verbose=False,
    523             var_groups=var_groups, t_cutoff=t_cutoff,
    524             padj_cutoff=marker_padj_cutoff,
    525             gene_order=gene_order, min_de=min_de)
--> 526 run_enrich(data, groupby, enrich_method, n_cpus,
    527            squash_exception=squash_exception)
    529 old_labels = data.obs[groupby].values.astype(str)
    531 merge_clusters_single(data, groupby, f'{groupby}_merged',
    532                       k=k, mnn_frac_cutoff=mnn_frac_cutoff, random_state=random_state,
    533                       p_cut=p_cut,
    534                       score_group_method=score_group_method,
    535                       p_adjust=p_adjust, p_adjust_method=p_adjust_method,
    536                       verbose=False)

File [/work/home/software/anaconda3/envs/scanpy/lib/python3.10/site-packages/cytocipher/score_and_merge/cluster_merge.py:419](https://app-5085186-0.cloud.sdu.dk/lab/tree/DevM_analysis/01.annotation/11.subclustering/HSC/home/software/anaconda3/envs/scanpy/lib/python3.10/site-packages/cytocipher/score_and_merge/cluster_merge.py#line=418), in run_enrich(data, groupby, enrich_method, n_cpus, squash_exception)
    415     raise Exception(
    416    f"Got enrich_method={enrich_method}; expected one of : {enrich_options}")
    418 if enrich_method == 'code':
--> 419     code_enrich(data, groupby, n_cpus=n_cpus, verbose=False,
    420                 squash_exception=squash_exception)
    421 elif enrich_method == 'coexpr':
    422     coexpr_enrich(data, groupby, n_cpus=n_cpus, verbose=False)

File [/work/home/software/anaconda3/envs/scanpy/lib/python3.10/site-packages/cytocipher/score_and_merge/cluster_score.py:628](https://app-5085186-0.cloud.sdu.dk/lab/tree/DevM_analysis/01.annotation/11.subclustering/HSC/home/software/anaconda3/envs/scanpy/lib/python3.10/site-packages/cytocipher/score_and_merge/cluster_score.py#line=627), in code_enrich(data, groupby, cluster_marker_key, n_cpus, min_counts, squash_exception, verbose)
    624 error = "Full overlap of + and - gene sets detected " + \
    625         f"for {cluster} and {clusterj}; suggested to " + \
    626         f"increase number of marker genes for scoring."
    627 if not squash_exception:
--> 628     raise Exception(error)
    629 else:
    630     print(error)

Exception: Full overlap of + and - gene sets detected for 14 and 16; suggested to increase number of marker genes for scoring.

It suggested to increase the number of marker genes for scoring. For now I'm using 15 and based on this thread, you mentioned that using 5-8 genes is appropriate. So, should I increase the number of marker genes for scoring? or is there another workaround for this?

Thnaks!

BradBalderson commented 1 month ago

Hi! Thanks alot for giving it ago! You can bypass this exception by:

cc.tl.merge_clusters(rna, new_anno, n_cpus=4, squash_exception=True)

I had the default behaviour to throw this exception, to flag there were no marker gene differences between two clusters, and that perhaps increasing the number of marker genes could reveal an additional gene that would differentiate the clusters.

I suppose you can try both approaches (increase marker genes and squash_exception=True) to see if it changes the results.

YiweiNiu commented 1 month ago

Hi, thank you very much for your quick reply!

I've tried both (increase marker genes to 50 and squash_exception=True), but there's a different issue

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[53], line 1
----> 1 cc.tl.merge_clusters(rna, new_anno, n_cpus=4, squash_exception=True)

File [/work/home/software/anaconda3/envs/scanpy/lib/python3.10/site-packages/cytocipher/score_and_merge/cluster_merge.py:580](https://app-5085186-0.cloud.sdu.dk/lab/tree/DevM_analysis/01.annotation/11.subclustering/HSC/home/software/anaconda3/envs/scanpy/lib/python3.10/site-packages/cytocipher/score_and_merge/cluster_merge.py#line=579), in merge_clusters(data, groupby, var_groups, n_top_genes, t_cutoff, marker_padj_cutoff, gene_order, min_de, enrich_method, p_cut, max_iter, mnn_frac_cutoff, k, random_state, n_cpus, score_group_method, p_adjust, p_adjust_method, squash_exception, verbose)
    574 get_markers(data, f'{groupby}_merged', n_top=n_top_genes, verbose=False,
    575             var_groups=var_groups, t_cutoff=t_cutoff,
    576             padj_cutoff=marker_padj_cutoff,
    577             gene_order=gene_order, min_de=min_de)
    579 # Running the enrichment scoring #
--> 580 run_enrich(data, f'{groupby}_merged', enrich_method, n_cpus,
    581            squash_exception=squash_exception)
    583 if verbose:
    584     print(f"Added data.obs[f'{groupby}_merged']")

File [/work/home/software/anaconda3/envs/scanpy/lib/python3.10/site-packages/cytocipher/score_and_merge/cluster_merge.py:419](https://app-5085186-0.cloud.sdu.dk/lab/tree/DevM_analysis/01.annotation/11.subclustering/HSC/home/software/anaconda3/envs/scanpy/lib/python3.10/site-packages/cytocipher/score_and_merge/cluster_merge.py#line=418), in run_enrich(data, groupby, enrich_method, n_cpus, squash_exception)
    415     raise Exception(
    416    f"Got enrich_method={enrich_method}; expected one of : {enrich_options}")
    418 if enrich_method == 'code':
--> 419     code_enrich(data, groupby, n_cpus=n_cpus, verbose=False,
    420                 squash_exception=squash_exception)
    421 elif enrich_method == 'coexpr':
    422     coexpr_enrich(data, groupby, n_cpus=n_cpus, verbose=False)

File [/work/home/software/anaconda3/envs/scanpy/lib/python3.10/site-packages/cytocipher/score_and_merge/cluster_score.py:593](https://app-5085186-0.cloud.sdu.dk/lab/tree/DevM_analysis/01.annotation/11.subclustering/HSC/home/software/anaconda3/envs/scanpy/lib/python3.10/site-packages/cytocipher/score_and_merge/cluster_score.py#line=592), in code_enrich(data, groupby, cluster_marker_key, n_cpus, min_counts, squash_exception, verbose)
    590 [all_genes.extend(cluster_genes_dict[cluster])
    591                                           for cluster in cluster_genes_dict]
    592 # Getting correct typing
--> 593 str_dtype = f"<U{max([len(gene_name) for gene_name in all_genes])}"
    594 all_genes = np.unique( all_genes ).astype(str_dtype)
    596 str_dtype_clust = f"<U{max([len(clust) for clust in cluster_genes_dict])}"

ValueError: max() arg is an empty sequence

BradBalderson commented 1 month ago

Hey @YiweiNiu, this seems to be the same issue as #4. Wondering if you could try running this code snippet as I suggested in that thread:

    cluster_genes_dict = caps.uns[f'{new_anno}_markers']

    # Putting all genes into array for speed.
    all_genes = []
    [all_genes.extend(cluster_genes_dict[cluster])
                                              for cluster in cluster_genes_dict]
    # Getting correct typing
    str_dtype = f"<U{max([len(gene_name) for gene_name in all_genes])}"
    all_genes = np.unique( all_genes ).astype(str_dtype)

    str_dtype_clust = f"<U{max([len(clust) for clust in cluster_genes_dict])}"

In particular, could you send the result of all_genes?

I have been using Cytocipher quite alot lately and don't get this error, so get a feeling is something dataset dependent, or perhaps an issue with the current version I have on pip.

BradBalderson / Cytocipher

Full overlap of + and - gene sets detected and suggested to increase number of marker genes for scoring #5