azizilab / decipher

A single-cell analysis toolkit to jointly analyze samples from distinct conditions

Missing information in the tutorial dataset. #4

Open ZixiangPAN opened 9 months ago

ZixiangPAN commented 9 months ago

Hi, I found that the example dataset in the tutorial (linked below) does not have gene names in the h5ad object. The problem shows up when running:

dc.tl.trajectories(
    adata,
    dc.tl.TConfig("Healthy", "AVP", "MPO", "origin", "Healthy"),
    dc.tl.TConfig("AML1", "AVP", "CD68", "origin", "AML1"),
)

dataset link: https://github.com/azizilab/decipher_data/data_decipher_tutorial.h5ad

Inside ../decipher/tools/trajectory_inference.py, the function find_cluster_with_marker filters out all the cells, so the resulting adata is empty.
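To make the "all cells filtered" symptom easier to reason about, here is a minimal sketch of a subsetting step that follows the docstring of find_cluster_with_marker quoted in this comment. It is an assumption-based stand-in, not the actual `_subset_cells_and_clusters` from decipher: the function name, the toy `obs` table, and the choice to count only subset cells against `min_cell_per_cluster` are all hypothetical.

```python
import pandas as pd


def subset_cells_and_clusters_sketch(
    obs,
    subset_column,
    subset_value,
    cluster_key,
    min_percent_per_cluster=0.3,
    min_cell_per_cluster=10,
):
    """Hypothetical stand-in for the subsetting step; NOT the decipher implementation."""
    # Boolean mask of the cells that belong to the requested subset.
    in_subset = obs[subset_column] == subset_value
    per_cluster = in_subset.groupby(obs[cluster_key])
    fraction = per_cluster.mean()  # share of each cluster that is in the subset
    count = per_cluster.sum()      # number of subset cells in each cluster
    # Assumption: a cluster survives only if it passes both thresholds.
    passes = (fraction >= min_percent_per_cluster) & (count >= min_cell_per_cluster)
    keep = fraction.index[passes]
    return obs[in_subset & obs[cluster_key].isin(keep)]


# Toy metadata: cluster "0" is all Healthy, cluster "1" is mostly AML1.
toy_obs = pd.DataFrame(
    {
        "origin": ["Healthy"] * 12 + ["AML1"] * 12 + ["Healthy"] * 2,
        "decipher_clusters": ["0"] * 12 + ["1"] * 14,
    }
)

kept = subset_cells_and_clusters_sketch(
    toy_obs, "origin", "Healthy", "decipher_clusters"
)
# A subset value that matches no cell (note the lowercase "h") leaves nothing.
none_kept = subset_cells_and_clusters_sketch(
    toy_obs, "origin", "healthy", "decipher_clusters"
)
```

Under these assumptions, an empty result can come from a `subset_value` that does not match any entry of the column, or from thresholds that no cluster meets, so checking `adata.obs[subset_column].unique()` may be a useful first step.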

`def find_cluster_with_marker(
    adata,
    marker,
    subset_column=None,
    subset_value=None,
    subset_min_percent_per_cluster=0.3,
    cluster_key="decipher_clusters",
    min_cell_per_cluster=10,
):
    """Find the cluster enriched for a marker gene. Possibly subset the cells before.

    Parameters
    ----------
    adata : sc.AnnData
        The annotated data matrix.
    marker : str
        The marker gene.
    subset_column : str, optional
        The column in `adata.obs` to subset on.
    subset_value : str, optional
        The value in subset_column to subset on.
    subset_min_percent_per_cluster : float, default 0.3
        When subsetting the cells, each cluster must have at least this proportion of cells from
        the subset to not be discarded. This is useful to remove clusters with too few cells from
        the subset.
    cluster_key : str, default "decipher_clusters"
        The key in `adata.obs` where the cluster information is stored.
    min_cell_per_cluster : int, default 10
        The minimum number of cells per cluster to consider it.
    """
    if subset_column is not None:
        adata = _subset_cells_and_clusters(
            adata,
            subset_column,
            subset_value,
            subset_min_percent_per_cluster=subset_min_percent_per_cluster,
            min_cell_per_cluster=min_cell_per_cluster,
            cluster_key=cluster_key,
        )
    marker_data = pd.DataFrame(adata[:, marker].X.toarray())
    marker_data["cluster"] = adata.obs[cluster_key].values
    # get the proportion of cells in each cluster that are in the subset
    marker_data = marker_data.groupby("cluster").mean()
    marker_data = marker_data.sort_values(by=0, ascending=False)
    return marker_data.index[0]`
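The final lines of the quoted function average the marker expression per cluster and return the cluster with the highest mean. The sketch below reproduces that step on plain lists with pandas, so it can be tried without an AnnData object. The helper name `top_cluster_by_marker`, the toy values, and the explicit empty-input check are illustrative assumptions, not part of decipher.

```python
import pandas as pd


def top_cluster_by_marker(expression, clusters):
    """Return the cluster label with the highest mean marker expression."""
    # Hypothetical guard: fail with a clear message when no cells are left.
    if len(expression) == 0:
        raise ValueError("no cells left after subsetting; cannot rank clusters")
    marker_data = pd.DataFrame(
        {"expression": list(expression), "cluster": list(clusters)}
    )
    # Mean marker expression per cluster, highest first.
    means = marker_data.groupby("cluster")["expression"].mean()
    return means.sort_values(ascending=False).index[0]


# Toy data: cluster "b" has the highest mean expression.
best = top_cluster_by_marker(
    [0.1, 0.2, 3.0, 2.5, 0.0],
    ["a", "a", "b", "b", "c"],
)
```

A guard of this kind would turn an empty selection into an explicit error message rather than a failure at the final indexing step.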

Please take a look, thank you.

Best

ANazaret commented 5 months ago

Hello, can you give more details about the problem? The example dataset does have gene names (in the attribute data.var_names).

Thanks!
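One quick way to narrow this down is to check whether the marker genes passed to `dc.tl.TConfig` appear among the gene names. The helper below is a small sketch in plain Python; `missing_markers` is a hypothetical name, and the toy name list is made up for illustration rather than taken from the tutorial dataset. With a loaded AnnData object, `adata.var_names` could be passed as the first argument.

```python
def missing_markers(var_names, markers):
    """Return the markers that do not appear in var_names, in input order."""
    known = {str(name) for name in var_names}
    return [marker for marker in markers if marker not in known]


# Toy stand-in for adata.var_names (illustrative values only).
toy_var_names = ["AVP", "MPO", "CD34"]
absent = missing_markers(toy_var_names, ["AVP", "MPO", "CD68"])
```

An empty list would suggest the gene names are present and that the filtering comes from the subsetting step instead.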