aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Other
163 stars 27 forks source link

plot Regulon different in plot_AUC_given_ax and sc.pl.umap #287

Open LILI-0000-0002-8173-7367 opened 5 months ago

LILI-0000-0002-8173-7367 commented 5 months ago

Hello, @SeppeDeWinter.

I was converting scplus_obj.pkl object to annData following GRN velocity tutorial.

import pandas as pd
import anndata
import sklearn
auc_key = 'eRegulon_AUC_filtered'
signature_keys = ['Gene_based', 'Region_based']
data_mat = pd.concat([scplus_obj.uns[auc_key][x].T for x in signature_keys])
selected_regulons = [x for x in data_mat.index if '_+' in x or '_-' in x]
data_mat = data_mat.loc[selected_regulons]
adata = anndata.AnnData(X=data_mat.T.copy())

However, I notice that the Regulon present very different in plot_AUC_given_ax and sc.pl.umap. The expression of Regulons in Adata are lower.

from scenicplus.dimensionality_reduction import plot_AUC_given_ax
fig, ax = plt.subplots(figsize = (8,8))
plot_AUC_given_ax(
    scplus_obj = scplus_obj,
    reduction_name = 'GEX_X_wnn.umap',
    feature = 'Nr3c2_extended_+_(49g)',
    ax = ax,
    auc_key = 'eRegulon_AUC_filtered',
    signature_key = 'Gene_based'   )
sns.despine(ax = ax)
plt.show()
sc.pl.umap(adata, color = 'Nr3c2_extended_+_(49g)')

image image

Does the inconsistence of Regulon expression affect the velocity prediction? Would you please point me the direction how to fix this? Thank you very much.

Best, Li

SeppeDeWinter commented 5 months ago

Hi @LILI-0000-0002-8173-7367

The plots indeed look very different, strange.

I don't immediately see something that is wrong with how you generated the AnnData object...

Can you compare the actual values for the specific regulon in the scplus_obj and the AnnData object?

For example


regulon = "Nr3c2_extended_+_(49g)"

scplus_obj_values_regulon = scplus_obj.uns["eRegulon_AUC_filtered"]["Gene_based"].loc[adata.obs_names,regulon]
adata_values_regulon = adata.to_df()[regulon]

print(scplus_obj_values_regulon)
print(adata_values_regulon)

# and/ or

scplus_obj_values_regulon == adata_values_regulon

# and / or

import numpy as np

np.allclose(scplus_obj_values_regulon, adata_values_regulon)

I hope this helps.

All the best,

Seppe

LILI-0000-0002-8173-7367 commented 5 months ago

Hello, Seppe.

Thanks for your reply.

Please see the attached result.

image

Best regards, Li Li

SeppeDeWinter commented 5 months ago

Hi Li Li

Indeed the values don't seem to match.

Not sure how this is possible, can you try recreating the AnnData and track where these changes might have occured?

All the best,

Seppe

bhhlee commented 4 months ago

Hi @SeppeDeWinter and @LILI-0000-0002-8173-7367

I am having the exact same error. I used the same code as Li Li's first post.

When I compare the actual values for a specific regulon in the scplus_obj and the AnnData object they are quite different.

regulon = "BATF_+_+_(79g)"

scplus_obj_values_regulon = scplus_obj.uns["eRegulon_AUC"]["Gene_based"].loc[adata.obs_names,regulon]
adata_values_regulon = adata.to_df()[regulon]

print(scplus_obj_values_regulon)
print(adata_values_regulon)

# and/ or
scplus_obj_values_regulon == adata_values_regulon

# and / or
import numpy as np
np.allclose(scplus_obj_values_regulon, adata_values_regulon)

image

Also, the difference between the scplus_obj and anndata values doesn't seem to be just lower, but rather follows a normal distribution of lower and higher values centered around 0.. (not sure why this would be - maybe there is some kind of normalization happening?)

diff_df = scplus_obj_values_regulon - adata_values_regulon
plt.hist(diff_df, density=True, bins=100) 

image

As a result, when I reanalyze the anndata and create a new embedding based on the anndata GRN AUC values, and plot the umap, all of the celltype annotations overlap whereas before they would separate nicely.

# Reanalyze this subset of cells
# Make PCA
sc.tl.pca(adata, use_highly_variable=True)
# Make diffusion map
print('DiffMap')
sc.pp.neighbors(adata)
sc.tl.diffmap(adata, random_state=5)
# Make umap
print('Umap')
sc.tl.umap(adata)
# Make tsne
print('TSNE')
sc.tl.tsne(adata)
# Make graph
print('Graph')
sc.tl.draw_graph(adata)

# Visualize
import matplotlib.pyplot as plt
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(10,10))
ax1_dict = sc.pl.umap(adata, color='GEX_celltype', legend_loc='on data', ax=ax1, show=False)
ax2_dict = sc.pl.tsne(adata, color='GEX_celltype', legend_loc='on data', ax=ax2, show=False)
ax3_dict = sc.pl.diffmap(adata, color='GEX_celltype', legend_loc='on data', ax=ax3, show=False)
ax4_dict = sc.pl.draw_graph(adata, color='GEX_celltype', legend_loc='on data', ax=ax4, show=False)

image

As to where these changes occur, I think it happens between the datamatrix creation and making the anndata object. When I create the data matrix with data_mat = data_mat.loc[selected_regulons] and compare that matrix to the scplus_obj matrix they are identical.

regulon = "BATF_+_+_(79g)"

scplus_obj_values_regulon = scplus_obj.uns["eRegulon_AUC"]["Gene_based"].loc[adata.obs_names,regulon]
data_mtx_values_regulon = data_mat.T.loc[adata.obs_names,regulon]

print(scplus_obj_values_regulon)
print(data_mtx_values_regulon)

# and/ or
scplus_obj_values_regulon == data_mtx_values_regulon

# and / or
import numpy as np
np.allclose(scplus_obj_values_regulon, data_mtx_values_regulon)

image

Not sure why this would happen as all we are doing is creating the anndata from that same data matrix...

Thanks again for all of your help.

Best, Brian

bhhlee commented 4 months ago

Hi @SeppeDeWinter and @LILI-0000-0002-8173-7367

I found out what's wrong!

The scplus_obj.uns['eRegulon_AUC'] dictionary has the cell indices sorted in alphabetical order that doesn't match the scplus_obj.metadata_cell.

import pandas as pd
import anndata
import sklearn
auc_key = 'eRegulon_AUC'
signature_keys = ['Gene_based', 'Region_based']
data_mat = pd.concat([scplus_obj.uns[auc_key][x].T for x in signature_keys])
selected_regulons = [x for x in data_mat.index if '+_+' in x]
data_mat = data_mat.loc[selected_regulons]
adata = anndata.AnnData(X=data_mat.T.copy())
adata.obs = scplus_obj.metadata_cell.copy()

image image

So the actual raw values and indices get copied in the correct order when creating the adata with adata = anndata.AnnData(X=data_mat.T.copy())

But, adata.obs = scplus_obj.metadata_cell.copy() will just rename the indices, without reordering the raw data.

I got around this with a simple fix - by concatenating the scplus_obj metadata to the original adata.obs with pd.concat() forcing the concatenation by matcing indices as shown below:

import pandas as pd
import anndata
import sklearn
auc_key = 'eRegulon_AUC'
signature_keys = ['Gene_based', 'Region_based']
data_mat = pd.concat([scplus_obj.uns[auc_key][x].T for x in signature_keys])
selected_regulons = [x for x in data_mat.index if '+_+' in x]
data_mat = data_mat.loc[selected_regulons]
adata = anndata.AnnData(X=data_mat.T.copy())
adata.obs = pd.concat([adata.obs, scplus_obj.metadata_cell], axis=1)

Not sure if I made a mistake earlier when generating the scplus_obj.uns['eRegulon_AUC'], but I pretty much just followed the 10X multiome pbmc and scenic step-by-step tutorial. If other people are having the same issue, it might be worth updating the trajectory tutorials.

Anyways hope this helps!

Best, Brian

SeppeDeWinter commented 3 months ago

@bhhlee

Aha! Thank you for figuring this out. I will keep this issue open for other people to find + to remind myself to make this clear in new tutorials.

All the best,

Seppe

Umaarasu commented 3 months ago

Hi @SeppeDeWinter and @LILI-0000-0002-8173-7367

I found out what's wrong!

The scplus_obj.uns['eRegulon_AUC'] dictionary has the cell indices sorted in alphabetical order that doesn't match the scplus_obj.metadata_cell.

import pandas as pd
import anndata
import sklearn
auc_key = 'eRegulon_AUC'
signature_keys = ['Gene_based', 'Region_based']
data_mat = pd.concat([scplus_obj.uns[auc_key][x].T for x in signature_keys])
selected_regulons = [x for x in data_mat.index if '+_+' in x]
data_mat = data_mat.loc[selected_regulons]
adata = anndata.AnnData(X=data_mat.T.copy())
adata.obs = scplus_obj.metadata_cell.copy()

image image

So the actual raw values and indices get copied in the correct order when creating the adata with adata = anndata.AnnData(X=data_mat.T.copy())

But, adata.obs = scplus_obj.metadata_cell.copy() will just rename the indices, without reordering the raw data.

I got around this with a simple fix - by concatenating the scplus_obj metadata to the original adata.obs with pd.concat() forcing the concatenation by matcing indices as shown below:

import pandas as pd
import anndata
import sklearn
auc_key = 'eRegulon_AUC'
signature_keys = ['Gene_based', 'Region_based']
data_mat = pd.concat([scplus_obj.uns[auc_key][x].T for x in signature_keys])
selected_regulons = [x for x in data_mat.index if '+_+' in x]
data_mat = data_mat.loc[selected_regulons]
adata = anndata.AnnData(X=data_mat.T.copy())
adata.obs = pd.concat([adata.obs, scplus_obj.metadata_cell], axis=1)

Not sure if I made a mistake earlier when generating the scplus_obj.uns['eRegulon_AUC'], but I pretty much just followed the 10X multiome pbmc and scenic step-by-step tutorial. If other people are having the same issue, it might be worth updating the trajectory tutorials.

Anyways hope this helps!

Best, Brian

Thanks for the Info. I was about to give up not knowing why myclusters were clumped. This helped :)