aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0
448 stars 183 forks source link

[ERROR] modules_from_adjacencies: exprMat error #347

Closed cliftonlewis closed 2 years ago

cliftonlewis commented 2 years ago

Describe the bug A clear and concise description of what the bug is.

Steps to reproduce the behavior

  1. Command run when the error occurred:

    from pyscenic.utils import modules_from_adjacencies
    modules = list(modules_from_adjacencies(adjacencies, exprMat))
  2. Error encountered:

    
    AssertionError                            Traceback (most recent call last)
    <ipython-input-39-235636e9fdb2> in <module>
      1 from pyscenic.utils import modules_from_adjacencies
    ----> 2 modules = list(modules_from_adjacencies(adjacencies, exprMat))

/home/shared/anaconda3/envs/scenic3/lib/python3.6/site-packages/pyscenic/utils.py in modules_from_adjacencies(adjacencies, ex_mtx, thresholds, top_n_targets, top_n_regulators, min_genes, absolute_thresholds, rho_dichotomize, keep_only_activating, rho_threshold, rho_mask_dropouts) 267 # test for genes present in the adjacencies but not present in the expression matrix: 268 unique_adj_genes = set(adjacencies[COLUMN_NAME_TF]).union(set(adjacencies[COLUMN_NAME_TARGET])) - set(ex_mtx.columns) --> 269 assert len(unique_adj_genes)==0, f"Found {len(unique_adj_genes)} genes present in the network (adjacencies) output, but missing from the expression matrix. Is this a different gene expression matrix?" 270 LOGGER.warn(f"Note on correlation calculation: the default behaviour for calculating the correlations has changed after pySCENIC verion 0.9.16. Previously, the default was to calculate the correlation between a TF and target gene using only cells with non-zero expression values (mask_dropouts=True). The current default is now to use all cells to match the behavior of the R verision of SCENIC. The original settings can be retained by setting 'rho_mask_dropouts=True' in the modules_from_adjacencies function, or '--mask_dropouts' from the CLI.\n\tDropout masking is currently set to [{rho_mask_dropouts}].") 271 adjacencies = add_correlation(adjacencies, ex_mtx,

AssertionError: Found 1 genes present in the network (adjacencies) output, but missing from the expression matrix. Is this a different gene expression matrix?



**Expected behavior**
A clear and concise description of what you expected to happen.
I have run the code as per the previous pipeline so I don't really understand the error to well as it was using the same expression matrix. I have seen a previous issue #103 which discusses the same problem and it seemed at that time it was an issue with arboreto which was then fixed so I'm not too sure about the source of my error. Thank you in advance for all your help. 
**Please complete the following information:**
- pySCENIC version: 0.11.1
- Installation method: Conda
- Run environment: Jupyter notebook
- OS: Linux
cflerin commented 2 years ago

@cliftonlewis,

This error means that there was a gene found in the network output (adjacencies file) that is not present in your expression matrix. You must be using a different expression matrix for this step vs. the network inference step -- there's no other explanation (ok, there could a weird bug in this check step).

You have two options: 1) re-run the network inference on your new expression matrix, or 2) remove the extra gene(s) from your adjacencies file (take all genes in the TF and Target columns of your adjacencies and compare with your expression matrix genes, find the one that shouldn't be there, and remove it from the network output file). I would strongly suggest option 1 since there's no way to know what else changed with the expression matrix.

cliftonlewis commented 2 years ago

Thank you very much @cflerin

I have spent the last week trying every which to redo the network inference and I'm still not able to get this working. I tried the other option of just removing the gene of issue and it does fix it but as you said that's not the ideal method. I'm still not sure why this would be the case and I double-checked that the expression matrix is the right one etc. I may just have to forgo this last step for visualisation

hyjforesight commented 2 years ago

Hello @cflerin @cliftonlewis , Same issue here. All procedures start from ADJACENCIES_FNAME and I exported the final result into loom file LOOM_FNAME.

export2loom(ex_mtx=adata.to_df(), regulons=regulons, out_fname=LOOM_FNAME,
            cell_annotations=adata.obs['pca_leiden'].to_dict(), tree_structure=(),
            title='{}_{}'.format(TCGA_CODE, DATASET_ID), nomenclature="mm10", num_workers=16,
            embeddings=OrderedDict([('AUCell + umap', embedding_aucell_umap), ('PCA + umap', embedding_pca_umap)]),
            auc_mtx = auc_mtx, auc_thresholds=thresholds,
            compress=True)

Then I reload the ADJACENCIES_FNAME into adjacencies. exprMat is also loaded from LOOM_FNAME, but running modules_from_adjacencies creates errors.

adjacencies = pd.read_csv(ADJACENCIES_FNAME, index_col=False, sep='\t')
from pyscenic.utils import modules_from_adjacencies
lf = lp.connect(LOOM_FNAME, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
lf.close()

AssertionError: Found 7350 genes present in the network (adjacencies) output, but missing from the expression matrix. Is this a different gene expression matrix?

I also submit this issue here. https://github.com/aertslab/pySCENIC/issues/355

SeppeDeWinter commented 2 years ago

@hyjforesight could you show part of the data contained in the adjacencies and exprMat variable, for example using the head command.

Using the following command you can get the genes which are only present in the adjacencies but not in the expression matrix. This might help with troubleshooting as well.


unique_adj_genes = set(adjacencies["TF"]).union(set(adjacencies["target"])) - set(ex_mtx.columns)
hyjforesight commented 2 years ago

Hello @SeppeDeWinter , Thanks for the recommendations. Sorry for the late response, I was busy with some wet experimental things.

By calling adjacencies and exprMat, I realize that adjacencies was generated from the whole genes, whereas exprMat was generated from highly variable genes. I did a view of my adata and use this view for generating UMAP and then embed with the AUCell. Please see below.

Because it's a common way to only use the highly variable genes for generating a UMAP, but GRNBoost2 requires the whole genes, the gene numbers of adjacencies and exprMat will not be matched. In this case, how do we solve the AssertionError: Found 7350 genes present in the network (adjacencies) output, but missing from the expression matrix. Is this a different gene expression matrix error? Thanks! Best, YJ

f_loom_path_scenic includes the whole genes for GRNBoost2, generating adjacencies.

# preprocessing by eliminating some unwanted cells, all genes are included
sc.pl.highest_expr_genes(adata, n_top=20)
sc.pp.filter_cells(adata, min_genes=100)
sc.pp.filter_genes(adata, min_cells=25)
adata.var['mt'] = adata.var_names.str.startswith('mt-')
adata.var['rpl'] = adata.var_names.str.startswith('Rpl')
adata.var['rps'] = adata.var_names.str.startswith('Rps')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt','rpl','rps'], percent_top=None, log1p=False, inplace=True)
sc.pl.violin(adata, keys=['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_rpl','pct_counts_rps'], jitter=0.4, multi_panel=True)
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='pct_counts_rpl')
sc.pl.scatter(adata, x='total_counts', y='pct_counts_rps')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')
adata = adata[adata.obs.n_genes_by_counts < 6000, :]
adata = adata[adata.obs.pct_counts_mt < 50, :]
adata = adata[adata.obs.pct_counts_rpl < 50, :]
adata = adata[adata.obs.pct_counts_rps < 50, :]

# write the preprocessed data into f_loom_path_scenic
row_attrs = {"Gene": np.array(adata.var_names)}
col_attrs = {"CellID": np.array(adata.obs_names), "nGene": np.array(np.sum(adata.X.transpose()>0, axis=0)).flatten(),
             "nUMI": np.array(np.sum(adata.X.transpose(), axis=0)).flatten()}
lp.create(f_loom_path_scenic, adata.X.transpose(), row_attrs, col_attrs)

# use f_loom_path_scenic for GRNBoost2
!pyscenic grn {f_loom_path_scenic} {Mouse_TFS_FNAME} -o {ADJACENCIES_FNAME} --num_workers 16

LOOM_FNAME only includes the highly_variable_genes, generating exprMat.

# do a view of adata by highly_variable_genes and generate UMAP
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=5000)
sc.pl.highly_variable_genes(adata)
print(sum(adata.var.highly_variable))
adata.raw=adata
adata = adata[:, adata.var.highly_variable]
sc.pp.regress_out(adata, keys=['pct_counts_mt','pct_counts_rpl','pct_counts_rps'], n_jobs=16)
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata, n_pcs=50, knn=True)
sc.tl.leiden(adata, resolution=0.4)
sc.tl.umap(adata)
sc.pl.umap(adata, color=['leiden'], legend_loc='on data', frameon=False, title='', use_raw=False)

# after doing all SCENIC steps, export into loom, but this loom is embedded with UMAP drawn by highly_variable_genes
export2loom(ex_mtx=adata.to_df(), regulons=regulons, out_fname=LOOM_FNAME,
            cell_annotations=adata.obs['leiden'].to_dict(), tree_structure=(),
            title='{}_{}'.format(TCGA_CODE, DATASET_ID), nomenclature="mm10", num_workers=16,
            embeddings=OrderedDict([('AUCell + umap', embedding_aucell_umap), ('PCA + umap', embedding_pca_umap)]),
            auc_mtx = auc_mtx, auc_thresholds=thresholds,
            compress=True)

# exprMat was generated by this loom file
from pyscenic.utils import modules_from_adjacencies
lf = lp.connect(LOOM_FNAME, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
lf.close()
SeppeDeWinter commented 2 years ago

Hi @hyjforesight

Indeed it won't work with the matrix which only includes highly variable genes.

If f_loom_path_scenic contains the matrix with all genes you should use this one.

i.e.

lf = lp.connect(f_loom_path_scenic, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
hyjforesight commented 2 years ago

Hello @SeppeDeWinter It's so kind of you! As you recommend, f_loom_path_scenic works. However, it generates TypeError: list indices must be integers or slices, not str.

# Create the modules
from pyscenic.utils import modules_from_adjacencies
lf = lp.connect(f_loom_path_scenic, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
lf.close()

# pick out modules for Atoh1
tf = 'Atoh1'
tf_mods = [ x for x in modules if x.transcription_factor==tf ]

for i,mod in enumerate( tf_mods ):
    print( f'{tf} module {str(i)}: {len(mod.genes)} genes' )
print( f'{tf} regulon: {len(regulons[tf])} genes' )

Atoh1 module 0: 1390 genes
Atoh1 module 1: 772 genes
Atoh1 module 2: 51 genes
Atoh1 module 3: 218 genes
Atoh1 module 4: 329 genes
Atoh1 module 5: 1027 genes
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_21964/1801393588.py in <module>
      5 for i,mod in enumerate( tf_mods ):
      6     print( f'{tf} module {str(i)}: {len(mod.genes)} genes' )
----> 7 print( f'{tf} regulon: {len(regulons[tf])} genes' )

TypeError: list indices must be integers or slices, not str

Here is the output of regulons

[Regulon(name='Ahr', gene2weight=frozendict.frozendict({'4933434E20Rik': 1.2266120420120072, 'Tsc22d2': 2.493458558861111, 'Relb': 1.6059775174853306, 'Wdr81': 2.7312633394303125, 'N4bp2': 2.517364206954634, 'Gm15417': 2.9100744069765594, 'Mrm3': 1.8340456841873831, 'Slc43a2': 3.2958363969906035, 'Ahr': 1.0, 'Pwp2': 1.4280092164572429, 'Snapc1': 1.375446330644375, 'Rrp8': 1.1964840576238376}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Ahr', context=frozenset({'cisbp__M5217.png', 'activating'}), score=1.1945444591398073, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
 Regulon(name='Arntl', gene2weight=frozendict.frozendict({'Gtf2h1': 1.620336555726875, 'Slc35a5': 1.91533670349294, 'Stxbp2': 1.8737801640073328, 'Slc20a1': 2.6734488198151336, 'Mms19': 1.5346505049314778, 'Zfp750': 3.07825326637646, 'Trpv3': 1.3858996748844017, 'Lmo7': 1.4820359038217124, 'Ppfia4': 2.2181361459582285, 'Vps11': 3.3435825773521883, 'Per2': 5.276266659342296, 'Net1': 1.6967910742666223, 'Miga2': 2.6992598513071764, 'Bhlhe40': 1.048126353591973, 'Cnnm1': 0.2176312578590142, 'Zfp296': 0.5270087719933082}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Arntl', context=frozenset({'transfac_pro__M08868.png', 'activating'}), score=3.6772977258300505, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
 Regulon(name='Atf1', gene2weight=frozendict.frozendict({'Haus5': 2.696609854293601, 'Ccna2': 3.966379215412084, 'Adnp2': 3.1412411963973, 'Knl1': 2.978216740685239, 'Zfp574': 2.6709956233650742, 'Gstcd': 4.495503046617849, 'Akt2': 7.703748410076394, 'Wdr33': 4.470218914364348, 'Osbpl8': 4.128940542758419, 'Dmtf1': 2.9518457104833984, 'Ecd': 4.228484233174275, 'Tufm': 3.241183423254145, 'Atf1': 1.0, 'Tmem79': 3.039202512848713, 'Txndc12': 2.7651144726669266, 'H2-T23': 4.091599231055049, 'Sars2': 4.88815935652636, 'Hist2h2ac': 6.218816342974511, 'Espl1': 5.891555428730409, 'Trim59': 5.5231863844652365, 'Repin1': 7.575504985608425, 'Cdc6': 4.835485993983268, 'Atp5g2': 6.370272443998833, 'Mycbpap': 1.786616119915864, 'Zranb3': 1.884314316529295}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Atf1', context=frozenset({'jaspar__MA0604.1.png', 'activating'}), score=3.7760690161205934, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
 Regulon(name='Atf2', gene2weight=frozendict.frozendict({'Ube2h': 0.897783367787763, 'Atxn1l': 1.0011890012392768, 'Spag4': 0.0504887189046225, 'Tsc22d2': 1.446490814474661, 'Fam53c': 1.0264002590265495, 'Celf3': 0.8921666004374457, 'Baz2b': 1.7387011414932263, 'Sstr2': 0.1305991991641035, 'Ppp1r15a': 2.255760431335529, 'Ccdc86': 1.4541805544558348, 'Per3': 0.2268716251155102, 'Tmem86a': 1.3049516716894871, 'Fdxacb1': 1.2384143650749797, 'Ift57': 0.1866952283639881, 'Dnajb4': 2.4881353785712856, 'Ift81': 0.6129619623801797, 'Nedd1': 2.330723795868671, 'Tfcp2': 0.9635833030927822, 'Atg2b': 0.6862142465533663, 'Pum3': 1.0244760901333978, 'A130010J15Rik': 0.6386723714882772, 'Cdc37': 1.8154579196928788, 'Topors': 1.6841496409710848, 'Gskip': 1.033190551910205, 'Ldb1': 1.0867077410261163, 'Arf4': 0.8804604227117785, 'Ppard': 1.789885033526931, 'U2af1l4': 0.9695135646637782, 'Cry2': 0.7857601469517936, 'Ppp1r2': 1.9509700635671885, 'Washc2': 2.6742612959268044, 'Hexim2': 0.3199586402604067, 'Tapt1': 1.2681447687682663, 'Tmcc3': 0.9065785116044712, 'Srrt': 1.1198520353424557, 'Fkbp7': 0.5068524674075681, 'Egr3': 0.6450683762001174, 'Mthfr': 1.6912746112244794, 'Slc16a12': 1.1161316547263636, 'Ccdc173': 0.3230739469914306}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Atf2', context=frozenset({'cisbp__M3088.png', 'activating'}), score=3.376461889441437, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
...
SeppeDeWinter commented 2 years ago

Hi

You're trying to index a list using a string key, this is not possible.

You should first convert the list of regulons to a dictionary (see tutorial):

# create a dictionary of regulons:
regulons = {}
for i,r in pd.DataFrame(lf.ra.Regulons,index=lf.ra.Gene).iteritems():
    regulons[i] =  list(r[r==1].index.values)
hyjforesight commented 2 years ago

Holy crap! You guys @SeppeDeWinter who're doing the coding things are so genius, my friend! You're the real PhD, not me (old guys sitting on the bench and doing pipetting).

Here attaches the coding I learn from @SeppeDeWinter. Hope it helps @cliftonlewis.

# Step 8: Cytoscape-iRegulon exploration of modules
adjacencies = pd.read_csv(ADJACENCIES_FNAME, index_col=False, sep='\t')
adjacencies

# Create the modules
from pyscenic.utils import modules_from_adjacencies
lf = lp.connect(f_loom_path_scenic, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
lf.close()

# create a dictionary of regulons for converting the index from string to values:
lf2 = lp.connect(LOOM_FNAME, mode='r', validate=False)
regulons_dict = {}
for i,r in pd.DataFrame(lf2.ra.Regulons, index=lf2.ra.Gene).iteritems():
    regulons_dict[i] =  list(r[r==1].index.values)
lf2.close()

# pick out modules for Atoh1
tf = 'Atoh1'
tf_mods = [ x for x in modules if x.transcription_factor==tf ]

for i,mod in enumerate( tf_mods ):
    print( f'{tf} module {str(i)}: {len(mod.genes)} genes' )
print( f'{tf} regulon: {len(regulons_dict[tf])} genes' )

# write these modules, and the regulon to files
for i,mod in enumerate( tf_mods ):
    with open( tf+'_module_'+str(i)+'.txt', 'w') as f:
        for item in mod.genes:
            f.write("%s\n" % item)

with open( tf+'_regulon.txt', 'w') as f:
    for item in regulons_dict[tf]:
        f.write("%s\n" % item)
SeppeDeWinter commented 2 years ago

Happy to help! Closing the issue for now, feel free to open again if there are further issues.

Seppe