Closed cliftonlewis closed 2 years ago
@cliftonlewis,
This error means that there was a gene found in the network output (adjacencies file) that is not present in your expression matrix. You must be using a different expression matrix for this step vs. the network inference step -- there's no other explanation (ok, there could a weird bug in this check step).
You have two options: 1) re-run the network inference on your new expression matrix, or 2) remove the extra gene(s) from your adjacencies file (take all genes in the TF and Target columns of your adjacencies and compare with your expression matrix genes, find the one that shouldn't be there, and remove it from the network output file). I would strongly suggest option 1 since there's no way to know what else changed with the expression matrix.
Thank you very much @cflerin
I have spent the last week trying every which to redo the network inference and I'm still not able to get this working. I tried the other option of just removing the gene of issue and it does fix it but as you said that's not the ideal method. I'm still not sure why this would be the case and I double-checked that the expression matrix is the right one etc. I may just have to forgo this last step for visualisation
Hello @cflerin @cliftonlewis ,
Same issue here.
All procedures start from ADJACENCIES_FNAME
and I exported the final result into loom file LOOM_FNAME
.
export2loom(ex_mtx=adata.to_df(), regulons=regulons, out_fname=LOOM_FNAME,
cell_annotations=adata.obs['pca_leiden'].to_dict(), tree_structure=(),
title='{}_{}'.format(TCGA_CODE, DATASET_ID), nomenclature="mm10", num_workers=16,
embeddings=OrderedDict([('AUCell + umap', embedding_aucell_umap), ('PCA + umap', embedding_pca_umap)]),
auc_mtx = auc_mtx, auc_thresholds=thresholds,
compress=True)
Then I reload the ADJACENCIES_FNAME
into adjacencies
. exprMat
is also loaded from LOOM_FNAME
, but running modules_from_adjacencies
creates errors.
adjacencies = pd.read_csv(ADJACENCIES_FNAME, index_col=False, sep='\t')
from pyscenic.utils import modules_from_adjacencies
lf = lp.connect(LOOM_FNAME, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
lf.close()
AssertionError: Found 7350 genes present in the network (adjacencies) output, but missing from the expression matrix. Is this a different gene expression matrix?
I also submit this issue here. https://github.com/aertslab/pySCENIC/issues/355
@hyjforesight could you show part of the data contained in the adjacencies
and exprMat
variable, for example using the head command.
Using the following command you can get the genes which are only present in the adjacencies but not in the expression matrix. This might help with troubleshooting as well.
unique_adj_genes = set(adjacencies["TF"]).union(set(adjacencies["target"])) - set(ex_mtx.columns)
Hello @SeppeDeWinter , Thanks for the recommendations. Sorry for the late response, I was busy with some wet experimental things.
By calling adjacencies
and exprMat
, I realize that adjacencies
was generated from the whole genes, whereas exprMat
was generated from highly variable genes. I did a view of my adata
and use this view for generating UMAP and then embed with the AUCell. Please see below.
Because it's a common way to only use the highly variable genes for generating a UMAP, but GRNBoost2 requires the whole genes, the gene numbers of adjacencies
and exprMat
will not be matched. In this case, how do we solve the AssertionError: Found 7350 genes present in the network (adjacencies) output, but missing from the expression matrix. Is this a different gene expression matrix
error?
Thanks!
Best,
YJ
f_loom_path_scenic
includes the whole genes for GRNBoost2, generating adjacencies
.
# preprocessing by eliminating some unwanted cells, all genes are included
sc.pl.highest_expr_genes(adata, n_top=20)
sc.pp.filter_cells(adata, min_genes=100)
sc.pp.filter_genes(adata, min_cells=25)
adata.var['mt'] = adata.var_names.str.startswith('mt-')
adata.var['rpl'] = adata.var_names.str.startswith('Rpl')
adata.var['rps'] = adata.var_names.str.startswith('Rps')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt','rpl','rps'], percent_top=None, log1p=False, inplace=True)
sc.pl.violin(adata, keys=['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_rpl','pct_counts_rps'], jitter=0.4, multi_panel=True)
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='pct_counts_rpl')
sc.pl.scatter(adata, x='total_counts', y='pct_counts_rps')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')
adata = adata[adata.obs.n_genes_by_counts < 6000, :]
adata = adata[adata.obs.pct_counts_mt < 50, :]
adata = adata[adata.obs.pct_counts_rpl < 50, :]
adata = adata[adata.obs.pct_counts_rps < 50, :]
# write the preprocessed data into f_loom_path_scenic
row_attrs = {"Gene": np.array(adata.var_names)}
col_attrs = {"CellID": np.array(adata.obs_names), "nGene": np.array(np.sum(adata.X.transpose()>0, axis=0)).flatten(),
"nUMI": np.array(np.sum(adata.X.transpose(), axis=0)).flatten()}
lp.create(f_loom_path_scenic, adata.X.transpose(), row_attrs, col_attrs)
# use f_loom_path_scenic for GRNBoost2
!pyscenic grn {f_loom_path_scenic} {Mouse_TFS_FNAME} -o {ADJACENCIES_FNAME} --num_workers 16
LOOM_FNAME
only includes the highly_variable_genes
, generating exprMat
.
# do a view of adata by highly_variable_genes and generate UMAP
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=5000)
sc.pl.highly_variable_genes(adata)
print(sum(adata.var.highly_variable))
adata.raw=adata
adata = adata[:, adata.var.highly_variable]
sc.pp.regress_out(adata, keys=['pct_counts_mt','pct_counts_rpl','pct_counts_rps'], n_jobs=16)
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata, n_pcs=50, knn=True)
sc.tl.leiden(adata, resolution=0.4)
sc.tl.umap(adata)
sc.pl.umap(adata, color=['leiden'], legend_loc='on data', frameon=False, title='', use_raw=False)
# after doing all SCENIC steps, export into loom, but this loom is embedded with UMAP drawn by highly_variable_genes
export2loom(ex_mtx=adata.to_df(), regulons=regulons, out_fname=LOOM_FNAME,
cell_annotations=adata.obs['leiden'].to_dict(), tree_structure=(),
title='{}_{}'.format(TCGA_CODE, DATASET_ID), nomenclature="mm10", num_workers=16,
embeddings=OrderedDict([('AUCell + umap', embedding_aucell_umap), ('PCA + umap', embedding_pca_umap)]),
auc_mtx = auc_mtx, auc_thresholds=thresholds,
compress=True)
# exprMat was generated by this loom file
from pyscenic.utils import modules_from_adjacencies
lf = lp.connect(LOOM_FNAME, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
lf.close()
Hi @hyjforesight
Indeed it won't work with the matrix which only includes highly variable genes.
If f_loom_path_scenic
contains the matrix with all genes you should use this one.
i.e.
lf = lp.connect(f_loom_path_scenic, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
Hello @SeppeDeWinter
It's so kind of you!
As you recommend, f_loom_path_scenic
works. However, it generates TypeError: list indices must be integers or slices, not str
.
# Create the modules
from pyscenic.utils import modules_from_adjacencies
lf = lp.connect(f_loom_path_scenic, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
lf.close()
# pick out modules for Atoh1
tf = 'Atoh1'
tf_mods = [ x for x in modules if x.transcription_factor==tf ]
for i,mod in enumerate( tf_mods ):
print( f'{tf} module {str(i)}: {len(mod.genes)} genes' )
print( f'{tf} regulon: {len(regulons[tf])} genes' )
Atoh1 module 0: 1390 genes
Atoh1 module 1: 772 genes
Atoh1 module 2: 51 genes
Atoh1 module 3: 218 genes
Atoh1 module 4: 329 genes
Atoh1 module 5: 1027 genes
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_21964/1801393588.py in <module>
5 for i,mod in enumerate( tf_mods ):
6 print( f'{tf} module {str(i)}: {len(mod.genes)} genes' )
----> 7 print( f'{tf} regulon: {len(regulons[tf])} genes' )
TypeError: list indices must be integers or slices, not str
Here is the output of regulons
[Regulon(name='Ahr', gene2weight=frozendict.frozendict({'4933434E20Rik': 1.2266120420120072, 'Tsc22d2': 2.493458558861111, 'Relb': 1.6059775174853306, 'Wdr81': 2.7312633394303125, 'N4bp2': 2.517364206954634, 'Gm15417': 2.9100744069765594, 'Mrm3': 1.8340456841873831, 'Slc43a2': 3.2958363969906035, 'Ahr': 1.0, 'Pwp2': 1.4280092164572429, 'Snapc1': 1.375446330644375, 'Rrp8': 1.1964840576238376}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Ahr', context=frozenset({'cisbp__M5217.png', 'activating'}), score=1.1945444591398073, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
Regulon(name='Arntl', gene2weight=frozendict.frozendict({'Gtf2h1': 1.620336555726875, 'Slc35a5': 1.91533670349294, 'Stxbp2': 1.8737801640073328, 'Slc20a1': 2.6734488198151336, 'Mms19': 1.5346505049314778, 'Zfp750': 3.07825326637646, 'Trpv3': 1.3858996748844017, 'Lmo7': 1.4820359038217124, 'Ppfia4': 2.2181361459582285, 'Vps11': 3.3435825773521883, 'Per2': 5.276266659342296, 'Net1': 1.6967910742666223, 'Miga2': 2.6992598513071764, 'Bhlhe40': 1.048126353591973, 'Cnnm1': 0.2176312578590142, 'Zfp296': 0.5270087719933082}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Arntl', context=frozenset({'transfac_pro__M08868.png', 'activating'}), score=3.6772977258300505, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
Regulon(name='Atf1', gene2weight=frozendict.frozendict({'Haus5': 2.696609854293601, 'Ccna2': 3.966379215412084, 'Adnp2': 3.1412411963973, 'Knl1': 2.978216740685239, 'Zfp574': 2.6709956233650742, 'Gstcd': 4.495503046617849, 'Akt2': 7.703748410076394, 'Wdr33': 4.470218914364348, 'Osbpl8': 4.128940542758419, 'Dmtf1': 2.9518457104833984, 'Ecd': 4.228484233174275, 'Tufm': 3.241183423254145, 'Atf1': 1.0, 'Tmem79': 3.039202512848713, 'Txndc12': 2.7651144726669266, 'H2-T23': 4.091599231055049, 'Sars2': 4.88815935652636, 'Hist2h2ac': 6.218816342974511, 'Espl1': 5.891555428730409, 'Trim59': 5.5231863844652365, 'Repin1': 7.575504985608425, 'Cdc6': 4.835485993983268, 'Atp5g2': 6.370272443998833, 'Mycbpap': 1.786616119915864, 'Zranb3': 1.884314316529295}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Atf1', context=frozenset({'jaspar__MA0604.1.png', 'activating'}), score=3.7760690161205934, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
Regulon(name='Atf2', gene2weight=frozendict.frozendict({'Ube2h': 0.897783367787763, 'Atxn1l': 1.0011890012392768, 'Spag4': 0.0504887189046225, 'Tsc22d2': 1.446490814474661, 'Fam53c': 1.0264002590265495, 'Celf3': 0.8921666004374457, 'Baz2b': 1.7387011414932263, 'Sstr2': 0.1305991991641035, 'Ppp1r15a': 2.255760431335529, 'Ccdc86': 1.4541805544558348, 'Per3': 0.2268716251155102, 'Tmem86a': 1.3049516716894871, 'Fdxacb1': 1.2384143650749797, 'Ift57': 0.1866952283639881, 'Dnajb4': 2.4881353785712856, 'Ift81': 0.6129619623801797, 'Nedd1': 2.330723795868671, 'Tfcp2': 0.9635833030927822, 'Atg2b': 0.6862142465533663, 'Pum3': 1.0244760901333978, 'A130010J15Rik': 0.6386723714882772, 'Cdc37': 1.8154579196928788, 'Topors': 1.6841496409710848, 'Gskip': 1.033190551910205, 'Ldb1': 1.0867077410261163, 'Arf4': 0.8804604227117785, 'Ppard': 1.789885033526931, 'U2af1l4': 0.9695135646637782, 'Cry2': 0.7857601469517936, 'Ppp1r2': 1.9509700635671885, 'Washc2': 2.6742612959268044, 'Hexim2': 0.3199586402604067, 'Tapt1': 1.2681447687682663, 'Tmcc3': 0.9065785116044712, 'Srrt': 1.1198520353424557, 'Fkbp7': 0.5068524674075681, 'Egr3': 0.6450683762001174, 'Mthfr': 1.6912746112244794, 'Slc16a12': 1.1161316547263636, 'Ccdc173': 0.3230739469914306}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Atf2', context=frozenset({'cisbp__M3088.png', 'activating'}), score=3.376461889441437, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
...
Hi
You're trying to index a list using a string key, this is not possible.
You should first convert the list of regulons to a dictionary (see tutorial):
# create a dictionary of regulons:
regulons = {}
for i,r in pd.DataFrame(lf.ra.Regulons,index=lf.ra.Gene).iteritems():
regulons[i] = list(r[r==1].index.values)
Holy crap! You guys @SeppeDeWinter who're doing the coding things are so genius, my friend! You're the real PhD, not me (old guys sitting on the bench and doing pipetting).
Here attaches the coding I learn from @SeppeDeWinter. Hope it helps @cliftonlewis.
# Step 8: Cytoscape-iRegulon exploration of modules
adjacencies = pd.read_csv(ADJACENCIES_FNAME, index_col=False, sep='\t')
adjacencies
# Create the modules
from pyscenic.utils import modules_from_adjacencies
lf = lp.connect(f_loom_path_scenic, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
lf.close()
# create a dictionary of regulons for converting the index from string to values:
lf2 = lp.connect(LOOM_FNAME, mode='r', validate=False)
regulons_dict = {}
for i,r in pd.DataFrame(lf2.ra.Regulons, index=lf2.ra.Gene).iteritems():
regulons_dict[i] = list(r[r==1].index.values)
lf2.close()
# pick out modules for Atoh1
tf = 'Atoh1'
tf_mods = [ x for x in modules if x.transcription_factor==tf ]
for i,mod in enumerate( tf_mods ):
print( f'{tf} module {str(i)}: {len(mod.genes)} genes' )
print( f'{tf} regulon: {len(regulons_dict[tf])} genes' )
# write these modules, and the regulon to files
for i,mod in enumerate( tf_mods ):
with open( tf+'_module_'+str(i)+'.txt', 'w') as f:
for item in mod.genes:
f.write("%s\n" % item)
with open( tf+'_regulon.txt', 'w') as f:
for item in regulons_dict[tf]:
f.write("%s\n" % item)
Happy to help! Closing the issue for now, feel free to open again if there are further issues.
Seppe
Describe the bug A clear and concise description of what the bug is.
Steps to reproduce the behavior
Command run when the error occurred:
Error encountered:
/home/shared/anaconda3/envs/scenic3/lib/python3.6/site-packages/pyscenic/utils.py in modules_from_adjacencies(adjacencies, ex_mtx, thresholds, top_n_targets, top_n_regulators, min_genes, absolute_thresholds, rho_dichotomize, keep_only_activating, rho_threshold, rho_mask_dropouts) 267 # test for genes present in the adjacencies but not present in the expression matrix: 268 unique_adj_genes = set(adjacencies[COLUMN_NAME_TF]).union(set(adjacencies[COLUMN_NAME_TARGET])) - set(ex_mtx.columns) --> 269 assert len(unique_adj_genes)==0, f"Found {len(unique_adj_genes)} genes present in the network (adjacencies) output, but missing from the expression matrix. Is this a different gene expression matrix?" 270 LOGGER.warn(f"Note on correlation calculation: the default behaviour for calculating the correlations has changed after pySCENIC verion 0.9.16. Previously, the default was to calculate the correlation between a TF and target gene using only cells with non-zero expression values (mask_dropouts=True). The current default is now to use all cells to match the behavior of the R verision of SCENIC. The original settings can be retained by setting 'rho_mask_dropouts=True' in the modules_from_adjacencies function, or '--mask_dropouts' from the CLI.\n\tDropout masking is currently set to [{rho_mask_dropouts}].") 271 adjacencies = add_correlation(adjacencies, ex_mtx,
AssertionError: Found 1 genes present in the network (adjacencies) output, but missing from the expression matrix. Is this a different gene expression matrix?