aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0
420 stars 179 forks source link

Unable to pick out a specific module for iRegulon #355

Closed hyjforesight closed 2 years ago

hyjforesight commented 2 years ago

Hello pySCENIC, I run the pySCENIC following the cancer dataset tutorial (http://htmlpreview.github.io/?https://github.com/aertslab/SCENICprotocol/blob/master/notebooks/SCENIC%20Protocol%20-%20Case%20study%20-%20Cancer%20data%20sets.html) and successfully generated the integrated loom file. However, it generates errors when I want to pick out the module for a specific gene for iRegulon. Please check the coding below. The error is AssertionError: Found 7350 genes present in the network (adjacencies) output, but missing from the expression matrix. Is this a different gene expression matrix?

ADJACENCIES_FNAME = os.path.join(RESULTS_FOLDERNAME, '{}.adjacencies.tsv'.format(DATASET_ID))
MOTIFS_FNAME = os.path.join(RESULTS_FOLDERNAME, '{}.motifs.csv'.format(DATASET_ID))
REGULONS_DAT_FNAME = os.path.join(RESULTS_FOLDERNAME, '{}.regulons.dat'.format(DATASET_ID))
AUCELL_MTX_FNAME = os.path.join(RESULTS_FOLDERNAME, '{}.aucell.csv'.format(DATASET_ID))
BINARY_MTX_FNAME = os.path.join(RESULTS_FOLDERNAME, '{}.binary.csv'.format(DATASET_ID))
THR_FNAME = os.path.join(RESULTS_FOLDERNAME, '{}.thresholds.csv'.format(DATASET_ID))
ANNDATA_FNAME = os.path.join(RESULTS_FOLDERNAME, '{}.h5ad'.format(DATASET_ID))
LOOM_FNAME = os.path.join(RESULTS_FOLDERNAME, '{}_{}.loom'.format(TCGA_CODE, DATASET_ID))

...

export2loom(adata.to_df(), regulons, LOOM_FNAME,
            cell_annotations=adata.obs['pca_leiden'].to_dict(),
            embeddings=OrderedDict([('AUCell + umap', embedding_aucell_umap), ('PCA + umap', embedding_pca_umap)]),
            auc_mtx = auc_mtx,
            tree_structure=(),
            title='{}_{}'.format(TCGA_CODE, DATASET_ID),
            nomenclature="mm10", auc_thresholds=thresholds,
            compress=True)

adjacencies = pd.read_csv(ADJACENCIES_FNAME, index_col=False, sep='\t')
from pyscenic.utils import modules_from_adjacencies
lf = lp.connect(LOOM_FNAME, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
lf.close()

2022-01-04 17:16:27,928 - pyscenic.utils - INFO - Calculating Pearson correlations.
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_19376/2303614780.py in <module>
      3 lf = lp.connect(LOOM_FNAME, mode='r', validate=False)
      4 exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
----> 5 modules = list(modules_from_adjacencies(adjacencies, exprMat))
      6 lf.close()

~\anaconda3\envs\Python3812\lib\site-packages\pyscenic\utils.py in modules_from_adjacencies(adjacencies, ex_mtx, thresholds, top_n_targets, top_n_regulators, min_genes, absolute_thresholds, rho_dichotomize, keep_only_activating, rho_threshold, rho_mask_dropouts)
    294                 ex_mtx.columns
    295             )
--> 296             assert (
    297                 len(unique_adj_genes) == 0
    298             ), f"Found {len(unique_adj_genes)} genes present in the network (adjacencies) output, but missing from the expression matrix. Is this a different gene expression matrix?"

AssertionError: Found 7350 genes present in the network (adjacencies) output, but missing from the expression matrix. Is this a different gene expression matrix?

I guess it may be because we filtered out the genes (weight>50.0%, len(r) >= 10, NES>3) when creating regulons by below coding? But when I try to include all regulons by changing arguments to weight>0%, len(r) >= 0, NES>0, the error keeps the same.

# REGULON CREATION
def derive_regulons(motifs, db_names=('mm10__refseq-r80__10kb_up_and_down_tss.mc9nr', 'mm10__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr')):
    motifs.columns = motifs.columns.droplevel(0)

    def contains(*elems):
        def f(context):
            return any(elem in context for elem in elems)
        return f

    # We remove the enriched motifs for the modules that were created using the method 'weight>50.0%'
    # because these modules are not part of the default settings of modules_from_adjacencies anymore
    motifs = motifs[
        np.fromiter(map(compose(op.not_, contains('weight>50.0%')), motifs.Context), dtype=np.bool) & \
        np.fromiter(map(contains(*db_names), motifs.Context), dtype=np.bool) & \
        np.fromiter(map(contains('activating'), motifs.Context), dtype=np.bool)]

    # We build regulons only using enriched motifs with a NES of 3.0 or higher; we take only directly annotated TFs or TF annotated
    # for an orthologous gene into account; and we only keep regulons with at least 10 genes.
    regulons = list(filter(lambda r: len(r) >= 10, df2regulons(motifs[(motifs['NES'] >= 3) 
                                                                      & ((motifs['Annotation'] == 'gene is directly annotated')
                                                                         | (motifs['Annotation'].str.startswith('gene is orthologous to')
                                                                            & motifs['Annotation'].str.endswith('which is directly annotated for motif')))
                                                                     ])))

    # Rename regulons, i.e. remove suffix.
    return list(map(lambda r: r.rename(r.transcription_factor), regulons))

Thanks! Best, YJ

SeppeDeWinter commented 2 years ago

As the error states AssertionError: Found 7350 genes present in the network (adjacencies) output, but missing from the expression matrix. Is this a different gene expression matrix? have you made sure that the gene expression matrix you use in the function modules_from_adjacencies is the exact same matrix you use for generating the regulons? (most of the time this is the raw gene expression matrix on which no filtering is done).

You can also check this by looking which genes in adjacencies are not present in exprMat.

SeppeDeWinter commented 2 years ago

Closing due to inactivity, feel free to open again.

hyjforesight commented 2 years ago

Hello @SeppeDeWinter , I wrote the response in a similar issue (https://github.com/aertslab/pySCENIC/issues/347). Thanks!