ziyangliu93 commented 4 years ago

Hi everyone, I am trying to use the pyscenic workflow to infer grn on my 10X scRNA dataset. I completely followed the demonstration workflow posted on SCENICprotocol cancer data sets using SKCM scRNA data. The problem is, when I ran pyscenic ctx subcommand, I got the following error when writing results to files and no matter how I checked, I have no idea what happened.

734 2019-11-19 23:44:53,454 - pyscenic.transform - WARNING - Less than 80% of the genes in TBX3 could be mapped to hg38refseq-r80500 735 736 2019-11-19 23:44:53,579 - pyscenic.transform - WARNING - Less than 80% of the genes in TBX6 could be mapped to hg38refseq-r80500 737 738 2019-11-19 23:45:01,445 - pyscenic.transform - WARNING - Less than 80% of the genes in HES2 could be mapped to hg38refseq-r80500 739 740 2019-11-19 23:45:01,590 - pyscenic.transform - WARNING - Less than 80% of the genes in TEF could be mapped to hg38refseq-r80500b 741 742 2019-11-19 23:45:04,100 - pyscenic.transform - WARNING - Less than 80% of the genes in HES5 could be mapped to hg38refseq-r80500 743 744 2019-11-19 23:45:07,137 - pyscenic.transform - WARNING - Less than 80% of the genes in HIC1 could be mapped to hg38refseq-r80500 745 746 2019-11-19 23:45:08,539 - pyscenic.transform - WARNING - Less than 80% of the genes in ESR2 could be mapped to hg38refseq-r80500 747 748 2019-11-19 23:49:29,864 - pyscenic.cli.pyscenic - INFO - Writing results to file. 749 Traceback (most recent call last): 750 File "/home/liuziyang/miniconda3/envs/pyscenic/lib/python3.7/site-packages/pandas/core/frame.py", line 3540, in _ensure_valid_inde 751 value = Series(value) 752 File "/home/liuziyang/miniconda3/envs/pyscenic/lib/python3.7/site-packages/pandas/core/series.py", line 316, in init 753 data = SingleBlockManager(data, index, fastpath=True) 754 File "/home/liuziyang/miniconda3/envs/pyscenic/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1516, in ini 755 block = make_block(block, placement=slice(0, len(axis)), ndim=1) 756 File "/home/liuziyang/miniconda3/envs/pyscenic/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 3284, in make_bl 757 return klass(values, ndim=ndim, placement=placement) 758 File "/home/liuziyang/miniconda3/envs/pyscenic/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 2792, in init_ 759 super().init(values, ndim=ndim, placement=placement) 760 File "/home/liuziyang/miniconda3/envs/pyscenic/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 128, in init 761 "{mgr}".format(val=len(self.values), mgr=len(self.mgr_locs)) 762 ValueError: Wrong number of items passed 8, placement implies 0 763 ...skipping... 750 File "/home/liuziyang/miniconda3/envs/pyscenic/lib/python3.7/site-packages/pandas/core/frame.py", line 3540, in _ensure_valid_inde 751 value = Series(value) 752 File "/home/liuziyang/miniconda3/envs/pyscenic/lib/python3.7/site-packages/pandas/core/series.py", line 316, in init 753 data = SingleBlockManager(data, index, fastpath=True) 754 File "/home/liuziyang/miniconda3/envs/pyscenic/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1516, in ini 755 block = make_block(block, placement=slice(0, len(axis)), ndim=1) 756 File "/home/liuziyang/miniconda3/envs/pyscenic/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 3284, in make_bl 757 return klass(values, ndim=ndim, placement=placement) 758 File "/home/liuziyang/miniconda3/envs/pyscenic/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 2792, in init_ 759 super().init(values, ndim=ndim, placement=placement) 760 File "/home/liuziyang/miniconda3/envs/pyscenic/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 128, in init 761 "{mgr}".format(val=len(self.values), mgr=len(self.mgr_locs)) 762 ValueError: Wrong number of items passed 8, placement implies 0 763 764 During handling of the above exception, another exception occurred: 765 766 Traceback (most recent call last): 767 File "/Bailab3/liuziyang/Program/ShellScript/pyscenic/pyscenic.case.study.v22.py", line 177, in 768 regulons = derive_regulons(df_motifs) 769 File "/Bailab3/liuziyang/Program/ShellScript/pyscenic/pyscenic.case.study.v22.py", line 173, in derive_regulons 770 regulons = list(filter(lambda r: len(r) >= 10, df2regulons(motifs[(motifs['NES'] >= 3.0) & ((motifs['Annotation'] == 'gene is d 771 File "/home/liuziyang/miniconda3/envs/pyscenic/lib/python3.7/site-packages/pyscenic/transform.py", line 297, in df2regulons 772 df[COLUMN_NAME_TYPE] = df.apply(get_type,axis=1) 773 File "/home/liuziyang/miniconda3/envs/pyscenic/lib/python3.7/site-packages/pandas/core/frame.py", line 3487, in setitem 774 self._set_item(key, value) 775 File "/home/liuziyang/miniconda3/envs/pyscenic/lib/python3.7/site-packages/pandas/core/frame.py", line 3563, in _set_item 776 self._ensure_valid_index(value) 777 File "/home/liuziyang/miniconda3/envs/pyscenic/lib/python3.7/site-packages/pandas/core/frame.py", line 3543, in _ensure_valid_inde 778 "Cannot set a frame with no defined index " 779 ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series

Please help

bramvds commented 4 years ago

Dear,

The problem seems to be in creation of regulons from the enriched motif table. A temporary and quick workaround could be the use the CLI command for the cisTarget step to create the enriched motif table instead of the regulons and then use some Python code to the transformation to regulons (cf https://github.com/aertslab/pySCENIC/issues/105#issuecomment-553561902).

To generate the enriched motif table from the CLI use a the -o option from pyscenic ctx subtask with '.csv' or '.tsv' as the filename extension:

--output Output file/stream, i.e. a table of enriched motifs and target genes (csv, tsv) or collection of regulons (yaml, gmt, dat, json).

Kindest regards, Bram

ziyangliu93 commented 4 years ago

Thank you for your reply. But I already tried that. Here is my code, hopefully you will find something odd here.

import os, glob, re, pickle from functools import partial from collections import OrderedDict import operator as op from cytoolz import compose

import pandas as pd import seaborn as sns import numpy as np import scanpy as sc import anndata as ad import matplotlib as mpl import matplotlib.pyplot as plt

from pyscenic.export import export2loom, add_scenic_metadata from pyscenic.utils import load_motifs from pyscenic.transform import df2regulons from pyscenic.aucell import aucell from pyscenic.binarization import binarize from pyscenic.rss import regulon_specificity_scores from pyscenic.plotting import plot_binarization, plot_rss

from IPython.display import HTML, display

------------------------------------------------------------------------------------------

Set maximum number of jobs for Scanpy.

sc.settings.njobs = 20

Specify folders.

RESOURCES_FOLDERNAME = "/ailab7/PROJECT/ziyang/CC_hetero/scRNA/pyscenic/resources/" #TPM matrix and cell annotation AUXILLIARIES_FOLDERNAME = "/ailab3/ziyang/Database/SCENIC/" #Feather file RESULTS_FOLDERNAME = "/ailab7/PROJECT/ziyang/CC_hetero/scRNA/pyscenic/results/" # For results FIGURES_FOLDERNAME = "/ailab7/PROJECT/ziyang/CC_hetero/scRNA/pyscenic/figures/" # For plots sc.settings.figdir = FIGURES_FOLDERNAME

Auxilliary functions.

BASE_URL = "http://motifcollections.aertslab.org/v9/logos/" COLUMN_NAME_LOGO = "MotifLogo" COLUMN_NAME_MOTIF_ID = "MotifID" COLUMN_NAME_TARGETS = "TargetGenes"

def savesvg(fname: str, fig, folder: str=FIGURES_FOLDERNAME) -> None: """ Save figure as vector-based SVG image format. """ fig.tight_layout() fig.savefig(os.path.join(folder, fname), format='svg')

def display_logos(df: pd.DataFrame, top_target_genes: int = 3, base_url: str = BASE_URL): """ :param df: :param base_url: """

Make sure the original dataframe is not altered.

df = df.copy()

# Add column with URLs to sequence logo.
def create_url(motif_id):
    return '<img src="{}{}.png" style="max-height:124px;"></img>'.format(base_url, motif_id)
df[("Enrichment", COLUMN_NAME_LOGO)] = list(map(create_url, df.index.get_level_values(COLUMN_NAME_MOTIF_ID)))

# Truncate TargetGenes.
def truncate(col_val):
    return sorted(col_val, key=op.itemgetter(1))[:top_target_genes]
df[("Enrichment", COLUMN_NAME_TARGETS)] = list(map(truncate, df[("Enrichment", COLUMN_NAME_TARGETS)]))

MAX_COL_WIDTH = pd.get_option('display.max_colwidth')
pd.set_option('display.max_colwidth', -1)
display(HTML(df.head().to_html(escape=False)))
pd.set_option('display.max_colwidth', MAX_COL_WIDTH)

Auxilliary data sets.

HUMAN_TFS_FNAME = os.path.join(AUXILLIARIES_FOLDERNAME, 'hs_hgnc_curated_tfs.txt') RANKING_DBS_FNAMES = list(map(lambda fn: os.path.join(AUXILLIARIES_FOLDERNAME, fn), ['hg38refseq-r8010kb_up_and_down_tss.mc9nr.feather','hg38refseq-r80500bp_up_and_100bp_down_tss.mc9nr.feather'])) MOTIF_ANNOTATIONS_FNAME = os.path.join(AUXILLIARIES_FOLDERNAME, 'motifs-v9-nr.hgnc-m0.001-o0.0.tbl')

Data acquisition and result storage.

DATASET_ID = "P11" CODE = 'CC' CELL_ANNOTATIONS_FNAME = os.path.join(RESOURCES_FOLDERNAME, "CC_cell.annotations.csv")

SAMPLE_METADATA_FNAME = os.path.join(RESOURCES_FOLDERNAME, "1-s2.0-S0092867418311784-mmc1.xlsx")

EXP_MTX_TPM_FNAME = os.path.join(RESOURCES_FOLDERNAME, 'CC_tpm.csv')

EXP_MTX_COUNTS_FNAME = os.path.join(RESOURCES_FOLDERNAME, 'GSE115978_counts.csv')

METADATA_FNAME = os.path.join(RESULTS_FOLDERNAME, '{}.metadata.csv'.format(DATASET_ID)) EXP_MTX_QC_FNAME = os.path.join(RESULTS_FOLDERNAME, '{}.qc.tpm.csv'.format(DATASET_ID)) ADJACENCIES_FNAME = os.path.join(RESULTS_FOLDERNAME, '{}.adjacencies.tsv'.format(DATASET_ID)) MOTIFS_FNAME = os.path.join(RESULTS_FOLDERNAME, '{}.motifs.csv'.format(DATASET_ID)) REGULONS_DAT_FNAME = os.path.join(RESULTS_FOLDERNAME, '{}.regulons.dat'.format(DATASET_ID)) AUCELL_MTX_FNAME = os.path.join(RESULTS_FOLDERNAME, '{}.auc.csv'.format(DATASET_ID)) BIN_MTX_FNAME = os.path.join(RESULTS_FOLDERNAME, '{}.bin.csv'.format(DATASET_ID)) THR_FNAME = os.path.join(RESULTS_FOLDERNAME, '{}.thresholds.csv'.format(DATASET_ID)) ANNDATA_FNAME = os.path.join(RESULTS_FOLDERNAME, '{}.h5ad'.format(DATASET_ID)) LOOM_FNAME = os.path.join(RESULTSFOLDERNAME, '{}{}.loom'.format(CODE, DATASET_ID))

STEP 0: Preprocessing.

STEP 0.1: METADATA CLEANING. Optional

df_annotations = pd.read_csv(CELL_ANNOTATIONS_FNAME)

df_annotations['samples'] = df_annotations['samples'].str.upper()

df_annotations.rename(columns={'cell.types': 'cell_type', 'cells': 'cell_id', 'samples': 'sample_id', 'treatment.group': 'treatment_group', 'Cohort': 'cohort'}, inplace=True)

df_annotations['cell_type'] = df_annotations.cell_type.replace({

'Mal': 'Malignant Cell',

'Endo.': 'Endothelial Cell',

'T.CD4': 'Thelper Cell',

'CAF': 'Fibroblast',

'T.CD8': 'Tcytotoxic Cell',

'T.cell': 'T Cell',

'NK': 'NK Cell',

'B.cell': 'B Cell'})

df_samples = pd.read_excel(SAMPLE_METADATA_FNAME, header=2)

df_samples = df_samples.drop(['Cohort'], axis=1)

df_samples['Sample'] = df_samples.Sample.str.upper()

df_metadata = pd.merge(df_annotations, df_samples, left_on='sample_id', right_on='Sample')

df_metadata = df_metadata.drop(['Sample', 'treatment_group'], axis=1)

df_metadata.rename(columns={'Patient': 'patient_id',

'Age': 'age', 'Sex': 'sex', 'Lesion type': 'lesion_type', 'Site': 'site',

'Treatment': 'treatment', 'Treatment group': 'treatment_group'}, inplace=True)

df_annotations.to_csv(METADATA_FNAME, index=False)

STEP 0.2: EXPRESSION MATRIX QC

df_counts = pd.read_csv(EXP_MTX_COUNTS_FNAME, index_col=0)

df_tpm = pd.read_csv(EXP_MTX_TPM_FNAME, index_col=0) adata = sc.AnnData(X=df_tpm.T.sort_index())

df_obs = df_annotations[['Cell', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'Sample','Cell_type']].set_index('Cell').sort_index()

adata.obs = df_obs

adata.var_names_make_unique()

$sc.pp.filter_cells(adata, min_genes=200)

sc.pp.filter_genes(adata, min_cells=5)

Store non-log transformed data as raw. This data can be used via the use_raw parameters available for many functions.

In the scanpy's tutorials this is used to stored all genes in log-transformed counts before retaining only Highly Variable Genes (HVG).

Because in this case no filtering is done we use this feature to store raw counts.

adata.original = adata

sc.pp.log1p(adata)

adata.write_h5ad(ANNDATA_FNAME) # Categorical dtypes are created. adata.to_df().to_csv(EXP_MTX_QC_FNAME)

STEP 1: Network inference based on GRNBoost2 from CLI.

For this step the CLI version of SCENIC is used. This step can be deployed on an High Performance Computing system.

Output: List of adjacencies between a TF and its targets stored in ADJACENCIES_FNAME.

commands_grn = 'pyscenic grn {} {} -o {} --num_workers 30'.format(EXP_MTX_QC_FNAME, HUMAN_TFS_FNAME, ADJACENCIES_FNAME) os.system(commands_grn)

pyscenic grn {EXP_MTX_QC_FNAME} {HUMAN_TFS_FNAME} -o {ADJACENCIES_FNAME} --num_workers 10

STEP 2-3: Regulon prediction aka cisTarget from CLI.

For this step the CLI version of SCENIC is used. This step can be deployed on an High Performance Computing system.

Output: List of adjacencies between a TF and its targets stored in MOTIFS_FNAME.

DBS_PARAM = ' '.join(RANKING_DBS_FNAMES)

DBS_PARAM = '/Bailab3/liuziyang/Database/SCENIC/hg38refseq-r8010kb_up_and_down_tss.mc9nr.feather /Bailab3/liuziyang/Database/SCENIC/hg38refseq-r80500bp_up_and_100bp_down_tss.mc9nr.feather'

commands_ctx = 'pyscenic ctx {} {} --annotations_fname {} --expression_mtx_fname {} -o {} --num_workers 30'.format(ADJACENCIES_FNAME, DBS_PARAM, MOTIF_ANNOTATIONS_FNAME, EXP_MTX_QC_FNAME, MOTIFS_FNAME)

commands_ctx = 'pyscenic ctx -o {} --annotations_fname {} --num_workers 30 --expression_mtx_fname {} {} {}'.format(MOTIFS_FNAME, MOTIF_ANNOTATIONS_FNAME, EXP_MTX_QC_FNAME, ADJACENCIES_FNAME, DBS_PARAM) os.system(commands_ctx)

df_motifs = load_motifs(MOTIFS_FNAME)

STEP 4: Cellular enrichment aka AUCell.

REGULON CREATION.

def derive_regulons(motifs, db_names=('hg38refseq-r8010kb_up_and_down_tss', 'hg38refseq-r80500bp_up_and_100bp_down_tss')): motifs.columns = motifs.columns.droplevel(0) def contains(*elems): def f(context): return any(elem in context for elem in elems) return f

For the creation of regulons we only keep the 10-species databases and the activating modules. We also remove the

# enriched motifs for the modules that were created using the method 'weight>50.0%' (because these modules are not part
# of the default settings of modules_from_adjacencies anymore.
motifs = motifs[
    np.fromiter(map(compose(op.not_, contains('weight>50.0%')), motifs.Context), dtype=np.bool) & \
    np.fromiter(map(contains(*db_names), motifs.Context), dtype=np.bool) & \
    np.fromiter(map(contains('activating'), motifs.Context), dtype=np.bool)]
# We build regulons only using enriched motifs with a NES of 3.0 or higher; we take only directly annotated TFs or TF annotated
# for an orthologous gene into account; and we only keep regulons with at least 10 genes.
regulons = list(filter(lambda r: len(r) >= 10, df2regulons(motifs[(motifs['NES'] >= 3.0)  & ((motifs['Annotation'] == 'gene is directly annotated') | (motifs['Annotation'].str.startswith('gene is orthologous to') & motifs['Annotation'].str.endswith('which is directly annotated for motif')))])))
# Rename regulons, i.e. remove suffix.
return list(map(lambda r: r.rename(r.transcription_factor), regulons))

regulons = derive_regulons(df_motifs)

Pickle these regulons.

with open(REGULONS_DAT_FNAME, 'wb') as f: pickle.dump(regulons, f)

AUCELL.

auc_mtx = aucell(df_tpm.T, regulons, num_workers=20) auc_mtx.to_csv(AUCELL_MTX_FNAME) auc_mtx = pd.read_csv(AUCELL_MTX_FNAME, index_col=0)

OPTIONAL STEP 5 - Regulon activity binarization.

bin_mtx, thresholds = binarize(auc_mtx) bin_mtx.to_csv(BIN_MTX_FNAME) thresholds.to_frame().rename(columns={0:'threshold'}).to_csv(THR_FNAME) bin_mtx = pd.read_csv(BIN_MTX_FNAME, index_col=0) thresholds = pd.read_csv(THR_FNAME, index_col=0).threshold

AUC score of selected genes across cells.

fig, ((ax1, ax2, ax3, ax4), (ax5, ax6, ax7, ax8)) = plt.subplots(2, 4, figsize=(8, 4), dpi=300) plot_binarization(auc_mtx, 'NFKB2', thresholds['NFKB2'], ax=ax1) plot_binarization(auc_mtx, 'MITF', thresholds['MITF'], ax=ax2) plot_binarization(auc_mtx, 'FOXP3', thresholds['FOXP3'], ax=ax3) plot_binarization(auc_mtx, 'PAX5', thresholds['PAX5'], ax=ax4) plot_binarization(auc_mtx, 'IRF8', thresholds['IRF8'], ax=ax5) plot_binarization(auc_mtx, 'IRF3', thresholds['IRF3'], ax=ax6) plot_binarization(auc_mtx, 'MLX', thresholds['MLX'], ax=ax7) plot_binarization(auc_mtx, 'YY1', thresholds['YY1'], ax=ax8) plt.tight_layout() savesvg('hists-ICC_P11-binarization.svg', fig)

Create heatmap with binarized regulon activity.

def palplot(pal, names, colors=None, size=1): n = len(pal) f, ax = plt.subplots(1, 1, figsize=(n size, size)) ax.imshow(np.arange(n).reshape(1, n), cmap=mpl.colors.ListedColormap(list(pal)), interpolation="nearest", aspect="auto") ax.set_xticks(np.arange(n) - .5) ax.set_yticks([-.5, .5]) ax.set_xticklabels([]) ax.set_yticklabels([]) colors = n ['k'] if colors is None else colors for idx, (name, color) in enumerate(zip(names, colors)): ax.text(0.0+idx, 0.0, name, color=color, horizontalalignment='center', verticalalignment='center') return f

N_COLORS = len(adata.obs.Sample.dtype.categories) COLORS = [color['color'] for color in mpl.rcParams["axes.prop_cycle"]] cell_type_color_lut = dict(zip(adata.obs.Sample.dtype.categories, COLORS))

cell_type_color_lut = dict(zip(adata.obs.cell_type.dtype.categories, adata.uns['cell_type_colors']))

cell_id2cell_type_lut = df_metadata.set_index('Cell').cell_type.to_dict() bw_palette = sns.xkcd_palette(["white", "black"]) sns.set() sns.set_style("whitegrid") fig = palplot(bw_palette, ['OFF', 'ON'], ['k', 'w']) savesvg('legend-CC_P11-on_off.svg', fig)

sns.set() sns.set(font_scale=1.0) sns.set_style("ticks", {"xtick.minor.size": 1, "ytick.minor.size": 0.1}) g = sns.clustermap(bin_mtx.T, col_colors=auc_mtx.index.map(cell_id2cell_type_lut).map(cell_type_color_lut), cmap=bw_palette, figsize=(20,20)) g.ax_heatmap.set_xticklabels([]) g.ax_heatmap.set_xticks([]) g.ax_heatmap.set_xlabel('Cells') g.ax_heatmap.set_ylabel('Regulons') g.ax_col_colors.set_yticks([0.5]) g.ax_col_colors.set_yticklabels(['Cell Type']) g.cax.set_visible(False) g.fig.savefig(os.path.join(FIGURES_FOLDERNAME, 'clustermap-ICC_P11.png'), format='png')

Save clustered binarized heatmap to Excel for further inspection.

bin_mtx_clustered = bin_mtx.T.copy() bin_mtx_clustered.rename(columns=df_annotations.set_index('Cell')['Sample'].to_dict(), inplace=True) bin_mtx_clustered.iloc[g.dendrogram_row.reordered_ind, g.dendrogram_col.reordered_ind].to_excel(os.path.join(RESULTS_FOLDERNAME, 'ICC_P11-Binarized regulon activity.xlsx'))

Generate sequence logos.

def fetch_logo(regulon, base_url = BASE_URL): for elem in regulon.context: if elem.endswith('.png'): return ''.format(base_url, elem) return ""

df_regulons = pd.DataFrame(data=[list(map(op.attrgetter('name'), regulons)), list(map(len, regulons)), list(map(fetch_logo, regulons))], index=['name', 'count', 'logo']).T

MAX_COL_WIDTH = pd.get_option('display.max_colwidth') pd.set_option('display.max_colwidth', -1)

display(HTML(df_regulons.head().to_html(escape=False)))

pd.set_option('display.max_colwidth', MAX_COL_WIDTH)

STEP 6: Non-linear projection and clustering.

In this step a scanpy.AnnData object is created containing all metadata and results.

First, we select and retain only the most variable genes in the dataset.

PCA + tSNE PROJECTION.

sns.set() sc.pp.highly_variable_genes(adata) sc.pl.highly_variable_genes(adata) adata = adata[:, adata.var['highly_variable']]

Then we apply a linear-dimensional reduction technique (PCA).

sc.tl.pca(adata, svd_solver='arpack') sc.pl.pca(adata, color='CD8A') sc.pl.pca_variance_ratio(adata, log=False)

Followed by a tSNE projection.

sc.tl.tsne(adata) sc.set_figure_params(frameon=False, dpi=300, fontsize=8) sc.pl.tsne(adata, color=['Sample'], title=['CC_P11-Sample', 'CC_P11-{} Samples'.format(len(adata.obs.Sample.unique()))], palette=COLORS, save='CC_P11-PCA+tSNE.svg')

We capture the non-linear projection based on PCA+tSNE for later storage in the loom file.

embedding_pca_tsne = pd.DataFrame(adata.obsm['X_tsne'], columns=[['_X', '_Y']], index=adata.obs_names)

We add all metadata derived from SCENIC to the scanpy.AnnData object.

add_scenic_metadata(adata, auc_mtx, regulons) adata.write_h5ad(ANNDATA_FNAME)

AUCELL + tSNE PROJECTION.

We change the tSNE projection so that it relies on AUCell instead of PCA.

sc.tl.tsne(adata, use_rep='X_aucell') embedding_aucell_tsne = pd.DataFrame(adata.obsm['X_tsne'], columns=[['_X', '_Y']], index=adata.obs_names) sc.set_figure_params(frameon=False, dpi=300, fontsize=8) sc.pl.tsne(adata, color=['cell_type', 'lesion_type', 'patient_id'], title=['CC_P11-Sample', 'CC_P11-{} Samples'.format(len(adata.obs.Sample.unique()))], palette=COLORS, save='ICC_P11-AUCell+tSNE.svg')

CELL TYPE SPECIFIC REGULATORS - Z-SCORE.

To find cell type specific regulators we use a Z score (i.e. the average AUCell score for the cells of a give type are standardized using the overall average AUCell scores and its standard deviation).

df_obs = adata.obs signature_column_names = list(df_obs.select_dtypes('number').columns) signature_column_names = list(filter(lambda s: s.startswith('Regulon('), signature_column_names)) df_scores = df_obs[signature_column_names + ['Sample']] df_results = ((df_scores.groupby(by='Sample').mean() - df_obs[signature_column_names].mean())/ df_obs[signature_column_names].std()).stack().reset_index().rename(columns={'level_1': 'regulon', 0:'Z'}) df_results['regulon'] = list(map(lambda s: s[8:-1], df_results.regulon)) df_results[(df_results.Z >= 3.0)].sort_values('Z', ascending=False)

df_heatmap = pd.pivot_table(data=df_results[df_results.Z >= 3.0].sort_values('Z', ascending=False), index='cell_type', columns='regulon', values='Z')

df_heatmap.drop(index='Myocyte', inplace=True) # We leave out Myocyte because many TFs are highly enriched (becuase of small number of cells).

fig, ax1 = plt.subplots(1, 1, figsize=(10, 8)) sns.heatmap(df_heatmap, ax=ax1, annot=True, fmt=".1f", linewidths=.7, cbar=False, square=True, linecolor='gray', cmap="YlGnBu", annot_kws={"size": 6}) ax1.set_ylabel('') savesvg('heatmap-ICC_P11-regulons.svg', fig)

CELL TYPE SPECIFIC REGULATORS - RSS.

rss = regulon_specificity_scores(auc_mtx, adata.obs.Sample)

sns.set()

sns.set(style='whitegrid', font_scale=0.8)

fig, ((ax1, ax2, ax3, ax4), (ax5, ax6, ax7, ax8)) = plt.subplots(2, 4, figsize=(8, 6), dpi=300)

plot_rss(rss, 'B Cell', ax=ax1)

ax1.set_xlabel('')

plot_rss(rss, 'T Cell', ax=ax2)

ax2.set_xlabel('')

ax2.set_ylabel('')

plot_rss(rss, 'Thelper Cell', ax=ax3)

ax3.set_xlabel('')

ax3.set_ylabel('')

plot_rss(rss, 'Tcytotoxic Cell', ax=ax4)

ax4.set_xlabel('')

ax4.set_ylabel('')

plot_rss(rss, 'NK Cell', ax=ax5)

plot_rss(rss, 'Fibroblast', ax=ax6)

ax6.set_ylabel('')

plot_rss(rss, 'Macrophage', ax=ax7)

ax7.set_ylabel('')

plot_rss(rss, 'Endothelial Cell', ax=ax8)

ax8.set_ylabel('')

plt.tight_layout()

savesvg('plots - GSE115978 - rss.svg', fig)

sc.set_figure_params(frameon=False, dpi=300, fontsize=8)

sc.pl.tsne(adata, color=['Sample', 'PAX5', 'TCF7', 'EOMES', 'TBX21', 'PRRX2', 'MAFB'],

title=['GSE115978 - SKCM - Cell types', 'PAX5', 'TCF7', 'EOMES', 'TBX21', 'PRRX2', 'MAFB'], ncols=4, use_raw=False, save=' - GSE115978 - rss_gene.svg', palette=COLORS, cmap='magma')

#

sc.set_figure_params(frameon=False, dpi=300, fontsize=8)

sc.pl.tsne(adata, color=['Sample', 'Regulon(PAX5)', 'Regulon(TCF7)', 'Regulon(EOMES)', 'Regulon(TBX21)', 'Regulon(PRRX2)', 'Regulon(MAFB)', 'Regulon(SOX7)'],

title=['GSE115978 - SKCM - Cell types', 'Regulon(PAX5)', 'Regulon(TCF7)', 'Regulon(EOMES)', 'Regulon(TBX21)', 'Regulon(PRRX2)', 'Regulon(MAFB)', 'Regulon(SOX7)'], ncols=4, use_raw=False,

save=' - GSE115978 - rss_regulon.svg', palette=COLORS, cmap='viridis')

#

SELECTED REGULONS.

With the metadata from SCENIC add to the AnnData object, we can create cellular scatterplots for regulon activity. In the example we compare MITF regulon enrichment with the expression of the MITF gene.

sc.pl.tsne(adata, color=['cell_type', 'Regulon(MITF)', 'Regulon(TWIST1)'],

title=['GSE115978 - SKCM - Cell types', 'Regulon(MITF)', 'Regulon(TWIST1)'], ncols=3, use_raw=False, save=' - GSE115978 - selected_regulons1.svg', palette=COLORS, cmap='viridis')

sc.pl.tsne(adata, color=['cell_type', 'MITF'],

title=['GSE115978 - SKCM - Cell types', 'MITF'], ncols=3, use_raw=False, save=' - GSE115978 - selected_regulons2.svg', palette=COLORS, cmap='magma')

#

sc.set_figure_params(frameon=False, dpi=300, fontsize=8)

sc.pl.tsne(adata, color=['cell_type', 'Regulon(MYC)'],

title=['GSE115978 - SKCM - Cell types', 'MYC - Regulon'], ncols=3, palette=COLORS, use_raw=False, save=' - GSE115978 - AUCell+tSNE - Regulon MYC.svg', cmap='viridis')

#

AUCELL DISTRIBUTIONS.

We can also analyze the distribution of AUCell values via the plots in the scanpy package.

df_results.sort_values('Z', ascending=False).groupby(by='cell_type').head(2)

aucell_adata = sc.AnnData(X=auc_mtx.sort_index())

aucell_adata.obs = df_obs

names = list(map(op.attrgetter('name'), filter(lambda r: r.score > 8.0, regulons)))

sc.pl.stacked_violin(aucell_adata, names, groupby='cell_type', save=' - GSE115978 - regulons.svg')

STEP 7: Export to SCope.

export2loom(adata.to_df(), regulons, LOOM_FNAME, cell_annotations=adata.obs['Sample'].to_dict(), embeddings=OrderedDict([('AUCell + tSNE', embedding_aucell_tsne), ('PCA + tSNE', embedding_pca_tsne)]), auc_mtx = auc_mtx, treestructure=(), title='{}{}'.format(CODE, DATASET_ID), nomenclature="HGNC", auc_thresholds=thresholds, compress=True)

I commented out some codes here for use in my case.

kentsingxk commented 4 years ago

Dear,

The problem seems to be in creation of regulons from the enriched motif table. A temporary and quick workaround could be the use the CLI command for the cisTarget step to create the enriched motif table instead of the regulons and then use some Python code to the transformation to regulons (cf #105 (comment)).

To generate the enriched motif table from the CLI use a the -o option from pyscenic ctx subtask with '.csv' or '.tsv' as the filename extension:
--output Output file/stream, i.e. a table of enriched motifs and target genes (csv, tsv) or collection of regulons (yaml, gmt, dat, json).
Kindest regards, Bram

Dear Bram! I also encountered the same problem in the step "REGULON CREATION". I run the pipeline completely followed the cancer case study, and when i run:

!pyscenic ctx {ADJACENCIES_FNAME} {DBS_PARAM} --annotations_fname {MOTIF_ANNOTATIONS_FNAME} --expression_mtx_fname {EXP_MTX_QC_FNAME} --output {MOTIFS_FNAME} --num_workers 26 --mask_dropouts

df_motifs = load_motifs(MOTIFS_FNAME)

regulons = derive_regulons(df_motifs)

#################### I got the following error messages:

ValueError Traceback (most recent call last) ~/miniconda3/envs/scenic_protocol/lib/python3.6/site-packages/pandas/core/frame.py in _ensure_valid_index(self, value) 3539 try: -> 3540 value = Series(value) 3541 except (ValueError, NotImplementedError, TypeError):

~/miniconda3/envs/scenic_protocol/lib/python3.6/site-packages/pandas/core/series.py in init(self, data, index, dtype, name, copy, fastpath) 315 --> 316 data = SingleBlockManager(data, index, fastpath=True) 317

~/miniconda3/envs/scenic_protocol/lib/python3.6/site-packages/pandas/core/internals/managers.py in init(self, block, axis, do_integrity_check, fastpath) 1515 if not isinstance(block, Block): -> 1516 block = make_block(block, placement=slice(0, len(axis)), ndim=1) 1517

~/miniconda3/envs/scenic_protocol/lib/python3.6/site-packages/pandas/core/internals/blocks.py in make_block(values, placement, klass, ndim, dtype, fastpath) 3283 -> 3284 return klass(values, ndim=ndim, placement=placement) 3285

~/miniconda3/envs/scenic_protocol/lib/python3.6/site-packages/pandas/core/internals/blocks.py in init(self, values, placement, ndim) 2791 -> 2792 super().init(values, ndim=ndim, placement=placement) 2793

~/miniconda3/envs/scenic_protocol/lib/python3.6/site-packages/pandas/core/internals/blocks.py in init(self, values, placement, ndim) 127 "Wrong number of items passed {val}, placement implies " --> 128 "{mgr}".format(val=len(self.values), mgr=len(self.mgr_locs)) 129 )

ValueError: Wrong number of items passed 8, placement implies 0

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)

in ----> 1 regulons = derive_regulons(df_motifs) in derive_regulons(motifs, db_names) 22 & ((motifs['Annotation'] == 'gene is directly annotated') 23 | (motifs['Annotation'].str.startswith('gene is orthologous to') ---> 24 & motifs['Annotation'].str.endswith('which is directly annotated for motif'))) 25 ]))) 26 ~/miniconda3/envs/scenic_protocol/lib/python3.6/site-packages/pyscenic/transform.py in df2regulons(df, save_columns) 316 # Activating is the default! 317 return REPRESSING_MODULE if REPRESSING_MODULE in ctx else ACTIVATING_MODULE --> 318 df[COLUMN_NAME_TYPE] = df.apply(get_type,axis=1) 319 320 # Group all rows per TF and type (+)/(-). Each group results in a single regulon. ~/miniconda3/envs/scenic_protocol/lib/python3.6/site-packages/pandas/core/frame.py in __setitem__(self, key, value) 3485 else: 3486 # set column -> 3487 self._set_item(key, value) 3488 3489 def _setitem_slice(self, key, value): ~/miniconda3/envs/scenic_protocol/lib/python3.6/site-packages/pandas/core/frame.py in _set_item(self, key, value) 3561 """ 3562 -> 3563 self._ensure_valid_index(value) 3564 value = self._sanitize_column(key, value) 3565 NDFrame._set_item(self, key, value) ~/miniconda3/envs/scenic_protocol/lib/python3.6/site-packages/pandas/core/frame.py in _ensure_valid_index(self, value) 3541 except (ValueError, NotImplementedError, TypeError): 3542 raise ValueError( -> 3543 "Cannot set a frame with no defined index " 3544 "and a value that cannot be converted to a " 3545 "Series" ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series ------------------------------------------------------------------ It seems something wrong with the derive_regulons steps. Please help Many thanks!

rlittman16 commented 3 years ago

I have the same problem

adding this option helped. --mask_dropouts

aertslab / pySCENIC

pyscenic ctx unknown error #107

from IPython.display import HTML, display

------------------------------------------------------------------------------------------

Set maximum number of jobs for Scanpy.

Specify folders.

Auxilliary functions.

Make sure the original dataframe is not altered.

Auxilliary data sets.

Data acquisition and result storage.

SAMPLE_METADATA_FNAME = os.path.join(RESOURCES_FOLDERNAME, "1-s2.0-S0092867418311784-mmc1.xlsx")

EXP_MTX_COUNTS_FNAME = os.path.join(RESOURCES_FOLDERNAME, 'GSE115978_counts.csv')

STEP 0: Preprocessing.

STEP 0.1: METADATA CLEANING. Optional

df_annotations['samples'] = df_annotations['samples'].str.upper()

df_annotations.rename(columns={'cell.types': 'cell_type', 'cells': 'cell_id', 'samples': 'sample_id', 'treatment.group': 'treatment_group', 'Cohort': 'cohort'}, inplace=True)

df_annotations['cell_type'] = df_annotations.cell_type.replace({

'Mal': 'Malignant Cell',

'Endo.': 'Endothelial Cell',

'T.CD4': 'Thelper Cell',

'CAF': 'Fibroblast',

'T.CD8': 'Tcytotoxic Cell',

'T.cell': 'T Cell',

'NK': 'NK Cell',

'B.cell': 'B Cell'})

df_samples = pd.read_excel(SAMPLE_METADATA_FNAME, header=2)

df_samples = df_samples.drop(['Cohort'], axis=1)

df_samples['Sample'] = df_samples.Sample.str.upper()

df_metadata = pd.merge(df_annotations, df_samples, left_on='sample_id', right_on='Sample')

df_metadata = df_metadata.drop(['Sample', 'treatment_group'], axis=1)

df_metadata.rename(columns={'Patient': 'patient_id',

'Age': 'age', 'Sex': 'sex', 'Lesion type': 'lesion_type', 'Site': 'site',

'Treatment': 'treatment', 'Treatment group': 'treatment_group'}, inplace=True)

STEP 0.2: EXPRESSION MATRIX QC

df_counts = pd.read_csv(EXP_MTX_COUNTS_FNAME, index_col=0)

df_obs = df_annotations[['Cell', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'Sample','Cell_type']].set_index('Cell').sort_index()

adata.obs = df_obs

$sc.pp.filter_cells(adata, min_genes=200)

Store non-log transformed data as raw. This data can be used via the use_raw parameters available for many functions.

In the scanpy's tutorials this is used to stored all genes in log-transformed counts before retaining only Highly Variable Genes (HVG).

Because in this case no filtering is done we use this feature to store raw counts.

adata.original = adata

sc.pp.log1p(adata)

STEP 1: Network inference based on GRNBoost2 from CLI.

For this step the CLI version of SCENIC is used. This step can be deployed on an High Performance Computing system.

Output: List of adjacencies between a TF and its targets stored in ADJACENCIES_FNAME.

pyscenic grn {EXP_MTX_QC_FNAME} {HUMAN_TFS_FNAME} -o {ADJACENCIES_FNAME} --num_workers 10

STEP 2-3: Regulon prediction aka cisTarget from CLI.

For this step the CLI version of SCENIC is used. This step can be deployed on an High Performance Computing system.

Output: List of adjacencies between a TF and its targets stored in MOTIFS_FNAME.

DBS_PARAM = ' '.join(RANKING_DBS_FNAMES)

commands_ctx = 'pyscenic ctx {} {} --annotations_fname {} --expression_mtx_fname {} -o {} --num_workers 30'.format(ADJACENCIES_FNAME, DBS_PARAM, MOTIF_ANNOTATIONS_FNAME, EXP_MTX_QC_FNAME, MOTIFS_FNAME)

STEP 4: Cellular enrichment aka AUCell.

REGULON CREATION.

For the creation of regulons we only keep the 10-species databases and the activating modules. We also remove the

Pickle these regulons.

AUCELL.

OPTIONAL STEP 5 - Regulon activity binarization.

AUC score of selected genes across cells.

Create heatmap with binarized regulon activity.

cell_type_color_lut = dict(zip(adata.obs.cell_type.dtype.categories, adata.uns['cell_type_colors']))

Save clustered binarized heatmap to Excel for further inspection.

Generate sequence logos.

display(HTML(df_regulons.head().to_html(escape=False)))

STEP 6: Non-linear projection and clustering.

In this step a scanpy.AnnData object is created containing all metadata and results.

First, we select and retain only the most variable genes in the dataset.

PCA + tSNE PROJECTION.

Then we apply a linear-dimensional reduction technique (PCA).

Followed by a tSNE projection.

We capture the non-linear projection based on PCA+tSNE for later storage in the loom file.

We add all metadata derived from SCENIC to the scanpy.AnnData object.

AUCELL + tSNE PROJECTION.

We change the tSNE projection so that it relies on AUCell instead of PCA.

CELL TYPE SPECIFIC REGULATORS - Z-SCORE.

To find cell type specific regulators we use a Z score (i.e. the average AUCell score for the cells of a give type are standardized using the overall average AUCell scores and its standard deviation).

df_heatmap.drop(index='Myocyte', inplace=True) # We leave out Myocyte because many TFs are highly enriched (becuase of small number of cells).

CELL TYPE SPECIFIC REGULATORS - RSS.

rss = regulon_specificity_scores(auc_mtx, adata.obs.Sample)

sns.set()