run_scenicplus() error before inferring TF to gene relationships

Describe the bug Hi, I am trying to use SCENIC+ with mouse data. I ran through the entire pbmc tutorial with the wrapper function and got everything to work. When I get to run_scenicplus() with my own data though, I get an error right before inferring TF to gene relationships and then SCENIC+ stops running.

To Reproduce


import dill
import scanpy as sc
import os
import warnings
warnings.filterwarnings("ignore")
import pandas
import pyranges
import pickle
# Set stderr to null to avoid strange messages from ray
import sys
_stderr = sys.stderr                                                         
null = open(os.devnull,'wb')
work_dir =  '/scratch/jruiz/SCENIC_test'
save_dir = work_dir

adata = sc.read_h5ad(os.path.join('data/seu_all_samples.h5ad'))

cistopic_obj = dill.load(open(os.path.join("scATAC/cistopic_obj_modeled.pkl"), 'rb'))

menr = dill.load(open(os.path.join(save_dir, 'motif_enrichment_results/menr.pkl'), 'rb'))

from scenicplus.scenicplus_class import create_SCENICPLUS_object
import numpy as np

scplus_obj = create_SCENICPLUS_object(
  GEX_anndata=adata.raw.to_adata(),
  cisTopic_obj=cistopic_obj,
  menr=menr,
  bc_transform_func= lambda x: f'{x}-10x'
)

scplus_obj.X_EXP = np.array(scplus_obj.X_EXP.todense())
scplus_obj

from scenicplus.scenicplus_class import create_SCENICPLUS_object

biomart_host = "http://nov2020.archive.ensembl.org"

#only keep the first two columns of the PCA embedding in order to be able to visualize this in SCope
scplus_obj.dr_cell['GEX_X_pca'] = scplus_obj.dr_cell['GEX_X_pca'].iloc[:, 0:2]
scplus_obj.dr_cell['GEX_X_umap'] = scplus_obj.dr_cell['GEX_X_umap'].iloc[:, 0:2]

tmp_dir = '/scratch/temp'

from scenicplus.wrappers.run_scenicplus import run_scenicplus
try:
  run_scenicplus(
    scplus_obj = scplus_obj,
    variable = ['GEX_celltype'],
    species = 'mmusculus',
    assembly = 'mm10',
    tf_file = 'mouse_tf_gerg.gsc.riken.jp.txt',
    save_path = os.path.join(save_dir, 'scenicplus'),
    biomart_host = biomart_host,
    upstream = [1000, 150000],
    downstream = [1000, 150000],
    calculate_TF_eGRN_correlation = True,
    calculate_DEGs_DARs = True,
    export_to_loom_file = True,
    export_to_UCSC_file = True,
    path_bedToBigBed = 'pbmc_tutorial',
    n_cpu = 8,
    _temp_dir = tmp_dir
  )

Error output


2024-04-10 10:05:23,251 cisTopic     INFO     Impute region accessibility for regions 760000-780000
2024-04-10 10:05:25,402 cisTopic     INFO     Impute region accessibility for regions 780000-800000
2024-04-10 10:05:27,477 cisTopic     INFO     Impute region accessibility for regions 800000-820000
2024-04-10 10:05:29,648 cisTopic     INFO     Impute region accessibility for regions 820000-840000
2024-04-10 10:05:31,890 cisTopic     INFO     Impute region accessibility for regions 840000-860000
2024-04-10 10:05:33,987 cisTopic     INFO     Impute region accessibility for regions 860000-880000
2024-04-10 10:05:36,086 cisTopic     INFO     Impute region accessibility for regions 880000-900000
2024-04-10 10:05:38,210 cisTopic     INFO     Impute region accessibility for regions 900000-920000
2024-04-10 10:05:40,282 cisTopic     INFO     Done!
2024-04-10 10:06:25,058 SCENIC+_wrapper INFO     Created folder : /scratch/jruiz/SCENIC_test/scenicplus
2024-04-10 10:06:25,058 SCENIC+_wrapper INFO     Merging cistromes
2024-04-10 10:09:54,757 SCENIC+_wrapper INFO     Getting search space
2024-04-10 10:09:58,889 R2G          INFO     Downloading gene annotation from biomart dataset: mmusculus_gene_ensembl
2024-04-10 10:10:08,895 R2G          INFO     Downloading chromosome sizes from: http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/mm10.chrom.sizes
2024-04-10 10:10:09,549 R2G          INFO     Extending promoter annotation to 10 bp upstream and 10 downstream
Warning! Start and End columns now have different dtypes: int32 and int64
Warning! Start and End columns now have different dtypes: int32 and int64
2024-04-10 10:10:10,547 R2G          INFO     Extending search space to:
                                    150000 bp downstream of the end of the gene.
                                    150000 bp upstream of the start of the gene.
Warning! Start and End columns now have different dtypes: int32 and int64
Warning! Start and End columns now have different dtypes: int32 and int64
2024-04-10 10:10:15,369 R2G          INFO     Intersecting with regions.
Warning! Start and End columns now have different dtypes: int32 and int64
2024-04-10 10:10:18,170 R2G          INFO     Calculating distances from region to gene
2024-04-10 10:17:56,510 R2G          INFO     Imploding multiple entries per region and gene
2024-04-10 10:32:54,814 R2G          INFO     Done!
2024-04-10 10:32:56,469 SCENIC+_wrapper INFO     Inferring region to gene relationships
2024-04-10 10:32:57,386 R2G          INFO     Calculating region to gene importances, using GBM method
join: Strand data from other will be added as strand data to self.
If this is undesired use the flag apply_strand_suffix=False.
To turn off the warning set apply_strand_suffix to True or False.
2024-04-10 10:33:21,764 INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
initializing: 0it [00:00, ?it/s]
Running using 8 cores: 0it [00:00, ?it/s]2024-04-10 10:33:25,499 R2G          INFO     Took 28.11260724067688 seconds
2024-04-10 10:33:25,499 R2G          INFO     Calculating region to gene correlation, using SR method

2024-04-10 10:33:32,685 INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
initializing: 0it [00:00, ?it/s]
Running using 8 cores: 0it [00:00, ?it/s]2024-04-10 10:33:35,291 R2G          INFO     Took 9.791584491729736 seconds
An error occured!
2024-04-10 10:33:35,366 SCENIC+_wrapper INFO     Inferring TF to gene relationships

Traceback (most recent call last):
  File "scenic_plus_run.py", line 78, in <module>
    raise(e)
NameError: name 'e' is not defined

Expected behavior For SCENIC+ to continue running

Version (please complete the following information):

Python: 3.8.18
SCENIC+: 1.0.1.dev6+ge5ba6fc

Hi @jenelysRO

This seems to be an error coming from your script scenic_plus_run.py line 78.


File "scenic_plus_run.py", line 78, in <module>
    raise(e)

I guess you have a try except statement there?

This should look like.


try:
   <CODE>
except Exception as e:
  raise(e)

Best,

Seppe

Thanks for your quick response! I ran the code without the try and Exception snippet and got rid of the error. However, now I run into another error:


2024-04-11 15:38:30,410 INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
initializing: 0it [00:00, ?it/s]
Running using 8 cores: 0it [00:00, ?it/s]2024-04-11 15:38:33,028 R2G          INFO     Took 8.398192167282104 seconds
An error occured!
2024-04-11 15:38:33,079 SCENIC+_wrapper INFO     Inferring TF to gene relationships

Traceback (most recent call last):
  File "scenic_plus_run.py", line 63, in <module>
    run_scenicplus(
  File "/tools/python/3.8.18.GCC10/lib/python3.8/site-packages/scenicplus/wrappers/run_scenicplus.py", line 156, in run_scenicplus
    calculate_TFs_to_genes_relationships(scplus_obj, 
  File "/tools/python/3.8.18.GCC10/lib/python3.8/site-packages/scenicplus/TF_to_gene.py", line 283, in calculate_TFs_to_genes_relationships
    ex_matrix, gene_names, tf_names = _prepare_input(
  File "/tools/python/3.8.18.GCC10/lib/python3.8/site-packages/arboreto/algo.py", line 229, in _prepare_input
    raise ValueError('Intersection of gene_names and tf_names is empty.')
ValueError: Intersection of gene_names and tf_names is empty.

I checked my TF file and it is formatted similar to the one in the pbmc tutorial, and the scenic object looks alright too and has all the necessary slots. How could I fix this?

Hi @jenelysRO

First of all I would strongly recommend to use the development branch of SCENIC+ (I'm going to switch that branch to the new branch in the next week, so the code you are using now is going to get deprecated. Just as a heads up). See: https://github.com/aertslab/scenicplus/tree/development for more info.

Can you run the following code and show the output please:


from arboreto.utils import load_tf_names
tf_names = load_tf_names(<TF_FILE>)

print(tf_names[0:10])

gene_names = scplus_obj.gene_names
cell_names = scplus_obj.cell_names
ex_matrix = scplus_obj.X_EXP

print(gene_names[0:10])

print(len(set(tf_names) & set(gene_names)))

Best,

Seppe

I see, the genes in scplus_obj are in an "_index" column within scplus_obj.metadata_genes, resulting in the following:

>>> from arboreto.utils import load_tf_names
>>> tf_names = load_tf_names('/scratch/jruiz/mouse_tf_gerg.gsc.riken.jp.txt')
>>> 
>>> print(tf_names[0:10])
['Adnp', 'Aebp1', 'Aebp2', 'Nr0b1', 'Ahr', 'Ahrr', 'Aip', 'Aire', 'Alx3', 'Alx4']
>>> 
>>> gene_names = scplus_obj.gene_names
>>> cell_names = scplus_obj.cell_names
>>> ex_matrix = scplus_obj.X_EXP
>>> 
>>> print(gene_names[0:10])
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype='object')
>>> 
>>> print(len(set(tf_names) & set(gene_names)))
0
>>> scplus_obj.gene_names
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
       ...
       '24589', '24590', '24591', '24592', '24593', '24594', '24595', '24596',
       '24597', '24598'],
      dtype='object', length=24599)
>>> scplus_obj.metadata_genes
               _index
0                Xkr4
1              Gm1992
2             Gm37381
3                 Rp1
4              Mrpl15
...               ...
24594  CAAA01165726.1
24595        Hashtag5
24596        Hashtag6
24597        Hashtag7
24598        Hashtag8

[24599 rows x 1 columns]

I can't edit the contents of scplus_obj.gene_names as scplus_obj.metadata_genes['_index'] directly. How would you suggest I fix this? Thanks in advance!

Just wanted to post the solution here! It seems like the error was caused by using Seurat rather than Scanpy to process the scRNA-seq data. When I created the scenicplus object with adata.raw.to_adata(), the .var slot contained rows as numbers and an "_index" column with the genes, and create_SCENICPLUS_object would take the row names as the genes in scplus_obj.gene_names. To fix this issue, I did the following:

test_data = adata.raw.to_adata()
test_data.var.columns = ['genes']
test_data.var.set_index('genes', inplace=True)
scplus_obj = create_SCENICPLUS_object(
  GEX_anndata=test_data,
  cisTopic_obj=cistopic_obj,
  menr=menr)

scplus_obj.X_EXP = np.array(scplus_obj.X_EXP.todense())

aertslab / scenicplus

run_scenicplus() error before inferring TF to gene relationships #353