Fully support sparse matrices: mismatch matrix indices in calculate_TFs_to_genes_relationships()

decarlin commented 2 years ago

Describe the bug

Calling calculate_TFs_to_genes_relationships(), after initialization, throws this error:

Traceback (most recent call last):
  File "", line 1, in
  File "/home/ubuntu/scenicplus/src/scenicplus/TF_to_gene.py", line 333, in calculate_TFs_to_genes_relationships
    ex_matrix = pd.DataFrame(
  File "/home/ubuntu/miniconda3/envs/scenicplus/lib/python3.8/site-packages/pandas/core/frame.py", line 737, in __init__
    mgr = ndarray_to_mgr(
  File "/home/ubuntu/miniconda3/envs/scenicplus/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 351, in ndarray_to_mgr
    _check_values_indices_shape_match(values, index, columns)
  File "/home/ubuntu/miniconda3/envs/scenicplus/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 422, in _check_values_indices_shape_match
    raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (25945, 1), indices imply (25945, 18892)

To Reproduce

Here's the code, starting from a successfully created cistopic object:

import itertools
import pickle

import anndata
import pandas as pd
import scanpy as sc

with open('/data/scenic_demo/output/cistopic_obj.pkl', 'rb') as f:
    cistopic_obj = pickle.load(f)

adata = sc.read_mtx('/data/CAREHF/multiome_rna_counts.mtx').T
# atac_metadata is defined outside this snippet
adata.obs = atac_metadata
cell_data_raw = pd.read_csv('/data/CAREHF/multiome_samples.txt')
adata.obs_names = cell_data_raw['x']

gene_names = pd.read_csv('/data/CAREHF/multiome_rna_features.txt')
adata.var_names = gene_names['x']

adata = adata.T

import dill

menr = dill.load(open('/data/scenic_demo/carehf/motifs/menr.pkl', 'rb'))

from scenicplus.scenicplus_class import create_SCENICPLUS_object
import numpy as np

scplus_obj = create_SCENICPLUS_object(
    GEX_anndata = adata,
    cisTopic_obj = cistopic_obj,
    menr = menr,
    key_to_group_by = 'predicted.celltype_fromRNA',
    multi_ome_mode = True,
    bc_transform_func = lambda x: x+'___cisTopic'
)

from scenicplus.preprocessing.filtering import *

filter_genes(scplus_obj, min_pct = 0.5)
filter_regions(scplus_obj, min_pct = 0.5)

from scenicplus.cistromes import *
merge_cistromes(scplus_obj)

from scenicplus.enhancer_to_gene import get_search_space, calculate_regions_to_genes_relationships, GBM_KWARGS

get_search_space(scplus_obj, biomart_host = 'http://www.ensembl.org', species = 'hsapiens', assembly = 'hg38', upstream = [1000, 150000], downstream = [1000, 150000])

calculate_regions_to_genes_relationships(scplus_obj, ray_n_cpu = 20, _temp_dir = tmp_dir, importance_scoring_method = 'GBM', importance_scoring_kwargs = GBM_KWARGS)

with open('/data/scenic_demo/carehf/scplus_obj.pkl', 'wb') as f:
    pickle.dump(scplus_obj, f)

from scenicplus.TF_to_gene import *
tf_file = '/data/scenic_demo/allTFs_hg38.txt'

calculate_TFs_to_genes_relationships(scplus_obj, tf_file = tf_file, ray_n_cpu = 20, method = 'GBM', _temp_dir = tmp_dir, key= 'TF2G_adj')

Version (please complete the following information):

Python: 3.8.13
SCENIC+: 0.1.dev437+ga57717f

Additional context

Perhaps this is related to creating the scenic object from a .mtx file rather than an AnnData object? However, prior to TF-to-gene inference, the scenic object looks fine:

scplus_obj
SCENIC+ object with n_cells x n_genes = 25945 x 18892 and n_cells x n_regions = 25945 x 154930

Anyway, thanks for the work, excited to see the results...

decarlin commented 2 years ago

OK, I figured this out. scplus_obj.X_EXP was a <class 'scipy.sparse._csr.csr_matrix'>, whereas calculate_TFs_to_genes_relationships() was expecting a dense ndarray. So I solved this with:

scplus_obj.X_EXP=scplus_obj.X_EXP.toarray()

You may want to support sparse matrices for the RNA-seq data.
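The workaround above can be generalized into a small guard that densifies only when the input is sparse. The following is a minimal sketch, not SCENIC+ code: the helper name to_dense_expression and the toy matrix standing in for scplus_obj.X_EXP are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def to_dense_expression(x):
    """Return a dense ndarray for a SciPy sparse matrix or any array-like (hypothetical helper)."""
    if sparse.issparse(x):
        return x.toarray()
    return np.asarray(x)


# Toy stand-in for scplus_obj.X_EXP: 4 cells x 3 genes stored as CSR.
x_exp = sparse.csr_matrix(
    np.array([[0, 1, 0], [2, 0, 0], [0, 0, 3], [0, 4, 0]], dtype=np.float32)
)
cell_names = ["cell_0", "cell_1", "cell_2", "cell_3"]
gene_names = ["gene_0", "gene_1", "gene_2"]

# With a dense array, the values and the index/columns agree in shape.
ex_matrix = pd.DataFrame(
    to_dense_expression(x_exp), index=cell_names, columns=gene_names
)
print(ex_matrix.shape)
```

Because the helper passes dense input through unchanged, it can be applied unconditionally before building the DataFrame.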

SeppeDeWinter commented 2 years ago

Hi Decarlin

You are right. At this moment sparse matrices aren't fully supported yet. I'll mark this issue as an enhancement.

Best,

S
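Until sparse input is fully supported, densifying is the practical route, and its memory cost is easy to estimate beforehand. Below is a rough sketch for the matrix size reported in this issue (25945 cells x 18892 genes). The helper dense_bytes is hypothetical, and the estimate ignores any overhead from pandas indices or temporary copies.

```python
# Dimensions reported for the SCENIC+ object in this issue.
n_cells, n_genes = 25945, 18892


def dense_bytes(n_rows, n_cols, itemsize):
    """Bytes needed for a dense n_rows x n_cols array (hypothetical helper)."""
    return n_rows * n_cols * itemsize


# Compare the two most common floating-point dtypes.
for dtype_name, itemsize in (("float32", 4), ("float64", 8)):
    gib = dense_bytes(n_cells, n_genes, itemsize) / 2**30
    print(f"{dtype_name}: {gib:.2f} GiB")
```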

cbravo93 commented 2 years ago

Hi @decarlin !

It should work with sparse matrices as well; we have this conversion step: https://github.com/aertslab/scenicplus/blob/main/src/scenicplus/TF_to_gene.py (lines 288-297). Anyway, happy you solved it :)! We will also open an issue on arboreto so that it accepts sparse matrices directly.

Cheers!

C
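For reference, pandas can also wrap a SciPy sparse matrix in a labeled DataFrame without densifying it, via pd.DataFrame.sparse.from_spmatrix. The sketch below uses a toy matrix in place of scplus_obj.X_EXP. Whether downstream inference code (for example arboreto) accepts such a sparse-backed DataFrame is not confirmed in this thread, so treat this as an illustration only.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-in for scplus_obj.X_EXP: 4 cells x 3 genes stored as CSR.
x_exp = sparse.csr_matrix(
    np.array([[0, 1, 0], [2, 0, 0], [0, 0, 3], [0, 4, 0]], dtype=np.float32)
)
cell_names = ["cell_0", "cell_1", "cell_2", "cell_3"]
gene_names = ["gene_0", "gene_1", "gene_2"]

# Build a DataFrame whose columns are pandas SparseArrays; no dense copy is made here.
sparse_df = pd.DataFrame.sparse.from_spmatrix(
    x_exp, index=cell_names, columns=gene_names
)
print(sparse_df.shape)

# Densify explicitly only at the point where a dense array is required.
dense_values = sparse_df.sparse.to_dense().to_numpy()
```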

aertslab / scenicplus, issue #34