aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.

Incorrect preparation of non-multiome data causes error in the step 'Calculating region to gene importance, using GBM method' #376

Closed AthanasiaSt closed 5 months ago

AthanasiaSt commented 5 months ago

Discussed in https://github.com/aertslab/scenicplus/discussions/371

Originally posted by **AthanasiaSt** April 25, 2024

Hello,

Thank you for developing scenicplus and for its helpful documentation. I am trying to use scenicplus on non-multiome data that I previously integrated with ArchR. Following the tutorials for scRNA and scATAC preprocessing, I reached the point of running scenicplus through the Snakefile, after adjusting the config.yaml file accordingly. Importantly, I made sure that the anndata and cistopic objects both contain a variable named 'ACC:RNA_barcodes' with the same cell_names, based on the integration of the two modalities (5956 cells in total).

The pipeline progresses smoothly until it reaches the region-to-gene importance calculation, where it fails with the following error:

```
2024-04-25 16:01:14,238 SCENIC+ INFO Reading search space
2024-04-25 16:01:14,741 R2G INFO Calculating region to gene importances, using GBM method
Running using 20 cores: 1%|▌ | 180/12438 [00:06<04:14, 48.10it/s]
Traceback (most recent call last):
  File "/home/astavropoulou/anaconda3/envs/scenicplus/bin/scenicplus", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/site-packages/scenicplus/cli/scenicplus.py", line 1137, in main
    args.func(args)
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/site-packages/scenicplus/cli/scenicplus.py", line 328, in TF_to_gene
    infer_region_to_gene(
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/site-packages/scenicplus/cli/commands.py", line 501, in infer_region_to_gene
    adj = calculate_regions_to_genes_relationships(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/site-packages/scenicplus/enhancer_to_gene.py", line 261, in calculate_regions_to_genes_relationships
    region_to_gene_importances = _score_regions_to_genes(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/site-packages/scenicplus/enhancer_to_gene.py", line 219, in _score_regions_to_genes
    joblib.Parallel(
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/site-packages/joblib/parallel.py", line 1098, in __call__
    self.retrieve()
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/site-packages/joblib/parallel.py", line 975, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/site-packages/joblib/_parallel_backends.py", line 567, in wrap_future_result
    return future.result(timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
ValueError: Found array with 0 sample(s) (shape=(0, 121)) while a minimum of 1 is required by GradientBoostingRegressor.
```

While trying to figure out what went wrong, I realized that the resulting ACC_GEX.h5mu file, which should contain the two modalities, is not prepared correctly: it seems to lack both the cell names and the expression/fragment matrices, as shown here:

```
MuData object with n_obs × n_vars = 0 × 316886 backed at '/home/astavropoulou/scenicplus_final/scplus_pipeline/Snakemake/ACC_GEX.h5mu'
  2 modalities
    scRNA: 0 x 16049
      obs: 'ACC:RNA_barcodes'
      var: 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
    scATAC: 0 x 300837
      obs: 'ACC:RNA_barcodes'
      var: 'Chromosome', 'Start', 'End', 'Width', 'cisTopic_nr_frag', 'cisTopic_log_nr_frag', 'cisTopic_nr_acc', 'cisTopic_log_nr_acc'
```

No error occurred during the step that prepares the non-multiome data, as all cells were found in both modalities, but the ingestion step prints a suspicious message:

```
[Thu Apr 25 15:56:12 2024]
Finished job 8.
2 of 14 steps (14%) done
Select jobs to execute...
2024-04-25 15:56:17,072 Ingesting non-multiome data INFO Automatically set `nr_metacells` to:
AKP_APC_AOMDSS_cnt_aom_AAACCCAGTCCACGCA-1: 0,
AKP_APC_AOMDSS_cnt_aom_AAACCCATCAAGCTTG-1: 0,
AKP_APC_AOMDSS_cnt_aom_AAACGAACACACCAGC-1: 0,
AKP_APC_AOMDSS_cnt_aom_AAACGAACATGACTGT-1: 0,
AKP_APC_AOMDSS_cnt_aom_AAACGAAGTCCCACGA-1: 0,
AKP_APC_AOMDSS_cnt_aom_AAACGAAGTGACGCCT-1: 0,
AKP_APC_AOMDSS_cnt_aom_AAACGAATCGTACCTC-1: 0,
AKP_APC_AOMDSS_cnt_aom_AAACGCTTCAAGTGGG-1: 0,
AKP_APC_AOMDSS_cnt_aom_AAAGGATAGCCTGCCA-1: 0,
AKP_APC_AOMDSS_cnt_aom_AAAGGATAGTACTCGT-1: 0,
AKP_APC_AOMDSS_cnt_aom_AAAGGATAGTTAACGA-1: 0,
AKP_APC_AOMDSS_cnt_aom_AAAGGGCAGCTGGCTC-1: 0,
AKP_APC_AOMDSS_cnt_aom_AAAGGGCAGGGAGAAT-1: 0....
```

Do the zeros next to the cell names mean that no metacells, and therefore no pseudo-multiome data, are created? I tried running the pipeline with slight modifications to the anndata and cistopic objects but could not figure out the problem. Do you have any idea why this happens?

Also, could you describe in more detail how to prepare non-multiome data for scenicplus? Should the anndata and cistopic objects have exactly the same cell_names in the corresponding matrices, or is a single variable/column shared between the two modalities enough for them to be combined?

Thank you!

I am using Python 3.11.8 and scenicplus 1.0a1.

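For anyone checking their own inputs before running the pipeline, here is a minimal sketch of one way to compare the shared column between the two modalities. It uses only pandas and assumes the per-cell metadata of each modality is available as a DataFrame (for example `adata.obs` for the RNA side, and whatever per-cell table your cisTopic object exposes for the ATAC side). The helper name `summarize_shared_key` and the toy tables are hypothetical and are not part of scenicplus. The sketch only reports how the column values line up and how many cells fall under each value; it does not say what scenicplus itself requires.

```python
import pandas as pd


def summarize_shared_key(rna_obs: pd.DataFrame, atac_obs: pd.DataFrame, key: str) -> dict:
    """Summarise how a shared per-cell column lines up between two modalities.

    Hypothetical helper for inspection only; not a scenicplus function.
    """
    for name, obs in (("RNA", rna_obs), ("ATAC", atac_obs)):
        if key not in obs.columns:
            raise KeyError(f"{key!r} is missing from the {name} per-cell table")

    # Number of cells carrying each distinct value of the shared column.
    rna_counts = rna_obs[key].value_counts()
    atac_counts = atac_obs[key].value_counts()

    return {
        "n_rna_cells": len(rna_obs),
        "n_atac_cells": len(atac_obs),
        # Cell names (index entries) present in both tables.
        "n_shared_index_names": len(rna_obs.index.intersection(atac_obs.index)),
        # Distinct column values present in both tables.
        "n_shared_key_values": len(rna_counts.index.intersection(atac_counts.index)),
        # Largest number of cells sharing one value, per modality.
        "max_cells_per_key_value_rna": int(rna_counts.max()) if len(rna_counts) else 0,
        "max_cells_per_key_value_atac": int(atac_counts.max()) if len(atac_counts) else 0,
    }


# Toy tables standing in for the real per-cell metadata (illustrative values only).
rna_obs = pd.DataFrame({"ACC:RNA_barcodes": ["c1", "c2", "c3"]}, index=["r1", "r2", "r3"])
atac_obs = pd.DataFrame({"ACC:RNA_barcodes": ["c1", "c2", "c4"]}, index=["a1", "a2", "a3"])

summary = summarize_shared_key(rna_obs, atac_obs, "ACC:RNA_barcodes")
print(summary)
```

With real data, a `max_cells_per_key_value_*` of 1 would mean every value of the column is unique to a single cell, which is worth comparing against the per-value `nr_metacells` message in the ingestion log.
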
SeppeDeWinter commented 5 months ago

Hi @AthanasiaSt

Please don't open duplicate issues :).

See my answer on: https://github.com/aertslab/scenicplus/discussions/371

Best,

Seppe