adata.raw overwritten with normalized counts in SCENIC+ pipeline

Hello,

First of all, thank you for developing such a powerful and intuitive tool.

Describe the bug According to the documentation, the RNA count input for the SCENIC+ pipeline should consist of raw counts, which are expected to be stored in the adata.raw slot. However, after following the tutorial's steps, I observed that the adata.raw slot seems to be overwritten with normalized counts instead of retaining the raw counts.

Here are the key details:

I confirmed that my raw count matrix was correctly loaded into the AnnData object initially.
After running the normalization steps as described in the tutorial, I noticed that adata.raw now contains the normalized data, not the raw counts.
This appears to contradict the documentation, which specifies that the adata.raw slot should contain raw counts and that these should be used as input for the SCENIC+ pipeline.

Additionally, I tested running the pipeline using both raw and normalized RNA counts, and the results were significantly different. The results generated using normalized counts seem more promising. Could you kindly clarify which input (raw or normalized counts) is appropriate for running the SCENIC+ pipeline? It would also be helpful to understand why one input should be preferred over the other and how this impacts the pipeline results.

I appreciate your guidance and look forward to your response. Thank you again for your continued efforts in developing and maintaining this tool.

To Reproduce

adata.raw = adata
print(adata.raw.X.max())
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
print(adata.raw.X.max())

Error output 1384.0 6.714874659931793

Expected behavior The expected is that adata.raw before and after normalization should be the same.

Screenshots

Version:

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]
Scanpy version: 1.8.2
SCENIC+ version: 1.0a1

First, thank you @SeppeDeWinter and Aertslab for your amazing tool!

I am wondering myself the same question. Also, I was analyzing my gene expression matrix data before and after performing Simulation Perturbation. What I have noticed is that before I run the Perturbation, the data I obtain is normalized data. I thought that for this analysis, raw data would also be included and that the function would internally normalize and/or transform it:

raw_data = adata.to_df()

gex_scenic = scplus_mdata["scRNA_counts"].to_df()

The perturbed matrices take this data as a basis for their simulation analysis, and the perturbed data, I think already appears normalized, like this:

simulation_scenic = perturbation_over_iter[5]

So, I am not sure if this (starting the analysis with normalized data) is significantly altering the Perturbation analysis by already using normalized data? Or is it fine to use it this way? I dont have big experience analyzing data, especially knockouts, but any information would be very useful. Thank you.

aertslab / scenicplus

adata.raw overwritten with normalized counts in SCENIC+ pipeline #484