aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Other
186 stars 29 forks source link

adata.raw overwritten with normalized counts in SCENIC+ pipeline #484

Open cjiang310437 opened 1 month ago

cjiang310437 commented 1 month ago

Hello,

First of all, thank you for developing such a powerful and intuitive tool.

Describe the bug According to the documentation, the RNA count input for the SCENIC+ pipeline should consist of raw counts, which are expected to be stored in the adata.raw slot. However, after following the tutorial's steps, I observed that the adata.raw slot seems to be overwritten with normalized counts instead of retaining the raw counts.

Here are the key details:

Additionally, I tested running the pipeline using both raw and normalized RNA counts, and the results were significantly different. The results generated using normalized counts seem more promising. Could you kindly clarify which input (raw or normalized counts) is appropriate for running the SCENIC+ pipeline? It would also be helpful to understand why one input should be preferred over the other and how this impacts the pipeline results.

I appreciate your guidance and look forward to your response. Thank you again for your continued efforts in developing and maintaining this tool.

To Reproduce

adata.raw = adata
print(adata.raw.X.max())
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
print(adata.raw.X.max())

Error output 1384.0 6.714874659931793

Expected behavior The expected is that adata.raw before and after normalization should be the same.

Screenshots image

Version:

Neofita22 commented 1 month ago

First, thank you @SeppeDeWinter and Aertslab for your amazing tool!

I am wondering myself the same question. Also, I was analyzing my gene expression matrix data before and after performing Simulation Perturbation. What I have noticed is that before I run the Perturbation, the data I obtain is normalized data. I thought that for this analysis, raw data would also be included and that the function would internally normalize and/or transform it:

raw_data = adata.to_df()
scanpy_gex
gex_scenic = scplus_mdata["scRNA_counts"].to_df() 
base

The perturbed matrices take this data as a basis for their simulation analysis, and the perturbed data, I think already appears normalized, like this:

simulation_scenic = perturbation_over_iter[5]
simulation

So, I am not sure if this (starting the analysis with normalized data) is significantly altering the Perturbation analysis by already using normalized data? Or is it fine to use it this way? I dont have big experience analyzing data, especially knockouts, but any information would be very useful. Thank you.