constantAmateur / SoupX

R package to quantify and remove cell free mRNAs from droplet based scRNA-seq data
249 stars 34 forks source link

Does SoupX normalize the matrix X? #107

Closed hyjforesight closed 2 years ago

hyjforesight commented 2 years ago

Hello @yihui and @gtca, Thanks for developing this amazing package. I'm wondering what SoupX does for the CellRanger-generated matrix files, because the sizes of SoupX outputs are different from CellRanger's. Moreover, scVelo finds that the adata X was normalized by SoupX, not raw data anymore, which cannot be used for RNA velocity analysis. Does SoupX normalize the matrix X? Thanks! Best, YJ

The sizes of SoupX outputs are much bigger than CellRanger's outputs. image

The adata X was normalized by SoupX, not raw data anymore. First, let's check the CellRanger raw data.

ACTWT = sc.read_10x_mtx(path='C:/Users/Park_Lab/Documents/filtered_feature_bc_matrix/', var_names='gene_symbols')    # load CellRanger matrix
ACTWT.write('C:/Users/Park_Lab/Documents/test.h5ad', compression='gzip')    # save as h5ad file, to eliminate the possibility that h5ad file normalizes adata X and affect the normalization by scVelo
test = sc.read('C:/Users/Park_Lab/Documents/Apc_Cracd_WT_Tumor/test.h5ad')
ldata = scv.read(filename='C:/Users/Park_Lab/Documents/Apc_Cracd_WT_Tumor/Apc_Cracd_WT_Tumor.loom')
test = scv.utils.merge(test, ldata)    # copy unspliced and spliced information to test file
scv.pp.filter_and_normalize(test, min_shared_counts=20, n_top_genes=2000)

Filtered out 22229 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.    # adata X is raw, scVelo can normalize X.
Extracted 2000 highly variable genes.

Then, let's check SoupX outputs.

ACTWT = sc.read_10x_mtx(path='C:/Users/Park_Lab/Documents/filtered_feature_bc_matrix_SoupX_out/', var_names='gene_symbols')    # load SoupX output matrix
ACTWT.write('C:/Users/Park_Lab/Documents/test.h5ad', compression='gzip')
test = sc.read('C:/Users/Park_Lab/Documents/Apc_Cracd_WT_Tumor/test.h5ad')
ldata = scv.read(filename='C:/Users/Park_Lab/Documents/Apc_Cracd_WT_Tumor/Apc_Cracd_WT_Tumor.loom')
test = scv.utils.merge(test, ldata)    # copy unspliced and spliced information to test file
scv.pp.filter_and_normalize(test, min_shared_counts=20, n_top_genes=2000)

Filtered out 22190 genes that are detected 20 counts (shared).
WARNING: Did not normalize X as it looks processed already. To enforce normalization, set `enforce=True`.
Normalized count data: spliced, unspliced.    # X has been normalized by SoupX, scVelo cannot do normalization.
Extracted 2000 highly variable genes.
constantAmateur commented 2 years ago

SoupX does not normalise the count matrix, but does result in non-integer counts which you can think of as representing probabilities that a count occurred (similar to the output from pseudo-aligners).

The file size is larger as more space is required to store a floating point number than an integer.

I suspect that scvelo guesses the data have been normalised when it sees non-integers. I would suggest setting roundToInt=TRUE when running adjustCounts which will produce an output count matrix that consists only of integers, which should not confuse downstream tools.