Does SoupX normalize the matrix X?

Hello @yihui and @gtca, Thanks for developing this amazing package. I'm wondering what SoupX does for the CellRanger-generated matrix files, because the sizes of SoupX outputs are different from CellRanger's. Moreover, scVelo finds that the adata X was normalized by SoupX, not raw data anymore, which cannot be used for RNA velocity analysis. Does SoupX normalize the matrix X? Thanks! Best, YJ

The sizes of SoupX outputs are much bigger than CellRanger's outputs.

The adata X was normalized by SoupX, not raw data anymore. First, let's check the CellRanger raw data.

ACTWT = sc.read_10x_mtx(path='C:/Users/Park_Lab/Documents/filtered_feature_bc_matrix/', var_names='gene_symbols')    # load CellRanger matrix
ACTWT.write('C:/Users/Park_Lab/Documents/test.h5ad', compression='gzip')    # save as h5ad file, to eliminate the possibility that h5ad file normalizes adata X and affect the normalization by scVelo
test = sc.read('C:/Users/Park_Lab/Documents/Apc_Cracd_WT_Tumor/test.h5ad')
ldata = scv.read(filename='C:/Users/Park_Lab/Documents/Apc_Cracd_WT_Tumor/Apc_Cracd_WT_Tumor.loom')
test = scv.utils.merge(test, ldata)    # copy unspliced and spliced information to test file
scv.pp.filter_and_normalize(test, min_shared_counts=20, n_top_genes=2000)

Filtered out 22229 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.    # adata X is raw, scVelo can normalize X.
Extracted 2000 highly variable genes.

Then, let's check SoupX outputs.

ACTWT = sc.read_10x_mtx(path='C:/Users/Park_Lab/Documents/filtered_feature_bc_matrix_SoupX_out/', var_names='gene_symbols')    # load SoupX output matrix
ACTWT.write('C:/Users/Park_Lab/Documents/test.h5ad', compression='gzip')
test = sc.read('C:/Users/Park_Lab/Documents/Apc_Cracd_WT_Tumor/test.h5ad')
ldata = scv.read(filename='C:/Users/Park_Lab/Documents/Apc_Cracd_WT_Tumor/Apc_Cracd_WT_Tumor.loom')
test = scv.utils.merge(test, ldata)    # copy unspliced and spliced information to test file
scv.pp.filter_and_normalize(test, min_shared_counts=20, n_top_genes=2000)

Filtered out 22190 genes that are detected 20 counts (shared).
WARNING: Did not normalize X as it looks processed already. To enforce normalization, set `enforce=True`.
Normalized count data: spliced, unspliced.    # X has been normalized by SoupX, scVelo cannot do normalization.
Extracted 2000 highly variable genes.

constantAmateur / SoupX

Does SoupX normalize the matrix X? #107