JinmiaoChenLab / Batch-effect-removal-benchmarking

A benchmark of batch-effect correction methods for single-cell RNA sequencing data

Problems about your evaluation for Scanorama on recovery of DEGs #14

Open XiHuYan opened 2 years ago


In your script (https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking/blob/master/Script/simulation/02_run/run_scanorama.ipynb)

you set the gene names of "corrected_adata", the object that holds the integration results from Scanorama, with corrected_adata.var_names = adata.var_names. Here "adata" is the object as it was before being passed to Scanorama.

However, after reading the source code of Scanorama (https://github.com/brianhie/scanorama/blob/master/scanorama/scanorama.py, Line 316, function merge_datasets), I found that Scanorama sorts the gene names it is given, which means:

Given your input gene names adata.var_names = ('Gene1', 'Gene2', …, 'Gene5000') and data matrix adata.X = [x1, x2, …, x5000], Scanorama reorders the gene names and the columns of the data matrix together. The returned object therefore has corrected_adata.var_names = ('Gene1', 'Gene10', 'Gene100', …, 'Gene999') and corrected_adata.X = [x1, x10, x100, …, x999]. Both the gene names and the data matrix come back in this sorted order, not in the input order.
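The reordering can be reproduced without Scanorama. This is only a small sketch: it assumes the sort is an ordinary lexicographic string sort, and the 12 gene names and placeholder values are made up for illustration.

```python
# Hypothetical toy input: 12 genes named Gene1..Gene12, one placeholder value per gene.
gene_names = [f"Gene{i}" for i in range(1, 13)]
values = [f"x{i}" for i in range(1, 13)]

# Sort the names as strings and carry the values along, mimicking a
# gene-wise sort that reorders names and matrix columns together.
order = sorted(range(len(gene_names)), key=lambda i: gene_names[i])
sorted_names = [gene_names[i] for i in order]
sorted_values = [values[i] for i in order]

print(sorted_names[:5])   # names in lexicographic, not numeric, order
print(sorted_values[:5])  # values follow their own gene names
```

In this toy case 'Gene10' sorts before 'Gene2', so any code that assumes the original numeric order after the sort will attach the wrong name to each column.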

Thus, after running your line corrected_adata.var_names = adata.var_names, you get:

  1. corrected_adata.var_names=('Gene1', 'Gene2', …, 'Gene5000')
  2. corrected_adata.X = [x1, x10, x100,…,x999]

The gene names are therefore mismatched with the data, so the subsequent evaluation of differentially expressed genes (DEGs) is invalid. After correcting this bug, I found that Scanorama achieved state-of-the-art performance on DEG recovery.
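One possible way to avoid the mismatch is sketched below with plain Python lists rather than AnnData objects. The idea is to keep the gene order returned by the integration step as the labels of the corrected matrix, and then reorder the columns by name if the original order is needed. The helper reorder_columns and the toy numbers are hypothetical and are not taken from the benchmark scripts or from Scanorama.

```python
def reorder_columns(matrix, current_names, target_names):
    """Rearrange the columns of `matrix` from `current_names` order to `target_names` order."""
    position = {name: j for j, name in enumerate(current_names)}
    idx = [position[name] for name in target_names]
    return [[row[j] for j in idx] for row in matrix]


# Hypothetical toy data: three genes in their original input order.
original_names = ["Gene1", "Gene2", "Gene10"]

# Stand-in for the integration output: names and columns in sorted order.
returned_names = sorted(original_names)  # ['Gene1', 'Gene10', 'Gene2']
corrected = [
    [1.0, 10.0, 2.0],
    [1.5, 10.5, 2.5],
]

# Match columns by name instead of overwriting the labels by position.
restored = reorder_columns(corrected, returned_names, original_names)
print(restored)
```

Matching by name rather than by position means the result stays correct whether or not the integration step reorders the genes.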