In your script (https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking/blob/master/Script/simulation/02_run/run_scanorama.ipynb) you used

corrected_adata.var_names = adata.var_names

to update the gene names in the "corrected_adata" object, which stores the integration results from Scanorama; "adata" is the object before it is passed to Scanorama.

However, after reading the source code of Scanorama (https://github.com/brianhie/scanorama/blob/master/scanorama/scanorama.py, Line 316, function merge_datasets), I found that Scanorama sorts the gene names passed to it. This means:

Given your input gene names adata.var_names = ('Gene1', 'Gene2', …, 'Gene5000') and data matrix adata.X = [x1, x2, …, x5000], Scanorama reorders both the gene names and the columns of the data matrix, so that corrected_adata.var_names = ('Gene1', 'Gene10', 'Gene100', …, 'Gene999') and corrected_adata.X = [x1, x10, x100, …, x999]. The returned gene names and data matrix are both in this altered order.

Thus, after running your code corrected_adata.var_names = adata.var_names, you get:

corrected_adata.var_names = ('Gene1', 'Gene2', …, 'Gene5000')
corrected_adata.X = [x1, x10, x100, …, x999]
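
The mismatch can be reproduced on a toy example in plain Python. This is only a sketch: it assumes the reordering is an ordinary lexicographic sort of the gene names, and the lists original_genes and original_row are hypothetical stand-ins for adata.var_names and one row of adata.X (12 genes instead of 5000).

```python
# Hypothetical stand-ins for adata.var_names and one row of adata.X.
original_genes = [f"Gene{i}" for i in range(1, 13)]   # Gene1 .. Gene12
original_row = [float(i) for i in range(1, 13)]       # value i belongs to Gene{i}

# Assumption: the integration step sorts gene names lexicographically
# and permutes the data columns with the same ordering.
order = sorted(range(len(original_genes)), key=lambda j: original_genes[j])
sorted_genes = [original_genes[j] for j in order]
sorted_row = [original_row[j] for j in order]

# Overwriting the sorted names with the original names mislabels the columns.
mislabeled = dict(zip(original_genes, sorted_row))

# Count how many genes now carry a value that belongs to a different gene.
n_wrong = sum(mislabeled[g] != v for g, v in zip(original_genes, original_row))

print(sorted_genes[:4])     # names in sorted order
print(mislabeled["Gene2"])  # value that "Gene2" now points to
print(n_wrong)              # number of mislabeled genes
```

In this toy case only 'Gene1' keeps its own value; for example, the label 'Gene2' ends up attached to the value that belongs to 'Gene10'.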
The gene names are therefore mismatched with the data, so the subsequent evaluation of differentially expressed genes (DEGs) will be completely wrong. After correcting this bug, I found that Scanorama achieved state-of-the-art performance on DEG recovery.
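
The exact correction is not spelled out here, so the following is only one possible approach, sketched in plain Python. It assumes the integration step hands back the gene list in the order it actually used; the simplest option is then to label the corrected matrix with that returned list instead of the original one. If the original column order is needed downstream, the columns can be permuted back, as the hypothetical helper restore_gene_order does below.

```python
def restore_gene_order(returned_genes, returned_rows, wanted_genes):
    """Reorder columns so that column k corresponds to wanted_genes[k].

    Hypothetical helper for illustration; not part of Scanorama or AnnData.
    """
    position = {gene: j for j, gene in enumerate(returned_genes)}
    missing = [g for g in wanted_genes if g not in position]
    if missing:
        raise KeyError(f"genes absent from the returned list: {missing[:5]}")
    cols = [position[g] for g in wanted_genes]
    return [[row[j] for j in cols] for row in returned_rows]


# Toy data: each value encodes the number of the gene it belongs to.
original_genes = [f"Gene{i}" for i in range(1, 13)]
returned_genes = sorted(original_genes)                    # assumed sorted output
returned_rows = [[float(g[4:]) for g in returned_genes]]   # columns follow returned_genes

restored_rows = restore_gene_order(returned_genes, returned_rows, original_genes)
print(restored_rows[0])
```

Matching columns by gene name, rather than by position, keeps the labels correct regardless of how the names were reordered.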