LungCellAtlas / HLCA_reproducibility

This repository contains all code used for the Human Lung Cell Atlas project.

Should scran standardization be run separately on different data sets? #16

Closed · zhongzheng1999 closed this issue 5 months ago

zhongzheng1999 commented 6 months ago

You've done an impressive job, and this will significantly benefit everyone's future work. After examining your data processing procedure, I'm curious why you chose not to perform SCRAN normalization separately for each dataset, but instead normalized all datasets jointly. Many articles actually run SCRAN normalization on each dataset individually to account for batch effects (Nature Medicine PMID: 35618837). I believe the rationale behind this separate normalization is that SCRAN computes normalization factors based on a clustering step, and batch effects can strongly influence that clustering.
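
For reference, below is a minimal sketch of the clustering-dependent step being discussed, following the common scanpy + rpy2 pattern for scran pooling-based size factors. This is not the HLCA code itself; the function name `scran_size_factors` and all parameter values (clustering resolution, `min.mean`, etc.) are illustrative choices.

```python
import numpy as np
import scanpy as sc
import anndata2ri
import rpy2.robjects as ro

anndata2ri.activate()  # let rpy2 convert numpy/sparse/pandas objects to R
ro.r("suppressPackageStartupMessages({library(scran); library(SingleCellExperiment)})")


def scran_size_factors(adata, resolution=0.5, min_mean=0.1):
    """Pooling-based scran size factors for one AnnData holding raw counts."""
    # Step 1: coarse clustering on a quickly log-normalized copy; these
    # clusters define the pools that computeSumFactors sums counts over.
    adata_pp = adata.copy()
    sc.pp.normalize_total(adata_pp, target_sum=1e6)
    sc.pp.log1p(adata_pp)
    sc.pp.pca(adata_pp, n_comps=15)
    sc.pp.neighbors(adata_pp)
    sc.tl.leiden(adata_pp, key_added="groups", resolution=resolution)

    # Step 2: hand the raw counts (genes x cells) and cluster labels to scran.
    ro.globalenv["data_mat"] = adata.X.T
    ro.globalenv["input_groups"] = ro.StrVector(adata_pp.obs["groups"].astype(str))
    size_factors = ro.r(
        "sizeFactors(computeSumFactors("
        "SingleCellExperiment(list(counts=data_mat)), "
        f"clusters=input_groups, min.mean={min_mean}))"
    )
    return np.asarray(size_factors)


# Usage (illustrative): divide raw counts by per-cell factors, then log-transform.
# For a sparse .X you may prefer scipy.sparse.diags(1 / sf) @ adata.X to keep it sparse.
# adata.obs["size_factors"] = scran_size_factors(adata)
# adata.X = adata.X / adata.obs["size_factors"].values[:, None]
# sc.pp.log1p(adata)
```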

zhongzheng1999 commented 6 months ago

In fact, in scIB (Nature Methods PMID: 34949812), the different datasets are normalized separately.

LisaSikkema commented 5 months ago

Hi @zhongzheng1999 , thank you for your kind words! I think the per-dataset normalization is actually done for pragmatic reasons: SCRAN is quite slow and almost impossible to run on very large datasets. If I remember correctly, we needed >300 GB of RAM just to run it on the HLCA core (~0.5M cells and around 30k genes). Running it on individual datasets instead greatly reduces the memory needed. I think that's also why Malte and others did it that way for scIB, but maybe @LuckyMD can comment?

As to the batch-effect issue you raise: indeed, SCRAN has an option to compute size factors within clusters separately and then normalize between those clusters afterwards (if I remember correctly), and this is the option we used for the HLCA. I would actually say this makes SCRAN less sensitive to batch effects, as batches tend to cluster separately and so are initially normalized separately, similar to running SCRAN on individual datasets.

I have always been a bit hesitant to run SCRAN completely separately on different datasets, as there is no "harmonization" between datasets afterwards. Malte always argued that batch correction should be able to account for and correct that, which might be true, but I was never sure. Hence we decided to run it on all datasets simultaneously for the HLCA core, and we didn't run it at all for the extended HLCA (for which we used good old total-counts normalization). Curious to hear your thoughts!
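
For concreteness, a sketch of what this joint, cluster-aware run could look like, reusing the illustrative `scran_size_factors` helper from the earlier comment. `adata_all` (raw counts for all datasets concatenated) is an assumed name, and this is a sketch rather than the actual HLCA core pipeline.

```python
# Joint run: one coarse clustering over all datasets together. Because those
# clusters tend to split along dataset as well as cell type, computeSumFactors
# pools cells largely within a dataset and then rescales the pools against each
# other, which is why the cluster option behaves a bit like per-dataset
# normalization followed by a between-cluster harmonization step.
adata_all.obs["size_factors"] = scran_size_factors(adata_all)
adata_all.X = adata_all.X / adata_all.obs["size_factors"].values[:, None]
sc.pp.log1p(adata_all)
```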

zhongzheng1999 commented 5 months ago

Hi @LisaSikkema, thank you for your reply! As you mentioned, my computational resources can't handle running SCRAN on the full integrated dataset either. So I'm currently attempting to partition the dataset into batches, run SCRAN separately on each batch, and follow up with batch correction. The good old total-counts normalization could also be an option.
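
A rough sketch of that per-batch route, again reusing the illustrative `scran_size_factors` helper from above; the `"batch"` column and all other names are assumptions, and the total-counts fallback is shown at the end.

```python
import numpy as np
import scanpy as sc

adata.obs["size_factors"] = np.nan
for batch in adata.obs["batch"].unique():
    mask = (adata.obs["batch"] == batch).values
    # Each call centers its size factors to a mean of 1 within that batch, so
    # any global scale difference between batches remains and is left for the
    # downstream batch-correction step to absorb.
    adata.obs.loc[mask, "size_factors"] = scran_size_factors(adata[mask].copy())

adata.X = adata.X / adata.obs["size_factors"].values[:, None]
sc.pp.log1p(adata)
# ...then batch correction (e.g. scVI/scANVI or Harmony) on the normalized data.

# The simpler "good old" total-counts alternative:
# sc.pp.normalize_total(adata, target_sum=1e4)
# sc.pp.log1p(adata)
```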