berenslab / umi-normalization

Companion repository to Lause, Berens & Kobak (2021): "Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data", Genome Biology
https://doi.org/10.1186/s13059-021-02451-7
GNU Affero General Public License v3.0

Handling batch effects #2

Closed: pavlin-policar closed this issue 3 years ago

pavlin-policar commented 3 years ago

Hey, how would you recommend handling multiple batches using this normalization scheme? In the past, I've used scTransform, which puts the batch variable into the model itself, so I guess it kind of regresses out the batch effect. It's worked very well for me. This normalization scheme is simpler and doesn't account for batch effects in the model, so I'm wondering how you recommend dealing with them.

In your paper, in the Cao figure, I noticed you identified batch-specific genes and simply removed them from the dataset. I'm unsure about this: isn't it possible that these genes are biologically relevant? I also noticed that in a comment on your scanpy PR you mention applying this normalization to each batch separately and then concatenating the results. Wouldn't this also be problematic? For instance, suppose I have two batches with different cell populations, and one gene is never expressed in the first batch: its residuals there will always be zero, since it never deviates from the fitted mean. If the same gene is uniformly expressed in the second batch, its residuals will again be close to zero, since the model mean can fit the batch-wide expression, so the gene's signal would be lost entirely.
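To make this concrete, here is a toy numpy sketch I put together (my own construction, using the analytic residual formula from your paper with theta = 100; the data are made up):

```python
import numpy as np

def pearson_residuals(X, theta=100):
    # Analytic Pearson residuals (Lause et al. 2021): mu_cg = n_c * m_g / n
    n_c = X.sum(axis=1, keepdims=True)   # per-cell sequencing depth
    m_g = X.sum(axis=0, keepdims=True)   # per-gene total count
    mu = n_c * m_g / X.sum()
    with np.errstate(invalid="ignore"):  # an all-zero gene gives 0/0
        z = (X - mu) / np.sqrt(mu + mu**2 / theta)
    return np.nan_to_num(z)              # define residuals of all-zero genes as 0

rng = np.random.default_rng(0)
depth = rng.poisson(50, size=(200, 1)) + 1
background = rng.poisson(depth * np.full(10, 0.1))  # 10 unremarkable genes

# Gene 0 is silent in batch 1 and expressed in every cell of batch 2
b1 = np.hstack([np.zeros((100, 1)), background[:100]])
b2 = np.hstack([rng.poisson(depth[100:] * 0.3), background[100:]])

z_separate = np.vstack([pearson_residuals(b1), pearson_residuals(b2)])
z_joint = pearson_residuals(np.vstack([b1, b2]))

# Per-batch normalization leaves no trace of gene 0 in either batch,
# while the joint fit clearly separates the batches on this gene:
print(z_separate[:100, 0].mean(), z_separate[100:, 0].mean())  # ~0 and ~0
print(z_joint[:100, 0].mean(), z_joint[100:, 0].mean())        # negative, positive
```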

I'd love to get your feedback regarding this.

jlause commented 3 years ago

Hey @pavlin-policar,

Thanks for your thoughts and questions!

How would you recommend handling multiple batches using this normalization scheme? In the past, I've used scTransform, which puts the batch variable into the model itself, so I guess it kind of regresses out the batch effect. It's worked very well for me. [..]

I've quickly checked scTransform and its batch integration (which I have not used before myself). Seems like putting in the batch_var can only account for simple mean-shift batch effects. Christoph Hafemeister wrote about that e.g. here: https://github.com/ChristophH/sctransform/issues/41#issuecomment-536461604, and he only recommends it for cases where the cell type composition is roughly similar between the batches.

I also noticed that in a comment on your scanpy PR you mention applying this normalization to each batch separately and then concatenating the results. Wouldn't this also be problematic? For instance, suppose I have two batches with different cell populations, and one gene is never expressed in the first batch: its residuals there will always be zero, since it never deviates from the fitted mean. If the same gene is uniformly expressed in the second batch, its residuals will again be close to zero, since the model mean can fit the batch-wide expression, so the gene's signal would be lost entirely.

Yes, I think what you describe is what would happen here. But I assume something similarly problematic would happen with the batch_var correction in scTransform: the model would mistake the marker gene that distinguishes the two populations for a batch-effect gene and model both batches with a different mean, which should also result in low residuals in the end. This also fits Christoph's recommendation not to use the batch_var method for batches that differ in cell type composition and that are therefore expected to carry biologically relevant differences!

To be honest, we have not thought much about this (yet), and we have not implemented or benchmarked any batch-corrected version of Pearson residuals as a data transform. We did include a batch correction in the (upcoming) scanpy implementation of Pearson residuals for HVG selection. There, we simply followed scanpy's default batch handling, which (I think) essentially makes the same assumption as scTransform's batch_var: that all batch-to-batch variation is purely technical and should be removed.
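For concreteness, here is roughly how that would be called once the PR is merged (a sketch assuming scanpy's experimental API; module path and argument names might still change):

```python
import numpy as np
import scanpy as sc
from anndata import AnnData

# Toy AnnData with raw UMI counts and a batch annotation
rng = np.random.default_rng(0)
adata = AnnData(rng.poisson(1.0, size=(200, 500)).astype(np.float32))
adata.obs["batch"] = np.repeat(["batch1", "batch2"], 100)

# HVG selection on Pearson residuals; with batch_key, residual variances
# are computed within each batch and then combined across batches
sc.experimental.pp.highly_variable_genes(
    adata,
    flavor="pearson_residuals",
    n_top_genes=100,
    batch_key="batch",
)
print(adata.var["highly_variable"].sum())  # 100 selected genes
```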

In your paper, in the Cao figure, I noticed you identified batch-specific genes and simply removed them from the dataset. I'm unsure about this: isn't it possible that these genes are biologically relevant?

I think that in this particular case this is rather unlikely: the batches were individual embryos, and each of them should have a comparable cell composition. Additionally, the vast majority of genes we excluded showed a strange, embryo-specific pattern (see the figure below, from our Cao analysis notebook here).

[Figure: embryo-specific expression patterns of the excluded genes, from the Cao analysis notebook]

We also communicated with the original authors, who mentioned that the batch effect could be caused by alignment problems (the embryo-specific part of a UMI tag being too similar to the sequence of an existing gene, causing spuriously high counts of certain genes in single embryos, as we observed). Of course, this pattern could be superimposed on a weaker biological signal, and I agree that recovering it might require more sophisticated integration.

Overall, I want to stress that we were not after a general solution for batch effects in this analysis of the Cao data: the point was rather to demonstrate that Pearson residuals were the only method able to discover such a weird batch effect in the data, allowing us to investigate and remove it. One could have used other, more sophisticated methods for removal instead, and I agree there will be cases where simply removing the affected genes is too harsh. To us, however, the important part was showing that this batch effect only became visible with Pearson residuals.

Let me know what you think!

pavlin-policar commented 3 years ago

Thanks for your very thorough answer!

I've quickly checked scTransform and its batch integration (which I have not used before myself). Seems like putting in the batch_var can only account for simple mean-shift batch effects. Christoph Hafemeister wrote about that e.g. here: ChristophH/sctransform#41 (comment), and he only recommends it for cases where the cell type composition is roughly similar between the batches.

That makes sense. I suspected this, but never actually dug into it. My understanding is that they add a batch term to

$$\ln \mu_{cg} = \beta_{g0} + \beta_{g1} \ln n_c,$$

so I suppose this term is different for every gene. Or is this term shared across all genes?

Still, isn't this approach better than simply running the normalization on each batch separately, then concatenating the results? You'd be using more data to estimate the mean, which should alleviate overfitting (but I guess overfitting isn't an issue for this model). But then again, if you added a batch term, would the model still have an analytical solution?

Regarding the Cao dataset -- I thought as much. Did you have a chance to see if scTransform is also able to show this batch effect? That would probably be extremely slow to compute, but still, it would be interesting, even on a smaller subset.

In any case, the data I'm working with has a weak batch effect, and the composition is very similar across batches. That's probably why scTransform worked so well. Unfortunately, I haven't been able to find a good Python implementation of scTransform, and from what I can tell, this method works equally well. Or at least the results are usually very similar to scTransform's. I remember reading this somewhere... However, if I do nothing about the batch effect, it remains noticeable, so I want to get rid of it in some way. I suppose I'll try the concatenation approach, since that seems to be the most straightforward thing to implement.
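Probably something like this (a sketch against the experimental API from your scanpy PR, i.e. sc.experimental.pp.normalize_pearson_residuals; I haven't checked it against a released version):

```python
import numpy as np
import scanpy as sc
import anndata as ad

# Toy AnnData with raw UMI counts and a batch annotation
rng = np.random.default_rng(0)
adata = ad.AnnData(rng.poisson(1.0, size=(200, 300)).astype(np.float32))
adata.obs["batch"] = np.repeat(["a", "b"], 100)

# Normalize each batch separately, then concatenate the residuals
parts = []
for b in adata.obs["batch"].unique():
    sub = adata[adata.obs["batch"] == b].copy()
    sc.experimental.pp.normalize_pearson_residuals(sub)  # overwrites sub.X
    parts.append(sub)
adata_norm = ad.concat(parts)  # same genes, per-batch residuals in .X
```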

dkobak commented 3 years ago

My feeling is that using gene-specific batch offsets should be equivalent to computing Pearson residuals on each batch separately and then concatenating the results...

pavlin-policar commented 3 years ago

I'm thinking of this case: we could fake multiple batches. For instance, take a dataset with 1000 cells and randomly assign the cells to 10 batches of 100 cells each. Wouldn't doing the estimation on the dataset as a whole lead to more robust estimates and less noisy residuals than considering each of these 100-cell "batches" separately?

dkobak commented 3 years ago

In our formulation, the model has only one parameter (offset) per gene. So if you have 10 batches and allow batch-specific offsets (like apparently scTransform does), then you will end up with 10 parameters per gene if analyzing the data together. That's exactly the same as analyzing 10 batches separately.
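Spelled out, since our $\hat{\mu}_{cg} = n_c m_g / n$ is the Poisson maximum-likelihood estimate of the offset model (a quick derivation, not spelled out in the paper): allowing batch-specific gene intercepts $\beta_{gb}$, the likelihood factorizes over batches, and within batch $b$

$$\ln \hat{\mu}_{cg} = \hat{\beta}_{g,b(c)} + \ln n_c, \qquad e^{\hat{\beta}_{gb}} = \frac{\sum_{c \in b} X_{cg}}{\sum_{c \in b} n_c} = \frac{m_{gb}}{n_b},$$

where $m_{gb}$ is gene $g$'s total count in batch $b$ and $n_b = \sum_{c \in b} n_c$. Hence $\hat{\mu}_{cg} = n_c \, m_{g,b(c)} / n_{b(c)}$, which is exactly the mean you get when fitting each batch on its own. It also answers the earlier question: the model keeps its analytic solution with a batch term.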

pavlin-policar commented 3 years ago

I guess that makes sense. I'll have to think about it a bit more. Thanks for the clarifications!