MarioniLab / miloDE

Framework for sensitive DE testing (using neighbourhoods)
Other
57 stars 2 forks source link

DE in the presence of DA #35

Closed gdagstn closed 5 months ago

gdagstn commented 7 months ago

Hi, first of all congratulations!

I was wondering what would be a good way to tackle what is - as you say in the manuscript - a philosophical conundrum in practical terms.

Suppose we have two conditions A and B, and a population P that was identified in a joint embedding of A and B datasets. My goal is to understand the differences in both cell type composition and transcriptional regulation of B vs A; in simpler terms I just want to know "what happens and where".

Indeed, a miloR analysis confirms that there is consistent (e.g. > 50% of the P neighbourhoods with a |logFC| > 1 at spatial FDR < 0.05) DA between A and B. Now we run miloDE on these neighbourhoods and we find, as expected, several interesting DEGs in many neighbourhoods of P, plus several others in neighbourhoods of other populations Q, R, S which display no significant DA.

Even though the DE testing regime accounts for total sample size (I presume by including per-sample neighbourhood size factors in the model), I believe that the sparsity of scRNA-seq makes it difficult overcome what I would say is a "confounding effect" of DA over DE. In other words, when I look at DE in P neighbourhoods, is this (also) due to differences in cell abundance, or is it actual transcriptional regulation? This is less controversial in Q, R, S neighbourhoods because we know there is no DA, but for P neighbourhoods it seems to me that, even with size factors included in the model, it is hard to disentangle the two effects (at the end of the day, aggregating sparse counts from 10 cells vs 30 can make a big difference unless we only look at very highly expressed genes). I say all this keeping in mind that DE and DA are two faces of the same coin, since variable genes are the ones that created the embedding in the first place, so I appreciate how this is not a clear-cut issue. Also, I believe that you did look at something similar in the IPF analysis in your manuscript by binning neighbourhoods on their DA and looking at DE distributions in each bin.

I attach a screenshot of my actual data, with a gene that to me seems to be confounded:

Screenshot 2024-01-17 at 5 00 09 PM

One simple way of dealing with this, which I'm not entirely convinced of, is to crudely subtract the DA fold change from the DE fold change and call whatever is left the actual transcriptional regulation. Another simple way would be to include DA in the model - but I'm not sure it would not over-correct (also it correlates with the size factors). A third way could be to use something like variancePartition to decompose DA- and DE-associated variance, and rank genes based on that, bearing in mind that the fold changes would still be the ones from the original DE test. A fourth way is to downsample the data so that A and B have comparable numbers and run the DE test again.

Am I missing something here, or did I entirely misunderstand the approach? Does miloDE actually manage to "overcome" DA effects in its model, even when logFC in both tests are similar for a particular neighbourhood?

Sorry for the lengthy post and thanks again for developing this really cool tool!

amissarova commented 7 months ago

Hi @gdagstn , thats a great point of discussion which doesn't have a clear cut answer, but i can suggest few points that might help to interpret the results.

  1. Just to make sure we are on the same page. milo is looking for DA and comparing different neighbourhoods between each other, whereas miloDE is looking for DE within each neighbourhood separately.
  2. When DE is performed within neighbourhood, between conditions, vastly different number of cells per condition can in fact affect testing. As its done in 'original' edgeR or well any pseudo-bulk method, we use offset factors to control for that (which is total sum counts across all cells per sample/condition - which of course scales with number of cells) - so in theory, different number of cells (ie DA) shouldn't affect DE detection and specificity (since we normalise by offset factors), however, in practice, if numbers of cells differ very highly, we see occasionally some technically driven FNs (what I would personally call them) - as in if number of cells between conditions was more comparable, they wouldnt have be detected as DE, but they do in this bias setting. Therefore DE patterns you see in highly DA neighbourhoods are potentially technical artifacts, and I would highly recommend to investigate those in depth.

But, with this being said, there is a second important point inho:

  1. That I believe realistically it is very plausible (quite often), that some genes would be truly DE in those DA neighbourhoods (and not due to imbalance in cell numbers) - and that is because if you see DA neighbourhhods, it might indicate that some process/gene regulation is happening in those neighbourhoods, that force them to 'switch' the identity from those neighbourhoods to some transcriptionally neighbourhoods nearby. More likely (but not really necessarily tho) those genes would likely be markers of either immediate neighborhoods where you test them or of the neighbourhoods nearby. Therefore I would personally suggest that DE genes you detect in DA neighbourhoods might be very interesting candidates to identify what is happening in those neighbourhoods, and I wouldnt discard those DE results from DA neighbourhoods as immediately uninteresting technical artifacts.
  2. An unfortunate reality that I personally dont have a quick and systematic way to distinguish between technical FNs and real DE genes so whenever you encounter DE in DA neighbourhoods - i would recommend to investigate those in a rather supervised manner.
  3. A little of note, but I would also suggest that whenever you see DA somewhere , especially if you are dealing with continuous trajectories, focus on neighbourhoods that are 'upstream' of those DA neighbourhoods - as in what is happening in the neighbourhoods DE wise that potentially led to DA down the road (see mouse Tal1 chimera chpater in the paper, for example). That is sort of echoing point number 2, that DA sometimes is caused by DE near by.
  4. Another interesting analysis you can do, which is actually somewhat incorporated in milo manuscript/package, is to look for DE between DA neighbourhoods (i would tho subset only for healthy cells, cause otherwise what you are detecting can be in fact dominated by DE).

Since your question is a rather big topic, I dont know if I really hitting the bulls eye with my comments as in Im not sure if thats what you are asking about exactly? Pls let me know, if something is unclear, or you have more specific follow up questions. To reiterate, what you suggested re accounting for DA we do with offset/normalisation factors (total sum counts per sample/condition) - so i believe if that is you concern, we already account for this (well, as well as its accounted for in classic pseudo-bulk approaches that also have to deal with different number of cells).

P.s. I very much like your cartoon with contours - how do you do it? Which package do you use for that?

amissarova commented 7 months ago

if you want to read about this more elsewhere, it is called library size normalisation: https://support.bioconductor.org/p/79379/

amissarova commented 5 months ago

Please reopen the issue if you have additional questions