findMarkers produces unexpectedly different results with passing different values to direction

MarioniLab / scran

Clone of the Bioconductor repository for the scran package.

https://bioconductor.org/packages/devel/bioc/html/scran.html

39 stars 23 forks source link

findMarkers produces unexpectedly different results with passing different values to direction #86

Closed wir963 closed 3 years ago

wir963 commented 3 years ago

Hey,

I am using scran v1.16.0 (installed from the conda-forge/bioconda channel) and noticed inconsistent behavior for the function findMarkers with block=plate and lfc=0.5 when using direction="up" compared to direction="any". For this example, I will assume test="wilcox" but note that I find similar results when using test="t". For genes that are up-regulated, the reported p-values are lower (more significant) and the AUCs are higher when using direction="any" compared to using direction="up".

To compare, I repeated this analysis using lfc=0 and did not find any unexpected differences between the results using direction="any" and direction="up" (the AUCs were identical and the p-values for up-regulated genes when using direction="up" are half of the p-values when using direction="any", which is the expected when comparing the results from a one-sided test to a two-sided test).

I further experimented and removed block=plate and still get unexpectedly different AUCs and p-values.

Can you help me understand what's going on here? Am I missing something important? I know that scran v1.16.0 is not the latest version but I'm in the end stages of a research project so I'd rather not have to recompute all of my results.

Best, Welles

wir963 commented 3 years ago

Hey,

Here's a reproducible example. I am using scran 1.18.0 and scuttle 1.0.0 (freshly installed via the conda-forge/bioconda channel) to generate the below

library(scuttle)
library(scran)
set.seed(1000)

sce <- mockSCE()
sce <- logNormCounts(sce)
kout <- kmeans(t(logcounts(sce)), centers=4)
out.up <- findMarkers(sce, groups=kout$cluster, lfc=1, test.type="wilcox", direction="up")
out.any <- findMarkers(sce, groups=kout$cluster, lfc=1, test.type="wilcox", direction="any")

out.any[["1"]]["Gene_1726",]

returns

DataFrame with 1 row and 7 columns
                Top   p.value       FDR summary.AUC     AUC.2     AUC.3
          <integer> <numeric> <numeric>   <numeric> <numeric> <numeric>
Gene_1726         2 0.0565893         1    0.713273  0.689478  0.713273
              AUC.4
          <numeric>
Gene_1726   0.62537

while

out.up[["1"]]["Gene_1726",]

returns

DataFrame with 1 row and 7 columns
                Top   p.value       FDR summary.AUC     AUC.2     AUC.3
          <integer> <numeric> <numeric>   <numeric> <numeric> <numeric>
Gene_1726         1 0.0565843         1    0.631877  0.588639  0.631877
              AUC.4
          <numeric>
Gene_1726  0.519662

As you can see (and hopefully reproduce), the summary.AUC is dramatically different between these two results. If you look carefully, you'll also notice a slight change in the p.value (which is actually less than you should see comparing the one-sided test to the two-sided test).

Unless I'm missing something (which is totally possible), there is something wrong with one or both of these functions, which is having a serious effect on the reported results.

Best, Welles

LTLA commented 3 years ago

This behavior is probably intended. The interaction of lfc= with direction= is quite non-intuitive for the Wilcoxon test.

This is mostly described in ?pairwiseWilcox, but I'll recap. Let's say we have lfc=1 and direction="up" for cluster A vs B (so, up in A over B). To form the null for Wilcoxon, we literally shift A's expression values to the left by 1, and then we apply the usual Wilcoxon calculations - this means that a gene can only be significant if there is still some difference after applying the shift. For the curious, this implementation is based on that in the base wilcox.test function, so I didn't just make it up. Anyway, the reported AUC is derived from the Wilcoxon calculations after the shift. This is largely for convenience, because that's what I get from the algorithm, but it's also helpful as the reported AUC incorporates the chosen lfc. So a user knows that if the AUC is high, the separation between distributions is greater than the specified lfc.

Now, the above description relates to direction="up". For direction="any" with lfc > 0, things get weird because the null hypothesis is defined by a shift interval of [-lfc, lfc] rather than a single shift value. Technically speaking, we would consider an expression value from A and B to be tied if their absolute difference was less than or equal to lfc, and then compute the Wilcoxon test statistic (and thus the AUC) from that. Now, I couldn't be bothered to do that, and I suspect we'd be wandering outside the relevant theory anyway, so I did the next best thing and averaged the AUCs after shifting in both directions, which is close enough. Regardless of how I choose to implement it, though, the consequence is that the AUC will differ with direction= when lfc > 0.

LTLA commented 3 years ago

Further jiggling of the Pensieve causes some more memories to bubble up to the surface.

As I mentioned above, with direction="up" and lfc > 0, we compute the AUC relative to the lfc. So, if your log-fold change was exactly lfc and the distributions were otherwise identical, the test would report an AUC of 0.5. Conversely, if you had a large AUC, you would know that you had your distributions were strongly separated by at least lfc. This conclusion would not be possible if we reported the AUC at lfc=0, which could be large even when the distributions are separated by less than lfc. The same logic applies for direction="down", except that you'd be looking for AUCs below 0.5 for separation in the desired direction.

Things get interesting when we set direction="any" with lfc > 0. One could say, "oh, just report the AUC from the one-sided test in the direction that the change was significant". This would ensure that the AUCs here are the same as those from direction="up" for up-regulated genes and the same as those from direction="down" for down-regulated genes. Unfortunately, this would make the AUCs quite difficult to interpret. If I got an AUC above 0.5, does this mean that X is significantly greater than Y by lfc? Or does it mean that X is not significantly less than Y by lfc? Making the distinction requires us to report a separate field for each pairwise comparison to specify the direction that we're talking about, and would be even more of a pain.

Hence the current situation. If you want AUCs at lfc=0, you can just compute them with another run of findMarkers and then swap out the p-values with the ones that you get from lfc > 0. This is best done by turning off the sorting so you can just do a regular replacement of the values rather than having to match up gene names across runs.

wir963 commented 3 years ago

Hey Aaron,

Thank you very much for the explanation. I now understand why the AUCs for direction="any" are different from the AUCs for direction="up" and the right way to report these values in my paper. I also understand why the calculation is much harder to code up for direction="any". Thanks as well for the pointer to ?pairwiseWilcox.

I am still confused about how direction="any" works with lfc>0 when using block="plate". My original issue is that I get very different p-values for using lfc>0 with direction="any" vs. direction="up" (I mis-directed you when I tried to come up with a reproducible example). More specifically, I get significantly lower p-value using lfc>0 with direction="any" compared to direction="up".

?pairwiseWilcox has a section on Blocking on Uninteresting Factors where they discuss this

When combining across batches, one-sided p-values in the same direction are combined first. Then, if direction="any", the two combined p-values from both directions are combined. This ensures that a gene only receives a low overall p-value if it changes in the same direction across batches.

I don't completely understand this section. The last sentence is the expected behavior but this seems to contradict the output that I am seeing. Is there anything that I'm missing? I'm trying to wrap my head around how this works with lfc>0.

Best, Welles

LTLA commented 3 years ago

Remember that the p-values from findMarkers() is itself a combination of statistics across all of pairwise comparisons between one cluster and each of the other clusters. As such, I'm willing to bet that the affected gene is:

strongly downregulated in one of the pairwise comparisons, and
this downregulation is consistent across batches.

It may still be upregulated compared to most of the other clusters, but it just has to be (very) strongly downregulated in one of the pairwise comparisons, and that drives a lower reported p-value in findMarkers(). When you switch to direction="up", the function doesn't consider downregulation to be significant anymore, and so that low p-value gets replaced with a large p-value.

This sentence:

This ensures that a gene only receives a low overall p-value if it changes in the same direction across batches.

only applies to the combination of p-values across batches within a single pairwise comparison, not the combination across multiple comparisons. By default, we don't require DE in the same direction across all comparisons, because that's too stringent. (Though you can do so with pval.type="all" combined with one or the other setting of direction=.)

wir963 commented 3 years ago

Hey Aaron,

I'm actually just doing a comparison between two groups of cells (I remove the rest in a pre-processing step) and I get more significant p-values when using direction="any" than direction="up" or direction="down" so I don't think either of your explanations explains the behavior that I am seeing.

I tried to generate a simple example that reproduces my issue but so far these examples work like they should and don't display the weird behavior that I've noticed. If you would be willing to do so, could you try to run the following code using the SCE in the attached zip file and see if you get the same strange results that I report below? TH179_example.rds.zip

library(scran)
library(scater)
sce <- readRDS("TH179_example.rds")
wilcox.markers <- findMarkers(sce, test="wilcox", groups=sce$celltype1, lfc=0.5, pval.type="all", block=sce$plate, direction="any")
wilcox.markers$Tumor

returns

DataFrame with 7 rows and 4 columns
          p.value       FDR summary.AUC AUC.immune
        <numeric> <numeric>   <numeric>  <numeric>
1760   0.00217518 0.0152263    0.641439   0.641439
91061  0.02616132 0.0811387    0.594849   0.594849
1236   0.03477372 0.0811387    0.581584   0.581584
28211  0.32485513 0.5264306    0.547596   0.547596
188787 0.37602185 0.5264306    0.545972   0.545972
28216  0.85349862 0.9600197    0.499520   0.499520
186801 0.96001973 0.9600197    0.498719   0.498719

while

wilcox.markers <- findMarkers(sce, test="wilcox", groups=sce$celltype1, lfc=0.5, pval.type="all", block=sce$plate, direction="up")
wilcox.markers$Tumor

returns

DataFrame with 7 rows and 4 columns
         p.value       FDR summary.AUC AUC.immune
       <numeric> <numeric>   <numeric>  <numeric>
1760   0.0742456  0.519719    0.541650   0.541650
1236   0.2957194  0.752531    0.515164   0.515164
91061  0.3225134  0.752531    0.513700   0.513700
28211  0.7499655  0.999105    0.480399   0.480399
186801 0.9933925  0.999105    0.428434   0.428434
188787 0.9988267  0.999105    0.417044   0.417044
28216  0.9991053  0.999105    0.410823   0.410823

and

wilcox.markers <- findMarkers(sce, test="wilcox", groups=sce$celltype1, lfc=0.5, pval.type="all", block=sce$plate, direction="down")
wilcox.markers$Tumor

returns

DataFrame with 7 rows and 4 columns
         p.value       FDR summary.AUC AUC.immune
       <numeric> <numeric>   <numeric>  <numeric>
186801  0.993396         1    0.569004   0.569004
28216   0.999051         1    0.588216   0.588216
28211   0.999980         1    0.614793   0.614793
1236    1.000000         1    0.648003   0.648003
91061   1.000000         1    0.675998   0.675998
188787  1.000000         1    0.674901   0.674901
1760    1.000000         1    0.741229   0.741229

LTLA commented 3 years ago

After staring at it for a while, I don't think there's any major problem here. The underlying cause is that Stouffer's method (used when block!=NULL) has some interesting interactions with the one-sided p-values computed when direction="any" and lfc > 0. It's not entirely clear that these interactions are generally correct (or incorrect) - this would require some thought.

As it is currently implemented, direction="any" assumes that the probability mass under the null is distributed with 50% at -lfc and 50% at lfc; ironically, this makes it less conservative for detecting upregulated genes than direction="up", which assumes that the probability mass is fully concentrated at lfc under the null. This explains the lower p-values with direction="any".

That distributional assumption is rather limiting and does cause some oddities as seen here, but I don't think it causes major damage to downstream inferences. (Of course, one should not rely on the magnitude of these p-values for anything other than ranking genes within the same dataset.) Nonetheless, I'll try to think of a better way to do it.

wir963 commented 3 years ago

Hey Aaron,

Thanks for your thoughtful discussion on this. From a user's perspective, I think the intuitive result for direction="all" with lfc>0.5 is to calculate the one-sided p-value for lfc>0.5 and calculate the one-sided p-value for lfc<-0.5 and then pick the more significant p-value and then convert to a two-sided p-value by doubling it. This is the approach that we're planning to take by combining the outputs of different runs of pairwiseWilcox. Would you mind if we cite this discussion as a personal correspondence with Aaron Lun to justify why we are doing the above instead of simply running direction="any" with lfc > 0?

Best, Welles

LTLA commented 3 years ago

This is the approach that we're planning to take by combining the outputs of different runs of pairwiseWilcox.

Yes, that would be the simple approach. Unfortunately, it is more conservative than it needs to be in the special case of lfc > 0, direction="any" and block=NULL. The current approach is based on how limma's treat() method is implemented and is more powerful when lfc gets large but doesn't generalize naturally to block!=NULL.

Would you mind if we cite this discussion as a personal correspondence with Aaron Lun to justify why we are doing the above instead of simply running direction="any" with lfc > 0?

Or you could just link to this issue.

LTLA commented 3 years ago

Well, I'm not too happy about it, but I ended up switching direction="any" to the simple 2 * min(up, down, 0.5) approach. It yields p-values that are twice as large as they need to be in the case of a large lfc, but I guess it doesn't really matter, given that the p-values computed on single-cell data are not to be taken at face value anyway.

This should be available on BioC-devel in a day or so.

wir963 commented 3 years ago

Hey Aaron,

Thanks for letting me know! I'll go ahead and close out this issue.

Best, Welles