Question about collapsePathways() logic

jasonvrogers commented 4 years ago

Hi,

I just discovered the fgsea package and I love it. Very clean and easy to use.

I tried using the collapsePathways function on my results and it threw out a few pathways that I intuitively believed shouldn't have much overlap. So I started looking into the source code and realized two things I wanted to let you know about.

First, the main problem was this parameter: nperm = 10/pval.threshold

In the following chunk of code, which uses the universe of genes WITHOUT those from the query set, the fgseaSimple() call for 200 permutations returned a p-value of ~.08. Therefore, the set was deemed redundant and discarded.

u1 <- setdiff(universe, pathways[[p]])

fgseaRes1 <- fgseaSimple(pathways = pathways[pathwaysToCheck], 
  stats = stats[u1], nperm = nperm, maxSize = length(u1) - 
    1, nproc = 1, gseaParam = gseaParam)

However, running the exact same parameters on fgseaMultilevel returned a p-value of ~10^-8, which fits my intuition that the gene sets have almost no overlap and are not redundant.

I tried increasing the number of permutations to the call above and got the following results.

As you can see, even though the "true" p-value for that set should be very low, at the default settings of pcutoff .05 and 200 permutations, it fails significance testing, and likewise fails again at pcutoff .01 and 1000 permutations.

I'm not sure whether this is due to some bizarre peculiarity with my gene sets or not (i am working with s. cerevisiae GO terms), but I would like to recommend changing the default setting to nperm = 100/pval.threshold or changing the test to fgseaMultilevel

Second, the logic of the second fgsea call makes no sense to me

At this part:

u2 <- pathways[[p]]
fgseaRes2 <- fgseaSimple(pathways = pathways[pathwaysToCheck], 
  stats = stats[u2], nperm = nperm, maxSize = length(u2) - 
    1, nproc = 1, gseaParam = gseaParam)

minPval[fgseaRes2$pathway] <- pmin(minPval[fgseaRes2$pathway], 
  fgseaRes2$pval)

parentPathways[names(which(minPval > pval.threshold))] <- p

The minimum p-value is merged into the minPval vector, and then any pathways not passing the p-value cutoff are marked redundant. This second part seems backwards... it is asking whether you still get pathway enrichment using just the genes from the query pathway "p." You would expect there to be NO enrichment for non-overlapping gene sets and a high p-value.

As an example, consider gene sets A and B. Both are highly significantly enriched in the dataset, and set A has 400 genes, and set B has the same 400 genes plus 1 more. Obviously, we would want to mark these as redundant.

In the above algorithm, the first test removes the 400 genes from set A and tests set B for enrichment. This returns a high p-value which would fail the p value cutoff.

In the second test, using only the 400 genes from set A, you would get a VERY significant enrichment in set B. Since the minimum p-value from both tests is stored in the minPval vector, gene set B will actually be marked as non-redundant.

Do I have this correct or am I missing something?

Perhaps instead we should do a second test here for anything passing the p-value cutoff, and mark those as redundant. Alternatively, I'm not sure this second test is needed at all. Just the first test, checking the pathways with the query pathway genes removed, seems sufficient.

Thanks, Jason

assaron commented 4 years ago

Hi,

Thanks for this report.

The first part looks like a bug. P-value of 1e-8 shouldn't become 0.08 that easily. Probably it happens due to skewed distribution of input stats. Can you attach or send me an e-mail with the data you are using, so I can reproduce this?

For the second part, I believe everything is correct. The important part there, that the redundancy is directional. In your case, for example, when there is a pathway A of 400 genes and pathways B of 400 + one random gene. Then pathway B will be deemed redundant, given pathway A: there will be no enrichment of this one genes on U\A background and there will be no enrichment of 400 A \intersect B genes on the background of (the same) 400 genes from A. Both P-values will be high. In other, direction, however, A probably will be also called redundant given B, because 400 in 401 genes will never get a significant p-value, but that may change, when the difference in sizes becomes higher.

jasonvrogers commented 4 years ago

Thanks for the quick reply. I'm attaching a zip containing two files and a script to reproduce the phenomenon I saw in part 1. fgsea_bug_report.zip

For part 2, the part I was missing was that the directionality is eventually checked as the loop progresses. I was only considering the case of set A vs set B... I forgot to consider when the loop eventually reaches set B vs. set A.

assaron commented 4 years ago

Thanks for the data. I can reproduce the problem, we'll try to fix it.

assaron commented 4 years ago

The bug should be fixed now (in github version).

SplitInf commented 5 months ago

Hi @assaron, I'm wondering if this patch is added to the bioconductor version (currently fgsea_1.18.0)? Looking at the changes it seems to be implemented for v1.19.2... https://github.com/ctlab/fgsea/commit/04a229522a52daf60a8c6e24b102c727e01135d9

many thanks!

assaron commented 5 months ago

@SplitInf It's been available since version 1.13.4

ctlab / fgsea

Question about collapsePathways() logic #63

At this part: