Significant results with randomised genes

sylvia-science commented 1 month ago

Hello,

I'm running pathfindR on a list of 10,000 genes from macrophages in scRNA-seq data comparing two conditions. I calculate the DEGs using Seurat's FindMarkers function with the standard logfc.threshold = 0.1 and min.pct = 0.01. So it's a data frame of all genes that have an absolute log fold change of at least 0.1 and are expressed in at least 1% of the macrophages. About 1,000 of these genes have adjusted p-values < 0.05.

Initially I was really happy with the results from pathfindR, but I also ran the same genes through gseKEGG and found a lot of differences: many more pathways were significant in pathfindR than in gseKEGG. I understand that the algorithms are quite different, so it's normal for the results to differ, but with so little overlap I wanted to do some more testing.

As a quick sanity check, I randomised the 10,000 genes completely and ran them again. Now I see 116 pathways still significantly upregulated. With the original non-random data, I see 209 significant results. I think what's happening is that a few transcription factors that are at the center of a large network happen to be "upregulated" and then the algorithm finds pathways sharing these transcription factors. Do you think this is a reasonable number of false positives for a random input? If I'm misunderstanding something about the input, please let me know.
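The thread does not say exactly how the genes were randomised. One common reading, sketched here in Python purely as an illustration, is to shuffle the gene names across rows while leaving each row's statistics in place, so the distribution of fold changes and adjusted p-values is unchanged and only the gene-to-statistic assignment is random. The function name `permute_gene_labels` and the toy records are hypothetical and not part of pathfindR or Seurat.

```python
import random

def permute_gene_labels(records, seed=0):
    """Shuffle gene names across rows.

    Each row keeps its (log fold change, adjusted p-value) pair, so the
    number of 'significant' rows is identical before and after shuffling.
    """
    rng = random.Random(seed)
    genes = [gene for gene, _, _ in records]
    rng.shuffle(genes)
    return [(gene, lfc, padj) for gene, (_, lfc, padj) in zip(genes, records)]

# Hypothetical toy input: (gene, log fold change, adjusted p-value)
degs = [
    ("GENE_A", 1.2, 0.001),
    ("GENE_B", -0.4, 0.20),
    ("GENE_C", 0.3, 0.04),
    ("GENE_D", -1.1, 0.0005),
]

shuffled = permute_gene_labels(degs, seed=42)
```

Repeating the shuffle with several seeds and re-running the enrichment each time would give a rough idea of how many pathways to expect from random input alone, rather than relying on a single randomised run.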

Here are the top results from using the random input.

Thank you in advance for your response!

ozanozisik commented 1 month ago

Hello,

Getting enrichment results even with permuted data is a known issue in active module identification methods (PMID: 33471440, Fig. 1). As you have pointed out, the methods can connect high-scoring genes on the network that take part in the same biological processes. This does not directly imply that the methods find false positives. When these methods are used with condition-specific (real) data, the enrichment results are relevant and can help in understanding the perturbed biological processes. If you are using the KEGG PIN, I would suggest not combining it with KEGG pathways as the gene sets.

egeulgen commented 1 month ago

Hi, I’d like to add a point on the large input size, which I think is contributing to the false positives in the random test. Including a large number of genes in the analysis might dilute the actual biological signals you are interested in. The analysis would then capture broader processes rather than the key ones relevant to the study.

To mitigate this, I would suggest adopting a stricter definition of differentially expressed genes relevant to your study. If you believe that a large number of genes are relevant, I would then suggest subgrouping the input genes based on correlation, i.e., running pathway enrichment analysis on subgroups of highly correlated genes (perhaps even performing WGCNA + enrichment analysis; see, e.g., https://smorabit.github.io/hdWGCNA/articles/basic_tutorial.html).
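As a rough illustration of a stricter DEG definition, the sketch below keeps only genes that pass both an absolute log fold change cutoff and an adjusted p-value cutoff. It is written in Python for self-containment; the function name `filter_degs`, the cutoffs (0.5 and 0.05), and the toy records are assumptions chosen for the example, not values recommended in this thread.

```python
def filter_degs(records, min_abs_lfc=0.5, max_padj=0.05):
    """Keep rows with |log fold change| >= min_abs_lfc and padj < max_padj."""
    return [
        (gene, lfc, padj)
        for gene, lfc, padj in records
        if abs(lfc) >= min_abs_lfc and padj < max_padj
    ]

# Hypothetical toy input: (gene, log fold change, adjusted p-value)
degs = [
    ("GENE_A", 1.2, 0.001),   # passes both cutoffs
    ("GENE_B", -0.4, 0.20),   # fails both cutoffs
    ("GENE_C", 0.3, 0.04),    # significant, but fold change too small
    ("GENE_D", -1.1, 0.0005), # passes both cutoffs
]

strict = filter_degs(degs)
strict_genes = [gene for gene, _, _ in strict]
```

The appropriate cutoffs depend on the dataset; the point is only that the filtered list, not the full 10,000-gene table, would be passed on to enrichment.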

sylvia-science commented 1 month ago

Hello,

Thank you both for the explanations! I will try stricter DEG criteria, as I do think the log fold change threshold can be raised to find more relevant genes. The WGCNA enrichment analysis is also a very interesting idea.

egeulgen / pathfindR

Significant results with randomised genes #212