YeoLab / skipper

Skip the peaks and expose RNA-binding in CLIP data

question about enriched terms #24

Closed yos-hida closed 9 months ago

yos-hida commented 10 months ago

I have a question about enriched gene sets. In some cases, the ranking in the PDF output (gene_sets), based on l2fc, p_adj, and p_unadjusted, does not match the ratio of 'n_genes_enriched' to 'n_genes_term'. For example, in the following case, a term with 9 enriched genes out of 20 is ranked higher than a term with 71 enriched genes out of 109. I wonder why this discrepancy occurs. When searching for more significantly enriched terms, which columns can be used for sorting the results? Thank you in advance.

| term | n_genes_term | n_genes_enriched | n_windows_enriched | n_windows_total | estimate | statistic | parameter | conf.low | conf.high | method | alternative | l2fc | p_unadjusted | p_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GOBP_NEGATIVE_REGULATION_OF_TRANSCRIPTION_REGU... | 20 | 9 | 99 | 34746 | 0.002849 | 99 | 34746 | 0.002316 | 0.003468 | Exact binomial test | two.sided | 1.440498 | 9.964187e-18 | 5.048853e-14 |
| GOMF_TRANSLATION_REGULATOR_ACTIVITY_NUCLEIC_AC... | 109 | 71 | 669 | 34746 | 0.019254 | 669 | 34746 | 0.017835 | 0.020754 | Exact binomial test | two.sided | 0.487612 | 7.434452e-17 | 3.767037e-13 |

augustboyle commented 10 months ago

Hello,

The gene set tests are based on the number of windows, not the number of genes, so that should explain the difference.

I spent some time exploring how to do gene set tests, but I don't think the current approach is truly satisfactory. I think some sort of permutation test with windows controlled for feature type, GC content, and read depth would work better. The problem in evaluating approaches is that most RBPs do not selectively process a biological pathway; e.g., they perform splicing for all types of genes.
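To make the window-versus-gene distinction concrete, here is a rough Python sketch that recomputes a few quantities from the two rows quoted in this thread. It is not skipper's code. It assumes that `l2fc` is the log2 ratio of the observed window fraction (`n_windows_enriched / n_windows_total`, which matches the `estimate` column) to a background fraction for the term. That background is not reported in the table, so the sketch back-calculates it from `l2fc`. The function names are made up for illustration, and the recomputed p-value can only approximate `p_unadjusted` because the inputs are rounded.

```python
import math

# The two rows quoted in this thread (only the columns needed here).
ROWS = [
    {
        "term": "GOBP_NEGATIVE_REGULATION_OF_TRANSCRIPTION_REGU...",
        "n_genes_term": 20, "n_genes_enriched": 9,
        "n_windows_enriched": 99, "n_windows_total": 34746,
        "l2fc": 1.440498,
    },
    {
        "term": "GOMF_TRANSLATION_REGULATOR_ACTIVITY_NUCLEIC_AC...",
        "n_genes_term": 109, "n_genes_enriched": 71,
        "n_windows_enriched": 669, "n_windows_total": 34746,
        "l2fc": 0.487612,
    },
]


def gene_ratio(row):
    """Fraction of the term's genes that contain an enriched window."""
    return row["n_genes_enriched"] / row["n_genes_term"]


def window_fraction(row):
    """Observed share of enriched windows; matches the `estimate` column."""
    return row["n_windows_enriched"] / row["n_windows_total"]


def implied_background(row):
    """ASSUMPTION: l2fc = log2(window_fraction / background).

    The background fraction is not in the table, so it is back-calculated.
    """
    return window_fraction(row) / 2 ** row["l2fc"]


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of exactly k successes in n trials."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def two_sided_binom_p(k, n, p):
    """Exact two-sided p-value: total probability of every outcome that is
    no more likely than the observed count k."""
    cutoff = log_binom_pmf(k, n, p) + 1e-7  # small tolerance for float ties
    total = 0.0
    for i in range(n + 1):
        lp = log_binom_pmf(i, n, p)
        if lp <= cutoff:
            total += math.exp(lp)
    return min(1.0, total)


def summarize(row):
    """Collect the gene-level and window-level views of one term."""
    p0 = implied_background(row)
    return {
        "term": row["term"],
        "gene_ratio": gene_ratio(row),
        "window_fraction": window_fraction(row),
        "implied_background": p0,
        "p_recomputed": two_sided_binom_p(
            row["n_windows_enriched"], row["n_windows_total"], p0
        ),
    }


if __name__ == "__main__":
    for row in ROWS:
        print(summarize(row))
```

Under that assumption, the GOBP term has the lower gene ratio (9/20 versus 71/109) but the larger `l2fc`, because its 99 windows are compared against a much smaller implied background than the 669 windows of the GOMF term. The gene counts never enter the test, which is why sorting by `n_genes_enriched / n_genes_term` gives a different order than sorting by `l2fc` or `p_adj`.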


yos-hida commented 9 months ago

Thank you very much for your answer.