PathwayAndDataAnalysis / Finkle-PHYS-479

GNU Lesser General Public License v2.1

Hypergeometric test for enrichment test #2

Closed ozgunbabur closed 2 years ago

ozgunbabur commented 2 years ago

Write a script that will generate a p-value for enrichment (or for the opposite: deficiency), for a given set of values in a 2x2 table, using the hypergeometric test.

Step 1: Find a library that will run the hypergeometric test on a given 2x2 table to generate a p-value.

Step 2: Implement a script that will use that library to calculate a p-value for the enrichment in a given 2x2 table.

Here is an example. Let's say we are given the below 2x2 table

|   | - | + |
|---|---|---|
| - | 15 | 10 |
| + | 3 | 5 |

We want to check whether the number 5 (the +/+ cell) indicates an enrichment, i.e. whether 5 is too high assuming an independent distribution of these two features. The enrichment test applies the hypergeometric test to this table and to every more imbalanced table with the same row and column totals. These are:

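|   | - | + |
|---|---|---|
| - | 15 | 10 |
| + | 3 | 5 |

|   | - | + |
|---|---|---|
| - | 16 | 9 |
| + | 2 | 6 |

|   | - | + |
|---|---|---|
| - | 17 | 8 |
| + | 1 | 7 |

|   | - | + |
|---|---|---|
| - | 18 | 7 |
| + | 0 | 8 |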

Summing the hypergeometric probabilities of these tables gives the one-tailed p-value for enrichment, that is, the probability of the +/+ count being 5 or more by chance.
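
As a minimal sketch of that sum, the code below enumerates the same tables and adds up their hypergeometric probabilities. It assumes the table is read as a, b on the first row and c, d on the second, with d the +/+ count. The names `table_probability` and `enrichment_tables` are made up for this illustration, and only Python's standard library is used rather than any particular statistics package.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of one 2x2 table with fixed row and column totals."""
    total = a + b + c + d
    # Choose d of the (b + d) items in the '+' column and c of the (a + c)
    # items in the '-' column to fill the '+' row, which holds (c + d) items.
    return comb(b + d, d) * comb(a + c, c) / comb(total, c + d)


def enrichment_tables(a, b, c, d):
    """Yield the given table and every more imbalanced one (larger d, same totals)."""
    while b >= 0 and c >= 0:
        yield a, b, c, d
        # Move one count onto the diagonal; row and column totals do not change.
        a, b, c, d = a + 1, b - 1, c - 1, d + 1


tables = list(enrichment_tables(15, 10, 3, 5))
p_enrichment = sum(table_probability(*t) for t in tables)
print(tables)
print(p_enrichment)
```

For the example table this sum comes to about 0.24, so a count of 5 would not be called significantly enriched at the usual thresholds.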

To test for deficiency instead (to see if 5 is a significantly low number), we need to find the probability of the +/+ count being 5 or fewer by chance. In that case, the tables we need to include are:

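|   | - | + |
|---|---|---|
| - | 15 | 10 |
| + | 3 | 5 |

|   | - | + |
|---|---|---|
| - | 14 | 11 |
| + | 4 | 4 |

|   | - | + |
|---|---|---|
| - | 13 | 12 |
| + | 5 | 3 |

|   | - | + |
|---|---|---|
| - | 12 | 13 |
| + | 6 | 2 |

|   | - | + |
|---|---|---|
| - | 11 | 14 |
| + | 7 | 1 |

|   | - | + |
|---|---|---|
| - | 10 | 15 |
| + | 8 | 0 |
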
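
The deficiency direction can be sketched the same way. Instead of listing whole tables, the code below walks the +/+ count from 0 up to the observed 5 and sums the probability of each value. The function name `plus_plus_probability` and its argument names are assumptions for this illustration. It also checks that the two tails overlap only at the observed table, so the enrichment and deficiency p-values add up to 1 plus the probability of that table.

```python
from math import comb, isclose


def plus_plus_probability(k, row_plus, col_plus, total):
    """P(+/+ count == k) given the '+' row total, '+' column total and grand total."""
    return (comb(col_plus, k) * comb(total - col_plus, row_plus - k)
            / comb(total, row_plus))


# Example table from the issue: a=15, b=10, c=3, d=5.
row_plus, col_plus, total, observed = 3 + 5, 10 + 5, 33, 5

# Deficiency: the +/+ count is 5 or fewer (0 is reachable for these totals).
p_deficiency = sum(plus_plus_probability(k, row_plus, col_plus, total)
                   for k in range(0, observed + 1))

# Enrichment: the +/+ count is 5 or more (8 is the largest reachable value here).
p_enrichment = sum(plus_plus_probability(k, row_plus, col_plus, total)
                   for k in range(observed, row_plus + 1))

# Both tails include the observed table, so it is counted twice in their sum.
overlap = plus_plus_probability(observed, row_plus, col_plus, total)
assert isclose(p_enrichment + p_deficiency, 1.0 + overlap)
print(p_deficiency)
```
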
nurith commented 2 years ago

Ozgun,

I may have missed the original emails, can you please send a link to some info about enrichment tables? I'm not sure how to read them.

Thanks!

ozgunbabur commented 2 years ago

Here is the link to the Google Docs presentation describing those tables: https://docs.google.com/presentation/d/1ycjs3BdIcXiQz7uCx9KUWYRa9cgc0Ur-CP0R94qRy30/edit?usp=sharing

ozgunbabur commented 2 years ago

And please write tests to make sure your code works as expected. Here are some tables for testing.

Assume the letters a, b, c, and d represent the counts as displayed below.

|   | - | + |
|---|---|---|
| - | a | b |
| + | c | d |

Then the p-values should be:

| a | b | c | d | Enrichment p-value | Deficiency p-value |
|---|---|---|---|---|---|
| 4 | 6 | 6 | 4 | 0.9105522960012121 | 0.3281408993483296 |
| 20 | 30 | 25 | 30 | 0.7766662300662146 | 0.357198690677301 |
| 15 | 8 | 20 | 42 | 0.006445865568610187 | 0.9985899806396821 |
| 21 | 20 | 34 | 13 | 0.9883420938210076 | 0.0341612031176084 |
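
One way to turn these tables into automated checks is sketched below, using only Python's standard library. The function names `enrichment_p_value`, `deficiency_p_value` and `run_checks`, and the comparison tolerance, are assumptions for illustration. The expected values are the ones listed above.

```python
from math import comb, isclose


def _pmf(k, total, col_plus, row_plus):
    """Hypergeometric probability that the +/+ cell equals k."""
    return (comb(col_plus, k) * comb(total - col_plus, row_plus - k)
            / comb(total, row_plus))


def enrichment_p_value(a, b, c, d):
    """P(+/+ count >= d) with the table's row and column totals held fixed."""
    total, col_plus, row_plus = a + b + c + d, b + d, c + d
    highest = min(col_plus, row_plus)
    return sum(_pmf(k, total, col_plus, row_plus) for k in range(d, highest + 1))


def deficiency_p_value(a, b, c, d):
    """P(+/+ count <= d) with the table's row and column totals held fixed."""
    total, col_plus, row_plus = a + b + c + d, b + d, c + d
    lowest = max(0, col_plus + row_plus - total)
    return sum(_pmf(k, total, col_plus, row_plus) for k in range(lowest, d + 1))


# (a, b, c, d), expected enrichment p-value, expected deficiency p-value
CASES = [
    ((4, 6, 6, 4), 0.9105522960012121, 0.3281408993483296),
    ((20, 30, 25, 30), 0.7766662300662146, 0.357198690677301),
    ((15, 8, 20, 42), 0.006445865568610187, 0.9985899806396821),
    ((21, 20, 34, 13), 0.9883420938210076, 0.0341612031176084),
]


def run_checks(rel_tol=1e-9):
    """Return the cases whose computed p-values differ from the expected ones."""
    failures = []
    for table, want_enrichment, want_deficiency in CASES:
        got = (enrichment_p_value(*table), deficiency_p_value(*table))
        if not (isclose(got[0], want_enrichment, rel_tol=rel_tol)
                and isclose(got[1], want_deficiency, rel_tol=rel_tol)):
            failures.append((table, got))
    return failures


print(run_checks())
```

An empty list from `run_checks()` means every case matched within the tolerance. Comparing with a tolerance rather than `==` avoids failures caused only by floating-point rounding.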

AdamFinkleUMB commented 2 years ago

I've implemented the test but have not yet discovered how to fix the p_value calculation method...

AdamFinkleUMB commented 2 years ago

Fixed it! It was a small error in the code. Now, it passes 3 of 4 tests!

ozgunbabur commented 2 years ago

Almost there! There are a few problems.

  1. Please have 2 different methods: one for enrichment and one for deficiency. Instead of a "p_values" method, you can have "enrichment_pval" and "deficiency_pval" methods. We don't want these two p-values to be calculated together every time.
  2. Your call of hypergeom.pmf for calculating the deficiency p-value puzzled me. This method takes these 4 parameters: (selected_favorable, trials, favorable, selected), and that is exactly how you call it for the enrichment p-value. But for the deficiency p-value, you pass "unfavorable" instead of "favorable". Why is that?
  3. The calculation of the "more_favorable" and "less_favorable" arrays does not seem correct. The first array starts from "selected_favorable" and goes up to the max possible value. You assumed that the max possible value is "favorable", but this is only sometimes true. If "selected" is smaller than "favorable", then it can only go up to "selected". In other words, the max possible value is min(selected, favorable). The calculation of the "less_favorable" array is more problematic. This array should again start from "selected_favorable" and go down to the minimum possible value, which is max(0, selected + favorable - trials).

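
A rough sketch of what these three points could look like together follows. This is not the code under review, which is not shown in this thread. The `pmf` helper is a standard-library stand-in that takes its arguments in the order quoted in item 2, and the `support` helper is a hypothetical name for the bounds described in item 3.

```python
from math import comb


def pmf(selected_favorable, trials, favorable, selected):
    """Stand-in for hypergeom.pmf, using the argument order quoted in item 2."""
    return (comb(favorable, selected_favorable)
            * comb(trials - favorable, selected - selected_favorable)
            / comb(trials, selected))


def support(trials, favorable, selected):
    """Smallest and largest possible values of selected_favorable."""
    lowest = max(0, selected + favorable - trials)
    highest = min(selected, favorable)
    return lowest, highest


def enrichment_pval(selected_favorable, trials, favorable, selected):
    """Probability of selected_favorable or more favorable items."""
    _, highest = support(trials, favorable, selected)
    more_favorable = range(selected_favorable, highest + 1)
    return sum(pmf(k, trials, favorable, selected) for k in more_favorable)


def deficiency_pval(selected_favorable, trials, favorable, selected):
    """Probability of selected_favorable or fewer favorable items."""
    lowest, _ = support(trials, favorable, selected)
    less_favorable = range(lowest, selected_favorable + 1)
    return sum(pmf(k, trials, favorable, selected) for k in less_favorable)


# selected (8) is smaller than favorable (15): the upper bound is 8, not 15.
print(support(33, 15, 8))   # (0, 8)
# selected + favorable (112) exceeds trials (85): the lower bound is 27, not 0.
print(support(85, 50, 62))  # (27, 50)
```

Both p-value functions pass the same "favorable" count to `pmf`; only the range of counts being summed differs between them.
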
AdamFinkleUMB commented 2 years ago

I've applied your changes, and now the code passes only 3 of the 4 enrichment tests and none of the deficiency tests. Before the changes, it passed 3 of the 4 deficiency tests.

ozgunbabur commented 2 years ago

There is a bug in the deficiency pval implementation. Please look at line 19.

AdamFinkleUMB commented 2 years ago

I feel silly. The code seemed fine to me, so I sought errors in what I was trying to do, but then just ran the code and saw the return statement lacked a closing parenthesis.

