Consider removing <4 gene limit

bschilder commented 1 year ago

Currently EWCE::bootstrap_enrichment_test doesn't let you run tests where the number of hit genes is <4. @NathanSkene has noted this cutoff is arbitrary and could be removed. But we should first consider the potential statistical ramifications of small gene lists within the EWCE framework.

@bschilder

What are the dangers of reducing the number of genes, from a stats standpoint?

@Al-Murphy

The way I understand it, the bootstrapping works well because you are looking at the specificity averaged over a gene list. For example, consider looking at the specificity of just one gene. This changes the question: you are now basically asking whether that gene has a higher specificity than the average specificity across all genes (due to the random sampling of the background gene list). So 49% of genes tested would then be specific. I think that when the number of genes you test is large, the chance of seeing a false positive (FP) drops. Does that make sense? It's hard to articulate; I just think you shouldn't run EWCE for a single gene, as the probability of getting an enrichment in a cell type is much higher. I think this is a bit of an issue with EWCE in general, since people can just reduce the size of their gene lists to get significant results, like a form of p-value hacking. Ideally, I guess you would add some penalisation weight for smaller gene lists to avoid the issue, but that would require some testing or theoretical statistical background calculations (where you keep the probability of finding enrichment equal regardless of gene list length).
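One way to make the single-gene case concrete is a toy version of the bootstrap on mean specificity. This is a minimal sketch and not EWCE's implementation: the function name `bootstrap_p_value`, the uniform random specificity values, the `>=` comparison, and sampling without replacement from all genes are assumptions made for illustration.

```python
import random


def bootstrap_p_value(specificity, hit_genes, n_reps=10_000, seed=0):
    """Empirical p-value for the mean specificity of `hit_genes` in one cell type.

    specificity: dict mapping gene -> specificity in a single cell type
                 (hypothetical values, not real EWCE output).
    Random gene lists of the same size are drawn from all genes in `specificity`.
    """
    rng = random.Random(seed)
    background = list(specificity.values())
    m = len(hit_genes)
    observed = sum(specificity[g] for g in hit_genes) / m

    # Count random lists whose mean specificity is at least the observed mean.
    exceed = 0
    for _ in range(n_reps):
        sample = rng.sample(background, m)
        if sum(sample) / m >= observed:
            exceed += 1
    return exceed / n_reps


# Hypothetical data: 1000 genes with uniform random specificity values.
data_rng = random.Random(1)
spec = {f"gene{i}": data_rng.random() for i in range(1000)}
ranked = sorted(spec, key=spec.get, reverse=True)

p_top = bootstrap_p_value(spec, ranked[:1])          # most specific gene alone
p_median = bootstrap_p_value(spec, ranked[499:500])  # a mid-ranked gene alone
print(p_top, p_median)
```

In this toy setup a one-gene list is compared against single random genes, so its p-value tracks where that gene sits in the ranking of all genes rather than any averaging over a list.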

We should:

- Test the effect of hit gene list size on EWCE p-values.
- Test the effect of hit gene list size on Fisher's exact test p-values.
- Compare the distributions of p-values in both cases.
- Perhaps look at some of the benchmarking results that Shuhan performed, or use her framework for testing these potential biases @ss8518
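The first item above could start from a small null simulation along these lines. It is only a sketch under stated assumptions: hit lists are drawn at random from the background (so any "significant" result is a false positive), specificity values are simulated uniform numbers, and `bootstrap_p` and `null_false_positive_rate` are hypothetical helpers, not EWCE functions. It does not cover Fisher's exact test or the case where a list is deliberately trimmed to its most specific genes.

```python
import random


def bootstrap_p(background, hit_values, n_reps, rng):
    """Fraction of random same-size lists with mean >= the observed mean."""
    m = len(hit_values)
    observed = sum(hit_values) / m
    exceed = 0
    for _ in range(n_reps):
        if sum(rng.sample(background, m)) / m >= observed:
            exceed += 1
    return exceed / n_reps


def null_false_positive_rate(list_size, n_genes=500, n_lists=200,
                             n_reps=100, alpha=0.05, seed=0):
    """Share of random (null) hit lists that reach p <= alpha."""
    rng = random.Random(seed)
    # Simulated specificity values for one cell type (assumption).
    background = [rng.random() for _ in range(n_genes)]
    significant = 0
    for _ in range(n_lists):
        hit_values = rng.sample(background, list_size)  # null hit list
        if bootstrap_p(background, hit_values, n_reps, rng) <= alpha:
            significant += 1
    return significant / n_lists


# False positive rate under the null for several hit list sizes.
rates = {size: null_false_positive_rate(size) for size in (1, 2, 4, 10, 20)}
for size, rate in rates.items():
    print(f"list size {size:>2}: fraction with p <= 0.05 = {rate:.3f}")
```

Comparing the per-size rates against `alpha` gives a first check of whether list size alone shifts the false positive rate; the same loop could wrap the real EWCE call and a Fisher's exact test to compare their p-value distributions.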

Al-Murphy commented 1 year ago

I think we should be able to calculate the probability of enrichment for a gene list of length M theoretically, although I would need to have a think about how. For example, with an infinite number of bootstrap tests (N) and M=1, Prob(enrich) would be given by the rank of that gene's specificity among the background genes. For M>1, it gets a little more complex, since it depends on the mean specificity of the gene list and of the bootstrap background gene list.
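A possible way to write this down, as a sketch rather than a worked-out result: the notation below ($s_g$, $G$, $r_h$, $H$, $S$) is assumed for illustration and does not come from EWCE, and it presumes that bootstrap lists are drawn uniformly from the $G$ background genes and compared with $\ge$.

```latex
% Assumed notation: s_g is the specificity of gene g in the tested cell type,
% G is the number of background genes, h is the single hit gene, and r_h is the
% descending rank of h by specificity (1 = most specific, ties counted).
p_{M=1} \;=\; \lim_{N \to \infty} \hat{p}_N
        \;=\; \frac{1}{G}\,\bigl|\{\, g : s_g \ge s_h \,\}\bigr|
        \;=\; \frac{r_h}{G}

% For a hit list H of size M > 1, the limit is a tail probability of the mean
% over uniformly drawn size-M subsets S of the background genes:
p_M \;=\; \Pr_{S}\!\left( \frac{1}{M}\sum_{g \in S} s_g \;\ge\; \frac{1}{M}\sum_{h \in H} s_h \right)
```

Under these assumptions, a single-gene list reaches $p \le \alpha$ only when $r_h \le \alpha G$, that is, when the gene is among the top $\alpha$ fraction of background genes by specificity.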

NathanSkene commented 1 year ago

I think the probability of finding significant hits with gene lists with length of one is very low.

It’s bootstrapping, so there are not really statistical ramifications. It is measuring the distribution empirically.

NathanSkene / EWCE

Consider removing <4 gene limit #79