NathanSkene / EWCE

Expression Weighted Celltype Enrichment. See the package website for up-to-date instructions on usage.
https://nathanskene.github.io/EWCE/index.html
53 stars 25 forks source link

Appropiate background genes for GC & length correction #71

Closed xoelmb closed 2 years ago

xoelmb commented 2 years ago

Is your feature request related to a problem? Please describe. I have run the bootstrap enrichment test with EWCE on ~40 genesets and a scRNA-seq dataset with and without geneSizeControl (GC % & gene length control), and with automatic background genes. The number of significant results (q value < 0.05) I get when geneSizeControl is on is huge in comparison. I have been reading the code and I think that the background genes created when geneSizeControl is on include genes not present in the SCT expression dataset.

Describe the solution you'd like I think there is a simple solution to this problem: restricting the background genes used even when geneSizeControl is on to those present in the expression data. So, basically removing one check would make it. Changing this #### Restrict gene sets to only genes in the SCT dataset #### if (!geneSizeControl) { hits <- hits[hits %in% sct_genes] bg <- bg[bg %in% sct_genes] } For this #### Restrict gene sets to only genes in the SCT dataset #### hits <- hits[hits %in% sct_genes] bg <- bg[bg %in% sct_genes]

The function is check_ewce_genelist_input https://github.com/NathanSkene/EWCE/blob/9400d4cf6b260f418c89fddcc18bca681c4fbf71/R/check_ewce_genelist_input.r#L129-L133

Describe alternatives you've considered Maybe I'm missing something. Is this filtering done in some other part of the code that I have not reached? Is this expected behavior?

Additional context I think this filtering is meant to be done. According to the original EWCE article: Methods: Bootstrapping with Controls for Transcript Length and GC Content

...The deciles of gene size and GC content were calculated over the set of genes expressed in the SCT dataset (after dropping those with low expression levels as described above). The two sets of decile values were used to define a grid, and each gene assigned to a position within the grid based on it's transcript lengths and GC content...

Al-Murphy commented 2 years ago

Hey,

Thanks for bringing this to our attention. We are looking into it and will get back to you on it soon.

bschilder commented 2 years ago

Hi @xoelmb, thanks for bringing this to our attention. I just checked if this was something @Al-Murphy or I had introduced when making updates to EWCE over the last couple years, but it seems that this bit of code existed long before any of those changes were made (at least as far back as May 2018): https://github.com/NathanSkene/EWCE/blob/699ff6dead2970bf40ff5d7b73c212253930ef35/R/check.ewce.genelist.input.r#L87

I agree that this seems to contradict what the documentation says and I can't think of a reason why genes absent from the SCT/CTD wouldn't be dropped in this situation. @NathanSkene do you have any insight into why this was here originally?

NathanSkene commented 2 years ago

Can't think of any reason.... doesn't mean there wasn't one... let's drop it and see what happens?


From: Brian M. Schilder @.> Sent: 21 July 2022 22:21 To: NathanSkene/EWCE @.> Cc: Skene, Nathan G @.>; Mention @.> Subject: Re: [NathanSkene/EWCE] Appropiate background genes for GC & length correction (Issue #71)

This email from @.*** originates from outside Imperial. Do not click on links and attachments unless you recognise the sender. If you trust the sender, add them to your safe senders listhttps://spam.ic.ac.uk/SpamConsole/Senders.aspx to disable email stamping for this address.

Hi @xoelmbhttps://github.com/xoelmb, thanks for bringing this to our attention. I just checked if this was something Alan or I had introduced when making updates to EWCE over the last couple years, but it seems that this was implemented long before any of those changes were made: https://github.com/NathanSkene/EWCE/blob/699ff6dead2970bf40ff5d7b73c212253930ef35/R/check.ewce.genelist.input.r#L87

I agree that this seems to contradict what the documentation says and I can't think of a reason why genes absent from the SCT/CTD wouldn't be dropped in this situation. @NathanSkenehttps://github.com/NathanSkene do you have any insight into why this was here originally?

— Reply to this email directly, view it on GitHubhttps://github.com/NathanSkene/EWCE/issues/71#issuecomment-1191947182, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AH5ZPE4UQ3S4J62VVPXP653VVG5O3ANCNFSM53IAS2XA. You are receiving this because you were mentioned.Message ID: @.***>

bschilder commented 2 years ago

Will do @NathanSkene

Come to think of it, I seem to recall @KittyMurphy encountered a similar effect of increased significant results when using geneSizeControl =TRUE. This likely explains why.

NathanSkene commented 2 years ago

Good point! Thanks for figuring this out


From: Brian M. Schilder @.> Sent: 21 July 2022 22:30 To: NathanSkene/EWCE @.> Cc: Skene, Nathan G @.>; Mention @.> Subject: Re: [NathanSkene/EWCE] Appropiate background genes for GC & length correction (Issue #71)

This email from @.*** originates from outside Imperial. Do not click on links and attachments unless you recognise the sender. If you trust the sender, add them to your safe senders listhttps://spam.ic.ac.uk/SpamConsole/Senders.aspx to disable email stamping for this address.

Will do @NathanSkenehttps://github.com/NathanSkene

Come to think of it, I seem to recall @KittyMurphyhttps://github.com/KittyMurphy encountered a similar effect of increased significant results when using geneSizeControl =TRUE. This likely explains why.

— Reply to this email directly, view it on GitHubhttps://github.com/NathanSkene/EWCE/issues/71#issuecomment-1191953390, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AH5ZPE6ISHT5YHJL26ZBBFTVVG6PHANCNFSM53IAS2XA. You are receiving this because you were mentioned.Message ID: @.***>

xoelmb commented 2 years ago

Great! Thank you all for spending your time on this! I'll keep an eye on how this gets solved and in the meantime I'll be using the results without geneSizeControl. I'm happy I could make this little contribution to such a good package!

Al-Murphy commented 2 years ago

Hey! The fix has been implemented (version 1.5.4) in the Github master branch so feel free to install EWCE from there so you can use it with geneSizeControl. Otherwise it will be up on the bioconductor dev version in the next few days (usually 2-3). You can use that by installing:

BiocManager::install(version='devel')

Thanks for pointing this out!

xoelmb commented 2 years ago

Great! Thanks to you again!! ☺️