Simon-Coetzee / motifBreakR

A Package For Predicting The Disruptiveness Of Single Nucleotide Polymorphisms On Transcription Factor Binding Sites.

parallel execution of calculatePvalue? #35

Closed paul-shannon closed 2 years ago

paul-shannon commented 3 years ago

@scoetzee,

Now that the parallel execution of motifbreakR works so well, I wonder if this could be made available for calculatePvalue as well?

Any suggestions?

paul-shannon commented 2 years ago

@scoetzee Many months later than predicted, I am now using motifbreakR at scale, preparing a talk for this Friday: "Exploring for tissue-specific effects of non-coding variants at cryptic AD GWAS loci".

It would be really great if calculatePvalue ran in parallel. Is there any chance this could happen soon? I'd be most grateful.

Simon-Coetzee commented 2 years ago

So I ran some tests, and it should be possible. One caveat is that some p-value calculations can take a very long time and a huge amount of memory, due to the dynamic programming method that's used to calculate them. This is why I have been hesitant to implement it: most of the time and memory is spent on one or two SNPs. However, I believe that it could be made more deterministic if I use a rounded matrix with a fixed granularity that could be set by the user. Would that be useful for your purposes?
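For readers unfamiliar with the idea, here is a minimal Python sketch of why rounding the matrix bounds the cost of a dynamic-programming p-value. It is not motifbreakR's implementation: the function name `score_pvalue`, the list-of-dicts matrix layout, and the uniform default background are all assumptions made for illustration. Rounding each entry to a multiple of `granularity` caps the number of distinct partial scores the table has to track.

```python
from collections import defaultdict


def score_pvalue(pwm, threshold, granularity=1e-4, background=None):
    """Return P(score >= threshold) for a random sequence.

    pwm: list of dicts, one per motif position, mapping base -> score
         (hypothetical layout, chosen for this sketch).
    granularity: step to which every matrix entry is rounded.
    background: base -> probability; uniform if not given (assumption).
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}

    # Round each entry to an integer number of granularity steps, so
    # partial scores are integers and collide instead of multiplying.
    rounded = [
        {base: round(score / granularity) for base, score in column.items()}
        for column in pwm
    ]

    # dist maps a partial score (in steps) to its total probability.
    dist = {0: 1.0}
    for column in rounded:
        nxt = defaultdict(float)
        for partial, prob in dist.items():
            for base, step in column.items():
                nxt[partial + step] += prob * background[base]
        dist = nxt

    cutoff = round(threshold / granularity)
    return sum(prob for score, prob in dist.items() if score >= cutoff)


# Toy two-position matrix: only "A" scores at each position.
toy_pwm = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
]
p_both = score_pvalue(toy_pwm, threshold=2.0)
p_any = score_pvalue(toy_pwm, threshold=1.0, granularity=0.5)
```

A coarser `granularity` means fewer distinct partial scores, so less time and memory, at the price of a less exact p-value.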

paul-shannon commented 2 years ago

Yes, Simon, that would be very useful. All my calculations come with caveats and probabilities, as I assemble lots of sometimes reinforcing tentative evidence.

So adding even imperfect p-values is a boon.

Simon-Coetzee commented 2 years ago

It looks like the p-value stabilizes somewhere around a granularity of 1e-4 for this particular SNP with the JASPAR database.

To be clear, the p-value represents the p-value for the reference or alternate allele binding. I have added an alleleEffectSize that is something like the proportion of the change caused by ref vs alt over the total possible PWM score.
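As a rough illustration of that description, the sketch below computes one possible reading of such an effect size in Python. It is an assumption, not the formula motifbreakR uses: here "total possible PWM score" is taken to be the gap between the best and worst achievable scores, and the function name `allele_effect_size` and the list-of-dicts matrix layout are hypothetical.

```python
def allele_effect_size(pwm, ref_seq, alt_seq):
    """Score change (alt minus ref) as a fraction of the score range.

    pwm: list of dicts, one per motif position, mapping base -> score
         (hypothetical layout, chosen for this sketch).
    ref_seq, alt_seq: sequences of the same length as the motif.
    """
    if len(ref_seq) != len(pwm) or len(alt_seq) != len(pwm):
        raise ValueError("sequences must match the motif length")

    def score(seq):
        # Sum the matrix entry for the observed base at each position.
        return sum(column[base] for column, base in zip(pwm, seq))

    # Assumed denominator: best achievable score minus worst achievable.
    best = sum(max(column.values()) for column in pwm)
    worst = sum(min(column.values()) for column in pwm)
    if best == worst:
        raise ValueError("matrix has no score range")

    return (score(alt_seq) - score(ref_seq)) / (best - worst)


toy_pwm = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 2.0, "T": 0.0},
]
effect = allele_effect_size(toy_pwm, ref_seq="AG", alt_seq="AT")
```

Under this reading, a negative value means the alternate allele scores lower than the reference, and the magnitude is bounded by 1.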

The current version here, 2.8.99, has these features.

[Figures: p-value vs. rank for a particular SNP against all motifs at many granularities; the same plot zoomed to the first 50 motifs]

paul-shannon commented 2 years ago

Thanks, Simon. In my limited use so far, calculatePvalue runs fast and provides useful information.

I'm grateful.