eternagame / EternaFold

Improving RNA structure prediction through multitask learning on diverse crowdsourced data.
Other
49 stars 10 forks source link

Probability of pairing greater than 1 #2

Closed domen13 closed 4 months ago

domen13 commented 1 year ago

Expected Behavior

Probability of pairing $pi$, calculated as a sum of all pairing probabilities $p{ij}$, should not exceed unity, $p_i = \sumj p{ij} \leq 1$.

Current Behavior

For longer sequences, probability of pairing for certain nucleotides exceeds 1, the difference being significantly greater than one would expect from numerical instabilities.

Reproduction Steps

Fold the following MS2 sequence (MS2_short.txt) with default parameters:

./src/contrafold predict MS2_short.txt --params parameters/EternaFoldParams.v1 --posteriors 0.00001 MS2_bps.txt

Looking at the obtained pair probability matrix (MS2_bps.txt), one observes the following two lines:

571 C 613:0.00438902 616:0.996947
572 C 612:0.00449777 615:0.996294

The total pairing probability for nucleotides 571 and 572 therefore exceeds 1: $p{571} = 1.00134$, $p{572}= 1.000792$.

Environment

No response

Other information

No response

tkaragianes commented 1 year ago

Great, thanks for this issue! We're looking into it, and we'll let you know about a fix soon. This might to be related to an issue that Contrafold had and has since fixed, so hopefully simple to resolve.

Tagging @rhiju @HWaymentSteele.

domen13 commented 5 months ago

@tkaragianes, thanks for your feedback! We are trying to use EternaFold to compare the results with another secondary structure prediction software, and we are wondering whether there has been any updates regarding this issue.

rhiju commented 5 months ago

@domen13 Thanks again for bringing this to our attention.

My lab hasn't been able to attend to this, but we can provide some quick advice.

Some time ago, @HWaymentSteele and I had noticed a similar bug in the LinearPartition implementation of Contrafold. The code's authors eventually fixed the issue by using double instead of float and exp() instead of FastExp(). This is the commit, I think: https://github.com/LinearFold/LinearPartition/commit/94bca573107237636b4cbee39560917a9b70faeb

Here's a suggestion -- have you tried to run EternaFold with LinearPartition following these instructions? If so, do you still see probability of pairing greater than 1?

If LinearPartition-E is OK, then you could potentially just use it for your tests. If you have the interest, it would be helpful then to also patch up the (non-linearized) ContraFold code that we have in the EternaFold repository -- and that could be a simple matter of updating float to double. Could you help us explore that with your test cases?

domen13 commented 4 months ago

@rhiju, thank you a lot for your response and suggestion!

I have now tested my sequence (along with a set of random sequences) on EternaFold and LinearPartition (either with ContraFold, ViennaRNA or EternaFold parameters), and it seems that your suggestion works! Neither LinearPartition (ContraFold) nor LinearPartition (EternaFold) give corrupted probabilities. You can read a more detailed report here.

Then, I changed <float> to <double> in Contrafold.cpp:121, and it looks like this solves the issue for EternaFold.

I have made a [PR]() to add USE_DOUBLE_IN_PREDICT in the configuration file.