PoonLab / OpenRDP

An open-source re-implementation of the RDP4 recombination detection program
GNU General Public License v3.0
45 stars 9 forks source link

Discordance between OpenRDP and RDP v5.5 #53

Closed ArtPoon closed 1 year ago

ArtPoon commented 1 year ago
ArtPoon commented 1 year ago

I think a good starting point in analyzing this discordance is to call the binaries directly on data inputs, and compare those outputs to what we are reporting

ArtPoon commented 1 year ago

@GopiGugan can you please check if this affects master branch as well as dev?

darrenpmartin commented 1 year ago

This is maybe expected since, as far as i am aware, there are three potentially important differences between OpenRDP and RDP:

(1) OpenRDP does not seem to make an attempt to reconcile recombination events that yielded the detected recombinant sequences with the recombinant signals it detects. From what I understand, with OpenRDP if two sequences in a datasets are both descended from an ancestral recombinant they will not be identified as sharing evidence of the same ancestral recombination event. Is that right? I couldn't see any code in openRDP for reconciling/mapping the multitudes of detectable recombination signals to individual actual recombination events. (2) OpenRDP uses only the vanilla phylpro methods to identify which of the sequences is the recombinant in a triplet of sequences that yield a recombinant signal. Is that right (again could not see any code for doing that the last time I looked)? RDP uses vanilla phylpro together with 17 other metrics (including tree-distance based variations of phylpro) fed through either a crude human-weighted decision tree or a logistic regression weighted machine-learned formula to make this decision (this latter thing was only added recently and may not be in the version you have on hand). (3) OpenRDP uses the breakpoint locations determined by the different detection methods to identify breakpoint sites but RDP uses a sequence triplet-based HMM method to refine the positions of breakpoint sites - also didn't see any code for that in OpenRDP the last time I checked.

Darren

On Tue, Mar 28, 2023 at 3:17 PM Art Poon @.***> wrote:

  • OpenRDP is currently reporting more recombination breakpoints than RDP v5.5 on Windows for small test cases, for all tests
  • @GopiGugan https://github.com/GopiGugan can you please distribute RDP results either here or on Slack so we can work on this together?

— Reply to this email directly, view it on GitHub https://github.com/PoonLab/OpenRDP/issues/53, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADEJ3TW75KDLK7UAQDPDURLW6LQILANCNFSM6AAAAAAWKR6K7A . You are receiving this because you are subscribed to this thread.Message ID: @.***>

darrenpmartin commented 1 year ago

PS sorry I have pointed this out earlier - I assumed you were aware of the differences but didn't want to bother with all the extra coding and execution-time overhead. All that extra stuff is really just to enable RDP to propose a plausible (small) set of recombination events that may have yielded the detected (large) set of recombination signals. This is primarily to make the outputs easier to digest (by a human user) but is also useful when, for example, the program tries to work out whether there is any evidence of recombination hot/cold-spots (in such cases it would be bad to treat every recombination signal as if it represented a unique independent recombination event)

On Tue, Mar 28, 2023 at 3:43 PM Darren Martin @.***> wrote:

This is maybe expected since, as far as i am aware, there are three potentially important differences between OpenRDP and RDP:

(1) OpenRDP does not seem to make an attempt to reconcile recombination events that yielded the detected recombinant sequences with the recombinant signals it detects. From what I understand, with OpenRDP if two sequences in a datasets are both descended from an ancestral recombinant they will not be identified as sharing evidence of the same ancestral recombination event. Is that right? I couldn't see any code in openRDP for reconciling/mapping the multitudes of detectable recombination signals to individual actual recombination events. (2) OpenRDP uses only the vanilla phylpro methods to identify which of the sequences is the recombinant in a triplet of sequences that yield a recombinant signal. Is that right (again could not see any code for doing that the last time I looked)? RDP uses vanilla phylpro together with 17 other metrics (including tree-distance based variations of phylpro) fed through either a crude human-weighted decision tree or a logistic regression weighted machine-learned formula to make this decision (this latter thing was only added recently and may not be in the version you have on hand). (3) OpenRDP uses the breakpoint locations determined by the different detection methods to identify breakpoint sites but RDP uses a sequence triplet-based HMM method to refine the positions of breakpoint sites - also didn't see any code for that in OpenRDP the last time I checked.

Darren

On Tue, Mar 28, 2023 at 3:17 PM Art Poon @.***> wrote:

  • OpenRDP is currently reporting more recombination breakpoints than RDP v5.5 on Windows for small test cases, for all tests
  • @GopiGugan https://github.com/GopiGugan can you please distribute RDP results either here or on Slack so we can work on this together?

— Reply to this email directly, view it on GitHub https://github.com/PoonLab/OpenRDP/issues/53, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADEJ3TW75KDLK7UAQDPDURLW6LQILANCNFSM6AAAAAAWKR6K7A . You are receiving this because you are subscribed to this thread.Message ID: @.***>

ArtPoon commented 1 year ago

Ok thanks for the insights @darrenpmartin - it's probably safe to assume that this affects our master branch than. Do you know if these post-processing steps are in the code that you sent?

ArtPoon commented 1 year ago

I'm going to close this issue and make three new issues covering the points raised by @darrenpmartin

darrenpmartin commented 1 year ago

Hi Art - yeah they are all there (look in the "DoRDP" subroutine). In this routine the set of detected recombination signals(call it SetA) is scanned and the recombination signal with the lowest associated p-value is identified, the HMM step is used to "polish" the breakpoint location, 17 other metrics (in addition to phylpro) are calculated and used to decide which sequence is the recombinant. some other stuff is done to identify other sequences carrying evidence of the same recombination signal currently being evaluated, the recombination signal is rechecked with all detection methods in all the "corecombinants," the verified signals are then saved in a master recombination event list (i.e not just a recombination signal list - lets call this SetB), all recombination signals involving the identified set of recombinant sequences are removed from the alignment by splitting each recombinant sequence into two parts (one from the major parent and one from the minor parent), the original recombinant sequence is removed from the alignment and replaced by the two disassembled bits (i.e. for each recombinant sequence one extra sequence gets added to the alignment), all previously detected triplets of sequences that include the recombinants are removed from the list of detected recombination signals (i.e. they are removed from SetA) and every triplet of sequences where a recombination event was detected the now disassembled recombinants is rescanned with all the detection methods (i.e.each of the original triplets is rescanned with each of the different pices of the dissasembled recombinants). the newly detected signals of recombination (if any are detected) are added to the total pool of detected recombination signals (i.e. they are added to SetA) and the process restarts until, eventuially, no more recombination signals remain in SetA. This iterative process of successively stripping away recombination signals from the analysed alignment is how RDP reconciles the (sometimes very) large numbers of initially detected recombination signals (in SetA) within a (usually much smaller) set of actual detected recombination events (the eventual elements of SetB), If you want more detail there is quite a bit more in the manual. Also feel free to call me anytime on +27 832700027 if you would like to ask any questions.

Darren

On Wed, Mar 29, 2023 at 5:32 PM Art Poon @.***> wrote:

Ok thanks for the insights @darrenpmartin https://github.com/darrenpmartin - it's probably safe to assume that this affects our master branch than. Do you know if these post-processing steps are in the code that you sent?

— Reply to this email directly, view it on GitHub https://github.com/PoonLab/OpenRDP/issues/53#issuecomment-1488851478, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADEJ3TQEPD5FTZVDQWSP3BTW6RIXXANCNFSM6AAAAAAWKR6K7A . You are receiving this because you were mentioned.Message ID: @.***>

ArtPoon commented 1 year ago

Thanks Darren. I see two declarations of DoRDP:

art@orolo:~/git/OpenRDP/RDP$ grep -r "Sub DoRDP" .
./Module3.bas:Public Sub DoRDP(SPX As Byte, CPermNo As Long)
./Module3b.bas:Public Sub DoRDP(SPX As Byte, CPermNo As Long)

Are these different versions, and if so, which one is current?

darrenpmartin commented 1 year ago

module3b is an old (like 8 years old or something) backup of module3 - I must've accidentally sent you that. Use the module3 version.

On Wed, Mar 29, 2023 at 7:57 PM Art Poon @.***> wrote:

Thanks Darren. I see two declarations of DoRDP:

@.***:~/git/OpenRDP/RDP$ grep -r "Sub DoRDP" ../Module3.bas:Public Sub DoRDP(SPX As Byte, CPermNo As Long)./Module3b.bas:Public Sub DoRDP(SPX As Byte, CPermNo As Long)

Are these different versions, and if so, which one is current?

— Reply to this email directly, view it on GitHub https://github.com/PoonLab/OpenRDP/issues/53#issuecomment-1489055527, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADEJ3TXPXOAXPO72GDPGJKLW6RZZBANCNFSM6AAAAAAWKR6K7A . You are receiving this because you were mentioned.Message ID: @.***>