PoonLab / OpenRDP

An open-source re-implementation of the RDP4 recombination detection program
GNU General Public License v3.0

DNA_ALPHABET does not include IUPAC symbols #46

Open ArtPoon opened 1 year ago

ArtPoon commented 1 year ago

https://github.com/PoonLab/OpenRDP/blob/d5cb63962586120b65f22ba7d5d2345969e24517/openrdp/__init__.py#L29

This can cause problems because sequences with mixtures (e.g., N or R) will cause the program to exit! https://github.com/PoonLab/OpenRDP/blob/d5cb63962586120b65f22ba7d5d2345969e24517/openrdp/__init__.py#L315-L316
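To illustrate the idea, here is a minimal sketch of an alphabet that also accepts the IUPAC ambiguity codes, with a helper that reports offending symbols instead of exiting. The names `IUPAC_DNA_ALPHABET` and `invalid_symbols` are hypothetical and are not taken from `openrdp/__init__.py`; whether `*` should also be accepted is left open here.

```python
# Hypothetical names for illustration; not the definitions in openrdp/__init__.py.
UNAMBIGUOUS = set("ACGT")
GAP_CHARS = set("-")
# Standard IUPAC nucleotide ambiguity codes (two-, three- and four-base mixtures).
IUPAC_AMBIGUOUS = set("RYSWKMBDHVN")
IUPAC_DNA_ALPHABET = UNAMBIGUOUS | GAP_CHARS | IUPAC_AMBIGUOUS


def invalid_symbols(seq, alphabet=IUPAC_DNA_ALPHABET):
    """Return a sorted list of the symbols in seq that are not in alphabet."""
    return sorted(set(seq.upper()) - alphabet)


if __name__ == "__main__":
    for seq in ("ACGTNR-", "ACGTXZ"):
        bad = invalid_symbols(seq)
        print(seq, "ok" if not bad else f"invalid symbols: {bad}")
```

Returning the offending symbols lets the caller decide whether to warn, skip the sequence, or stop.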

ArtPoon commented 1 year ago

Things might be more complicated than this - do any of the RDP methods reject sequences with mixtures?

darrenpmartin commented 1 year ago

If you're going to do something to handle symbols other than A, T, G, C and -, you should probably write separate versions of the main triplet scanning loops to do this. Handling these in the inner loop will be very expensive.

On Wed, Feb 8, 2023 at 5:27 AM Art Poon wrote:

Things might be more complicated than this - do any of the RDP methods reject sequences with mixtures?

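One way to read the suggestion about separate loop versions is to decide once per alignment which routine to use, so the common case never pays for ambiguity handling. The sketch below uses a simple pairwise difference count as a stand-in for the triplet scanning loops, which are not shown in this thread. All function names are hypothetical, and treating two symbols as matching when their sets of compatible bases overlap is an assumption, not something RDP is known to do.

```python
# Bases compatible with each IUPAC nucleotide code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
PLAIN = set("ACGT-")


def count_diffs_plain(a, b):
    """Fast path: direct character comparison, no ambiguity handling."""
    return sum(x != y for x, y in zip(a, b))


def count_diffs_ambiguous(a, b):
    """Slow path: symbols differ only if they share no compatible base."""
    diffs = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        # Symbols missing from the table (such as "-") are compatible only with themselves.
        if not set(IUPAC.get(x, x)) & set(IUPAC.get(y, y)):
            diffs += 1
    return diffs


def pick_counter(alignment):
    """Choose the routine once, outside the inner loop."""
    if all(set(seq) <= PLAIN for seq in alignment):
        return count_diffs_plain
    return count_diffs_ambiguous
```

With this layout, an alignment made up only of A, C, G, T and - always runs through `count_diffs_plain`, and the per-site set lookups are paid only when a mixture is actually present.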

ArtPoon commented 1 year ago

Should be a simple modification of this function: https://github.com/PoonLab/OpenRDP/blob/d5cb63962586120b65f22ba7d5d2345969e24517/openrdp/common.py#L46-L52

I have not yet found where the characters "-" and "*" are being handled.

ArtPoon commented 1 year ago

Mixtures can be resolved or averaged at a pre-processing step, before looping over triplets.
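A rough sketch of both pre-processing options follows. "Resolved" is taken here to mean replacing each ambiguity code with one randomly chosen compatible base, and "averaged" to mean giving each compatible base an equal fractional weight. Both readings are assumptions, and the function names are hypothetical.

```python
import random

# Bases compatible with each IUPAC nucleotide code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def resolve_mixtures(seq, rng):
    """Replace each ambiguity code with one compatible base chosen by rng.

    Symbols not in the IUPAC table (for example "-") are left unchanged.
    """
    return "".join(rng.choice(IUPAC[c]) if c in IUPAC else c for c in seq.upper())


def average_mixtures(seq):
    """Return one dict per site mapping each compatible base to an equal weight.

    Sites with symbols outside the table (for example "-") get an empty dict.
    """
    weights = []
    for c in seq.upper():
        bases = IUPAC.get(c, "")
        weights.append({b: 1.0 / len(bases) for b in bases})
    return weights


if __name__ == "__main__":
    rng = random.Random(1)  # seeded so that the resolution is reproducible
    print(resolve_mixtures("ACRTN-", rng))
    print(average_mixtures("AR-"))
```

Resolving keeps the data as plain strings, so the existing triplet loops would not need to change. Averaging preserves the uncertainty, but requires downstream code that can work with per-site weights.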

ArtPoon commented 1 year ago

I just noticed that the test statement in __init__.py uses "and" instead of "or", so it is letting through sequences that contain invalid characters anyhow!
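The statement itself is not quoted in this thread, so the following is only a hypothetical reconstruction of this kind of bug. It assumes the guard combines two independent checks; `same_length` and `only_valid_chars` are made-up stand-ins for whatever the real code calls.

```python
# Hypothetical stand-ins for the two checks; the real functions are not shown here.
def same_length(seqs):
    """True if every sequence has the same length."""
    return len({len(s) for s in seqs}) <= 1


def only_valid_chars(seqs, alphabet=frozenset("ACGT-")):
    """True if every sequence uses only symbols from alphabet."""
    return all(set(s) <= alphabet for s in seqs)


def reject_with_and(seqs):
    """Buggy guard: rejects only when BOTH checks fail."""
    return not same_length(seqs) and not only_valid_chars(seqs)


def reject_with_or(seqs):
    """Intended guard: rejects when EITHER check fails."""
    return not same_length(seqs) or not only_valid_chars(seqs)


if __name__ == "__main__":
    aligned_but_invalid = ["ACGT", "ACNT"]  # equal lengths, but contains N
    print(reject_with_and(aligned_but_invalid))  # False: the input slips through
    print(reject_with_or(aligned_but_invalid))   # True: the input is rejected
```

By De Morgan's law, `not a or not b` is the same as `not (a and b)`, which is the usual way to say "reject unless every check passes".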