Bug in alignment of phased sequences

brunoasm commented 5 years ago

Hi, I found a bug that mixes up sample names during the alignment of phased sequences.

The bug shows up if a sequence name ends with a digit, and I traced it to line 245 of align_sequences.py:

string_to_replace = '%s_'%str(locus_name)
t.id = rest_of_string.replace(string_to_replace, '')

As currently written, let's say one has a sequence named BM1842 for locus #2. In the joined file, the alleles will be named 2_BM1842_0 | 2 and 2_BM1842_1 | 2. The intended behavior is to remove the 2_ at the start, resulting in the names BM1842_0 and BM1842_1. The actual result in this case is BM1840 and BM1841, because replace() removes every occurrence of 2_, including the one formed by the final 2 of the sample name and the underscore before the allele index. In my case, there are other samples with these IDs, so one can imagine the severity of the problem.
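To make the failure easy to reproduce, here is a minimal sketch outside of secapr. The function name `strip_locus_unbounded` is made up for illustration, and I am assuming the string being cleaned is just the part of the header before the ` | 2` suffix:

```python
def strip_locus_unbounded(seq_id, locus_name):
    """Mimics the replace() call quoted above (hypothetical helper)."""
    string_to_replace = '%s_' % str(locus_name)
    # str.replace without a count removes EVERY occurrence of the pattern,
    # not only the locus prefix at the start of the string.
    return seq_id.replace(string_to_replace, '')


for allele_id in ("2_BM1842_0", "2_BM1842_1"):
    print(allele_id, "->", strip_locus_unbounded(allele_id, 2))
# 2_BM1842_0 -> BM1840
# 2_BM1842_1 -> BM1841
```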

I fixed it by limiting the number of replacements to 1:

t.id = rest_of_string.replace(string_to_replace, '', 1)

But maybe using a regex search anchored at the start of the string would be safer.
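A rough sketch of what the anchored version could look like, in case it is useful. This is only my suggestion and not code from secapr; `strip_locus_prefix` is a hypothetical name, and I am assuming the locus name should be matched literally, hence the `re.escape`:

```python
import re


def strip_locus_prefix(seq_id, locus_name):
    """Remove '<locus_name>_' only if it sits at the very start of seq_id."""
    # '^' anchors the match at the start; re.escape guards against locus
    # names that happen to contain regex metacharacters.
    pattern = r'^%s_' % re.escape(str(locus_name))
    return re.sub(pattern, '', seq_id, count=1)


print(strip_locus_prefix("2_BM1842_0", 2))  # BM1842_0
print(strip_locus_prefix("BM1842_0", 2))    # BM1842_0 (no prefix, unchanged)
print("BM1842_0".replace("2_", "", 1))      # BM1840 (count=1 still mangles it)
```

The last line shows why the anchor seems safer than limiting the count: if the string ever arrives without the locus prefix, replace(..., 1) would still strip the first 2_ it finds inside the sample name.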

tandermann commented 5 years ago

Thanks a lot Bruno for reporting this issue and finding a solution! I did as you said and implemented a replacement search anchored at the beginning of the string in order to avoid this issue in the future. Let me know if you run into any more problems!

Best, Tobi

brunoasm commented 5 years ago

great, thanks @tobiashofmann88!

tandermann commented 5 years ago

P.S. The update and some other new features are now available in the newest conda release of secapr (v.1.1.14). I also added some instructions to the end of the secapr installation tutorial on how to update to the GitHub development version, which is usually the most up to date.

AntonelliLab / seqcap_processor

Bug in alignment of phased sequences #12