Status: Closed (brunoasm closed this issue 5 years ago)
Thanks a lot Bruno for reporting this issue and finding a solution! I did as you said and implemented a replacement search anchored at the beginning of the string in order to avoid this issue in the future. Let me know if you run into any more problems!
Best, Tobi
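For readers following along, an anchored replacement of the kind described above could look roughly like the sketch below. This is not the code that was committed to secapr. The function name `strip_locus_prefix` and the assumption that the prefix has the form `<locus>_` are illustrative only.

```python
import re


def strip_locus_prefix(seq_id, locus_name):
    """Remove '<locus>_' only when it occurs at the very start of seq_id.

    Hypothetical helper for illustration; not the secapr implementation.
    """
    # re.escape guards against locus names containing regex metacharacters.
    pattern = '^' + re.escape('%s_' % locus_name)
    return re.sub(pattern, '', seq_id)


if __name__ == '__main__':
    for name in ('2_BM1842_0', '2_BM1842_1', 'BM1842_0'):
        print(name, '->', strip_locus_prefix(name, 2))
```

One reason to prefer anchoring over `replace(..., 1)`: if an ID ever arrives without the locus prefix, a count-limited replace would still delete the first `2_` it finds inside the sample name, whereas the anchored version leaves the ID unchanged. On Python 3.9 or later, `str.removeprefix` gives the same anchored behavior without a regex.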
On 26 Jan 2019, at 21:21, Bruno de Medeiros notifications@github.com wrote:
Hi, I found a bug that was mixing sample names during the alignment of phased sequences.
The bug shows up if a sequence name ends with a digit, and I traced it to line 245 of align_sequences.py:
    string_to_replace = '%s_'%str(locus_name)
    t.id = rest_of_string.replace(string_to_replace, '')
As currently written, let's say one has a sequence named BM1842 for locus #2. In the joined file, the alleles will be named 2_BM1842_0 | 2 and 2_BM1842_1 | 2. The intended behavior is to remove 2_ at the start, resulting in names BM1842_0 and BM1842_1. The actual result in this case will be BM1840 and BM1841. In my case, there are other samples with these IDs, so one can imagine the severity of the problem.
I fixed it by limiting the number of replacements to 1:
    t.id = rest_of_string.replace(string_to_replace, '', 1)
But maybe using a regex search anchored at the start of the string would be safer.
great, thanks @tobiashofmann88!
P.S. The update and some other new features are now available in the newest conda release of secapr (v.1.1.14). I also added some instructions to the end of the secapr installation tutorial about how to update to the GitHub development version, which is usually the most up to date.
Hi, I found a bug that was mixing sample names during the alignment of phased sequences.
The bug shows up if a sequence name ends with a digit, and I traced it to line 245 of align_sequences.py:
    string_to_replace = '%s_'%str(locus_name)
    t.id = rest_of_string.replace(string_to_replace, '')
As currently written, let's say one has a sequence named BM1842 for locus #2. In the joined file, the alleles will be named 2_BM1842_0 | 2 and 2_BM1842_1 | 2. The intended behavior is to remove 2_ at the start, resulting in names BM1842_0 and BM1842_1. The actual result in this case will be BM1840 and BM1841. In my case, there are other samples with these IDs, so one can imagine the severity of the problem.
I fixed it by limiting the number of replacements to 1:
    t.id = rest_of_string.replace(string_to_replace, '', 1)
But maybe using a regex search anchored at the start of the string would be safer.
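The behavior reported here can be reproduced with plain Python strings, outside of secapr. The sketch below assumes the prefix is built as `'<locus>_'`, as the description implies. The helper names `strip_locus_naive` and `strip_locus_once` are hypothetical and only stand in for the two variants of the `replace` call.

```python
def strip_locus_naive(seq_id, locus_name):
    """Unlimited replace: removes every '<locus>_' in the ID (the reported bug)."""
    string_to_replace = '%s_' % str(locus_name)
    return seq_id.replace(string_to_replace, '')


def strip_locus_once(seq_id, locus_name):
    """Replace limited to the first occurrence (the workaround from this report)."""
    string_to_replace = '%s_' % str(locus_name)
    return seq_id.replace(string_to_replace, '', 1)


if __name__ == '__main__':
    for name in ('2_BM1842_0', '2_BM1842_1'):
        print(name, strip_locus_naive(name, 2), strip_locus_once(name, 2))
```

With locus 2, the unlimited version also matches the `2_` formed by the last digit of BM1842 and the allele separator, which is why the sample name loses its final digit. Limiting the count to 1 avoids this as long as every ID really starts with the locus prefix.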