DavidsonGroup / flexiplex

The Flexible Demultiplexer
https://davidsongroup.github.io/flexiplex/
MIT License
26 stars 2 forks source link

Panic due to out of range error when finding the substring #43

Closed olliecheng closed 4 months ago

olliecheng commented 7 months ago

Moving issue here so it’s easier to find and document.

Original question by @ashokpatowary:

I have another follow up question. If i try running flexiplex with -u ???????? -b ?????????? and -k with barcode file (barcode sequences 10bp) it throws the following error; but if I reduce the "-b" wild character length to 7 "?" with barcode files having 10bp barcodes; it works. Any suggestion whats going on

Setting max barcode edit distance to 2
Setting number of threads to 24
For usage information type: flexiplex -h
No filename given... getting reads from stdin...
Searching for barcodes...
terminate called after throwing an instance of 'std::out_of_range'
what():  basic_string::substr
Aborted

Originally posted by @ashokpatowary in https://github.com/DavidsonGroup/flexiplex/issues/41#issuecomment-2033577573

nadiadavidson commented 7 months ago

Here is a toy example of my error, which may be relates, and also how I fixed it. Not very elegant, but might be useful to you @ashokpatowary while @olliecheng is looking into the problem.

Data: example.fastq

@Read1
ATGGGTTATGCAGCATTAGTGGCCGATGTTTCGCATCGGCGTACGACTAAGAGATCATCCACGTGCTTGAGCTGCACATTATCACAGCTTGCAGGTGGATTCTCTTGCTTAGTTACCTTTTTCCATGTTTTGTTTTCTGGTGGTGCTGTG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Running: cat example.fastq | $FLEXIPLEX -u "??????????" -b "????????" -x "GTGGCCGATGTT" -k "TATCAGCA" -f 1 -e 1 dies with :

FLEXIPLEX 1.01
Setting UMI to search for: ??????????
Setting barcode to search for: ????????
Adding flank sequence to search for: GTGGCCGATGTT
Setting known barcodes from TATCAGCA
Number of known barcodes: 1
Setting max flanking sequence edit distance to 1
Setting max barcode edit distance to 1
For usage information type: flexiplex -h
No filename given... getting reads from stdin...
Searching for barcodes...
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr
Aborted (core dumped)

But if I add some buffer sequence at the start, I don't get the error: cat example.fastq | sed "/[@,+]/! s/^/987654321/g" | $FLEXIPLEX -x "321" -u "??????????" -b "????????" -x "GTGGCCGATGTT" -k "TATCAGCA" -f 1 -e 1


FLEXIPLEX 1.01
Adding flank sequence to search for: 321
Setting UMI to search for: ??????????
Setting barcode to search for: ????????
Adding flank sequence to search for: GTGGCCGATGTT
Setting known barcodes from TATCAGCA
Number of known barcodes: 1
Setting max flanking sequence edit distance to 1
Setting max barcode edit distance to 1
For usage information type: flexiplex -h
No filename given... getting reads from stdin...
Searching for barcodes...
@TATCAGCA_321ATGGGTT#LH00144:80:22HFNLLT3:5:2243:16607:9844_+1of1
TTCGCATCGGCGTACGACTAAGAGATCATCCACGTGCTTGAGCTGCACATTATCACAGCTTGCAGGTGGATTCTCTTGCTTAGTTACCTTTTTCCATGTTTTGTTTTCTGGTGGTGCTGTG
+TATCAGCA_321ATGGGTT#LH00144:80:22HFNLLT3:5:2243:16607:9844_+1of1
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
Number of reads processed: 1
Number of reads where a barcode was found: 1
Number of reads where more than one barcode was found: 0
All done!
ChangqingW commented 7 months ago

Here is a toy example of my error, which may be relates, and also how I fixed it. Not very elegant, but might be useful to you @ashokpatowary while @olliecheng is looking into the problem.

Data: example.fastq

@Read1
ATGGGTTATGCAGCATTAGTGGCCGATGTTTCGCATCGGCGTACGACTAAGAGATCATCCACGTGCTTGAGCTGCACATTATCACAGCTTGCAGGTGGATTCTCTTGCTTAGTTACCTTTTTCCATGTTTTGTTTTCTGGTGGTGCTGTG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Running: cat example.fastq | $FLEXIPLEX -u "??????????" -b "????????" -x "GTGGCCGATGTT" -k "TATCAGCA" -f 1 -e 1 dies with :

FLEXIPLEX 1.01
Setting UMI to search for: ??????????
Setting barcode to search for: ????????
Adding flank sequence to search for: GTGGCCGATGTT
Setting known barcodes from TATCAGCA
Number of known barcodes: 1
Setting max flanking sequence edit distance to 1
Setting max barcode edit distance to 1
For usage information type: flexiplex -h
No filename given... getting reads from stdin...
Searching for barcodes...
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr
Aborted (core dumped)

But if I add some buffer sequence at the start, I don't get the error: cat example.fastq | sed "/[@,+]/! s/^/987654321/g" | $FLEXIPLEX -x "321" -u "??????????" -b "????????" -x "GTGGCCGATGTT" -k "TATCAGCA" -f 1 -e 1


FLEXIPLEX 1.01
Adding flank sequence to search for: 321
Setting UMI to search for: ??????????
Setting barcode to search for: ????????
Adding flank sequence to search for: GTGGCCGATGTT
Setting known barcodes from TATCAGCA
Number of known barcodes: 1
Setting max flanking sequence edit distance to 1
Setting max barcode edit distance to 1
For usage information type: flexiplex -h
No filename given... getting reads from stdin...
Searching for barcodes...
@TATCAGCA_321ATGGGTT#LH00144:80:22HFNLLT3:5:2243:16607:9844_+1of1
TTCGCATCGGCGTACGACTAAGAGATCATCCACGTGCTTGAGCTGCACATTATCACAGCTTGCAGGTGGATTCTCTTGCTTAGTTACCTTTTTCCATGTTTTGTTTTCTGGTGGTGCTGTG
+TATCAGCA_321ATGGGTT#LH00144:80:22HFNLLT3:5:2243:16607:9844_+1of1
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
Number of reads processed: 1
Number of reads where a barcode was found: 1
Number of reads where more than one barcode was found: 0
All done!

I covered this case, and will pad the missing bit of UMI with Ns:

Searching for barcodes...
@TATCAGCA_NNNATGGGTT#Read1_+1of1
TTCGCATCGGCGTACGACTAAGAGATCATCCACGTGCTTGAGCTGCACATTATCACAGCTTGCAGGTGGATTCTCTTGCTTAGTTACCTTTTTCCATGTTTTGTTTTCTGGTGGTGCTGTG
+TATCAGCA_NNNATGGGTT#Read1_+1of1
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
Number of reads processed: 1
Number of reads where a barcode was found: 1
Number of reads where more than one barcode was found: 0
All done!

But apparently @ashokpatowary still ran into an edge case after 0.5 million reads in their data.

ashokpatowary commented 7 months ago

@nadiadavidson Thanks for the suggestion.

Considering that there might be truncation of UMI or 2nd barcode in the ONT data; I ran the same command on the data generated from PacBio for the same library. PacBio data was very clean and barcode and UMIs can be identified even without edit distance. However same error was encountered in that data also.

ChangqingW commented 7 months ago

Hi @ashokpatowary , there should be a core dump file in the folder where you ran the flexiplex command if you have set ulimit -c unlimited before running, could you make sure you are using the latest binary from this branch and share the core dump file so we can debug where the bug occurred? Or alternatively can we have access your data to reproduce the bug?

ashokpatowary commented 7 months ago

@ChangqingW please find the file attached; created when -u was "0" core.204719.zip

ChangqingW commented 7 months ago

Ops, I did not realize the cpp library versions need to match for gdb to work, apparently we are running on different ones. Could you do gdb bin/flexiplex-linux core.204719, then bt inside the gdb debugger? If there is a function call named with something like substr, could you do up until you are at that stack, and do info args and info locals?

olliecheng commented 6 months ago

Hi @ashokpatowary, just as a precaution, are you indeed sure that you are running on the version of Flexiplex listed in @ChangqingW's branch? If you're compiling on your own machine, after changing branch you may need to make clean before recompiling with make. I say this as strings core.204719 | grep '[0-9]\.[0-9][0-9]' suggests you may potentially be running Flexiplex 0.97.1 instead of the latest version is 1.01, which is also the version number in ChangqingW's patch. Of course, this is a crude method of checking and finding-strings-from-bytecode is very inconsistent.

If that's not the culprit, are you possibly able to share the binary you ran? (No need to do the debugging mentioned above, if you can provide a matching binary.) None of the recent binaries released on our GitHub or in @ChangqingW 's fork match up, according to gdb. Of course, if you could send a dataset containing the error, that would be the best way for us to resolve this issue!

@ChangqingW - the libc versions need to match up, but luckily, the WEHI HPC is still on libc2.17 as well. I ran into an issue acquiring libc debug symbols without root access, so instead, I used the quay.io/pypa/manylinux2014_x86_64 image through a local Docker container, which runs on RHEL and contains the correct libc version.

[root@24e652f261cf flexiplex]# gdb flexiplex-linux core.204719
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /root/flexiplex/flexiplex-linux...done.

warning: exec file is newer than core file.
[New LWP 204722]
[New LWP 204719]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/u/home/a/ashokpat/project-geschwind/Tools/flexiplex/bin/flexiplex-linux -k /u/'.
Program terminated with signal 6, Aborted.
#0  0x00002af63f8ea387 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
55    return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) q
ChangqingW commented 4 months ago

I wonder if we should consider this as fixed by the pr and merge, I think all string subset calls are covered. In any case we can always reopen if it happened again.

olliecheng commented 4 months ago

Good idea, I'm happy to consider this fixed by #42.