bcgsc / biobloom

Create Bloom filters for a given reference and then use it to categorize sequences
http://www.bcgsc.ca/platform/bioinfo/software/biobloomtools
GNU General Public License v3.0
75 stars 15 forks source link

biobloommicategorizer -- terminate called after throwing an instance of 'std::out_of_range' #43

Closed mcmero closed 5 years ago

mcmero commented 5 years ago

I'm running into a core dump error while running biobloommicategorizer (v2.3.2). I'm creating a mask like so:

biobloommimaker -p reads cosmic_transcripts.fasta -g 56 -t 8

Then running (as part of the tap pipeline):

biobloommicategorizer -t 8 -f reads.bf --fq -p ./bbt_2.3.2/sample -e sample_R1.fastq.gz sample_R2.fastq.gz

I then run into this error (I've cut some of the output):

Loading sdsl interleaved bit vector from: reads.bf.sdsl
Loading header...
Loaded header... magic: MIBLOOMF hlen: 32 size: 273716032 nhash: 56 kmer: 25
Loading data vector
Bit Vector Size: 1470772224
Popcount: 273716032
...
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::at: __n (which is 18446744073709551615) >= this->size() (which is 148)
Aborted (core dumped)

My fastq files have different read lengths, so I am wondering whether this might be related.

JustinChu commented 5 years ago

I'll look into why this is occuring. Since it is occuring in the beginning I should be able to fix it quickly. Can you send me the first 20 or so lines of your fasta files? I also may need a copy of your cosmic_transcripts.fasta file.

Unrelated but that is a crazy amount of hash functions (-g).

mcmero commented 5 years ago

Thanks Justin. Here's what my fastq files look like: R1:

@SRR949078.1 HWI-ST942:86:D094KACXX:1:1101:1108:2117 length=100
CTTGNACCACAGCGATGAAGGTGGACGCGATCTGGGTGTGATGGTGCCGGGTCTCCAGGGCTGCAGTCACTGCCTGGTCAATGTTTCTCTCCACTGTCGT
+SRR949078.1 HWI-ST942:86:D094KACXX:1:1101:1108:2117 length=100
@@@D#2ABFHF<?EHFGEGIIGHEIIGGGHBHCHIIBAAAGHIGDE@DEFE5>CCCCCCBB?;@CC>>CC:ACC::?:AACACC@CCCCCC94<>ACCB2
@SRR949078.2 HWI-ST942:86:D094KACXX:1:1101:1121:2136 length=100
TTTGAAGATGCCGCATTTGGATTGGATGAATTCCAAATTCTGCTTGCTTGCTTTTTAATATTGATATGCTTATACACTTACACTTTATGCACAAAATGTA
+SRR949078.2 HWI-ST942:86:D094KACXX:1:1101:1121:2136 length=100
????DBDB:22+0)@8<AFE9CEDDEDEEEIIDDEEDIIIECEEIIDCDD>EBDDCCCBCEI<AAEIIEEIEIIDD;ADDA>;?ADABAADAA???AA>B
@SRR949078.3 HWI-ST942:86:D094KACXX:1:1101:1108:2161 length=100
GAAATTTGTGTGGTTCGCTTCACACCAGTAACTGAAGAAGATCAAATTTCTTATACTTTGCTCTTTGCATACTTCAGTAGCAGAAAGCGCTATGGAGTAN
+SRR949078.3 HWI-ST942:86:D094KACXX:1:1101:1108:2161 length=100
CCCFFFFFHHHHHIJJIJJJJJJJJJJJIJJJIJJJJJJJJJJJJJJJJIJJIJIJJJJIIJJJJJJJJJJHHHHHFHFFFFFFEEEEDDDDDDDDDCCC
@SRR949078.4 HWI-ST942:86:D094KACXX:1:1101:1185:2168 length=100
CTGACCTGACCTGCTCTGGGATCCGCTGTGTGCGTCACGCTTGCCTGAAACATGAACAGAATCGGCAAGACCTGGTGAAAGCTGGCGTGCTGCCTCTGCT
+SRR949078.4 HWI-ST942:86:D094KACXX:1:1101:1185:2168 length=100
BBCFDDDFDHHHFJJIJJJJIJIJIJIIEGDHDHHJJIJJIJJGGIJGIGHIGG>ECEEHA3=ADFDDDDDDDCD:5@@C@?CAA8>09B?@ABCA>>C>
@SRR949078.5 HWI-ST942:86:D094KACXX:1:1101:1138:2172 length=100
CAATTATATTACTACCACTGACATGACTTTCCAAAAAAAACACATAATTTGAATCAACACAACCACCCACAGCCTAATTATTAGCATCATCCCTCTACTA
+SRR949078.5 HWI-ST942:86:D094KACXX:1:1101:1138:2172 length=100
CCCFFFFFHHHGGIJJJJJJJIJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIIJJJIHHHHFFFDDDDDDDDDDDEEEEDDDDEDDDDDDDDDDDEA

R2

@SRR949078.1 HWI-ST942:86:D094KACXX:1:1101:1108:2117 length=100
CTTGNACCACAGCGATGAAGGTGGACGCGATCTGGGTGTGATGGTGCCGGGTCTCCAGGGCTGCAGTCACTGCCTGGTCAATGTTTCTCTCCACTGTCGT
+SRR949078.1 HWI-ST942:86:D094KACXX:1:1101:1108:2117 length=100
@@@D#2ABFHF<?EHFGEGIIGHEIIGGGHBHCHIIBAAAGHIGDE@DEFE5>CCCCCCBB?;@CC>>CC:ACC::?:AACACC@CCCCCC94<>ACCB2
@SRR949078.2 HWI-ST942:86:D094KACXX:1:1101:1121:2136 length=100
TTTGAAGATGCCGCATTTGGATTGGATGAATTCCAAATTCTGCTTGCTTGCTTTTTAATATTGATATGCTTATACACTTACACTTTATGCACAAAATGTA
+SRR949078.2 HWI-ST942:86:D094KACXX:1:1101:1121:2136 length=100
????DBDB:22+0)@8<AFE9CEDDEDEEEIIDDEEDIIIECEEIIDCDD>EBDDCCCBCEI<AAEIIEEIEIIDD;ADDA>;?ADABAADAA???AA>B
@SRR949078.3 HWI-ST942:86:D094KACXX:1:1101:1108:2161 length=100
GAAATTTGTGTGGTTCGCTTCACACCAGTAACTGAAGAAGATCAAATTTCTTATACTTTGCTCTTTGCATACTTCAGTAGCAGAAAGCGCTATGGAGTAN
+SRR949078.3 HWI-ST942:86:D094KACXX:1:1101:1108:2161 length=100
CCCFFFFFHHHHHIJJIJJJJJJJJJJJIJJJIJJJJJJJJJJJJJJJJIJJIJIJJJJIIJJJJJJJJJJHHHHHFHFFFFFFEEEEDDDDDDDDDCCC
@SRR949078.4 HWI-ST942:86:D094KACXX:1:1101:1185:2168 length=100
CTGACCTGACCTGCTCTGGGATCCGCTGTGTGCGTCACGCTTGCCTGAAACATGAACAGAATCGGCAAGACCTGGTGAAAGCTGGCGTGCTGCCTCTGCT
+SRR949078.4 HWI-ST942:86:D094KACXX:1:1101:1185:2168 length=100
BBCFDDDFDHHHFJJIJJJJIJIJIJIIEGDHDHHJJIJJIJJGGIJGIGHIGG>ECEEHA3=ADFDDDDDDDCD:5@@C@?CAA8>09B?@ABCA>>C>
@SRR949078.5 HWI-ST942:86:D094KACXX:1:1101:1138:2172 length=100
CAATTATATTACTACCACTGACATGACTTTCCAAAAAAAACACATAATTTGAATCAACACAACCACCCACAGCCTAATTATTAGCATCATCCCTCTACTA
+SRR949078.5 HWI-ST942:86:D094KACXX:1:1101:1138:2172 length=100
CCCFFFFFHHHGGIJJJJJJJIJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIIJJJIHHHHFFFDDDDDDDDDDDEEEEDDDDEDDDDDDDDDDDEA

Here's the reference: cosmic_transcripts.fasta.gz

Is there a recommended range for the hash function?

(edit -- posted wrong read files)

JustinChu commented 5 years ago

Typically no more than 8, 3-6 range would be standard. In the TAP pipeline I thought it used 4 spaced seeds rather than k-mers.

@readmanchiu Does the TAP pipeline automatically make the filter with spaced seeds or is it manual?

JustinChu commented 5 years ago

The problem does seem to be due to read length issues. For a quick fix for now you could try filtering out all read pairs with one read shorter than the k-mer size (in your example 25).

Regarding the targets file: If you want reads per gene (rather than per transcript), it may make sense to group the transcripts by gene first in seperate files and use the -F option (group by file).

readmanchiu commented 5 years ago

TAP runs biobloommicategorizer (or biobloomcategorizer in previous versions). It's hard-coded and assumes space seed filtered, and the command is not configurable. The onus is on the user to make sure the bloomfilter is compatible with categorizer command

JustinChu commented 5 years ago

Hmmm... It may make sense to add at least one default/example set of seeds somewhere in the documentation.

Is this linked to users somewhere in the TAP documentation?

JustinChu commented 5 years ago

I think I've fixed it and pushed a fix to the master branch. @mcmero if possible can you send me the read that it fails at? It seems that the reads you posted don't actually have any problems with them. I started testing with some variable length reads and fixed some issues but I want to make sure everything is resolved. Thanks.

mcmero commented 5 years ago

I tried running it again with the latest fix, as well as a regenerated bloomfilter using group by file and space seed filtering. It's now working without issues. Thanks.