bcgsc / biobloom

Create Bloom filters for a given reference and then use it to categorize sequences
http://www.bcgsc.ca/platform/bioinfo/software/biobloomtools
GNU General Public License v3.0
75 stars 15 forks source link

results affected by FASTA headers #7

Closed benvvalk closed 7 years ago

benvvalk commented 9 years ago

I uncovered an odd bug where the results seem to be affected by the FASTA headers in the reference sequences. Here is a minimal example:

Data

ref.fa:

>ERR365834.4916141      HWI-M01013:41:000000000-A5KUT:1:2107:9861:26375/1
ACCTTAAAGCTTTTTACCCCTCTTAAGTTATATTTCATACAAAACACAACACACACAAAAAATTGCCCCTTAAAGAAAAGTCACAAGATTTAATATATATGCCAAATTATTGAACCTATCATCCTACTATATTATTTGGGAAGTACATTTTAAATGTAACCTTAATTTTTGTAAGTAATTACATAGGTATGCCA
>ERR365834.4916141      HWI-M01013:41:000000000-A5KUT:1:2107:9861:26375/2
TAACAACATCAAATTTGACCACGAGGTCATACCTGAAATGGATTGCTGTGAAACCGGTACAGACTCTTTCACAGCTATTCAGAACTGCACTCAGTTAGCTGATAAAATTTGCAAGTCTTCGTGGGGGCATGCAGATGATGTGCGTATTGACCAATATAG

ref.renamed.fa (same sequences as above, with headers renamed):

>ERR365834.4916141/1
ACCTTAAAGCTTTTTACCCCTCTTAAGTTATATTTCATACAAAACACAACACACACAAAAAATTGCCCCTTAAAGAAAAGTCACAAGATTTAATATATATGCCAAATTATTGAACCTATCATCCTACTATATTATTTGGGAAGTACATTTTAAATGTAACCTTAATTTTTGTAAGTAATTACATAGGTATGCCA
>ERR365834.4916141/2
TAACAACATCAAATTTGACCACGAGGTCATACCTGAAATGGATTGCTGTGAAACCGGTACAGACTCTTTCACAGCTATTCAGAACTGCACTCAGTTAGCTGATAAAATTTGCAAGTCTTCGTGGGGGCATGCAGATGATGTGCGTATTGACCAATATAG

query.fa:

>SRR519624.2022476
TAGTAGGATGATAGGTTCAATAATTTGGCATATATATTAAATCTTGTGACTTTTCTTTAAGGGGCAATTTTTTGTGTGTGTTGTGTTTTGTATGATATAT

Results

Results of querying ref.fa (NO HIT!):

$ samtools faidx ref.fa
$ biobloommaker -k 15 -p ref -f 0.001 -n 100000 ref.fa
$ biobloomcategorizer -d ref -f ref.bf -s 0.2 query.fa 2>&1 >/dev/null | tail -5 | column -t
filter_id   hits  misses  shared  rate_hit  rate_miss  rate_shared
ref         0     1       0       0         1          0
multiMatch  0     1       0       0         1          0
noMatch     1     0       0       1         0          0

Results of querying ref.renamed.fa (HIT (the correct answer!)):

$ samtools faidx ref.renamed.fa
$ biobloommaker -k 15 -p ref.renamed -f 0.001 -n 100000 ref.renamed.fa
$ biobloomcategorizer -d ref.renamed -f ref.renamed.bf -s 0.2 query.fa 2>&1 >/dev/null | tail -5 | column -t
filter_id    hits  misses  shared  rate_hit  rate_miss  rate_shared
ref.renamed  1     0       0       1         0          0
multiMatch   0     1       0       0         1          0
noMatch      0     1       0       0         1          0

Explanation

I suspect that the problem is due to a quirk of samtools faidx and is not BioBloomTools' fault. For example, compare the following two files:

ref.fa.fai:

ERR365834.4916141       159     333     159     160
ERR365834.4916141       159     333     159     160

ref.renamed.fa.fai:

ERR365834.4916141/1     194     21      194     195
ERR365834.4916141/2     159     237     159     160

So it is important to make sure that all of the FASTA IDs for the reference sequences are unique. I think for most users that will be the case, but in my application I am using BioBloomTools to map read pairs to read pairs and this problem crops up.

If the issue can't be fixed, I recommend putting some kind of warning in the README about making sure the FASTA names are unique.

sjackman commented 8 years ago

That is one heck of a nasty samtools bug. I've opened an issue at https://github.com/samtools/samtools/issues/474

benvvalk commented 8 years ago

Thanks for looking into it, Shaun! (And glad to hear they have fixed it.)

I agree with you that it is a bug, btw. Principle of least surprise.

sjackman commented 8 years ago

One workaround is to use abyss-index --fai instead of samtools faidx.

❯❯❯ abyss-index --fai bar.fa
Reading `bar.fa'...
Writing `bar.fa.fai'...
❯❯❯ cat bar.fa.fai
1   4   3   4   5
1   8   11  8   9
benvvalk commented 8 years ago

@sjackman Cool, didn't know about that. Thanks!

sjackman commented 8 years ago

abyss-index by default (without the --fai option) creates both the FAI and FM index.

JustinChu commented 7 years ago

Fai file are no longer required.