Closed benvvalk closed 7 years ago
That is one heck of a nasty samtools bug. I've opened an issue at https://github.com/samtools/samtools/issues/474
Thanks for looking into it, Shaun! (And glad to hear they have fixed it.)
I agree with you that it is a bug, btw. Principle of least surprise.
One workaround is to use `abyss-index --fai` instead of `samtools faidx`.
❯❯❯ abyss-index --fai bar.fa
Reading `bar.fa'...
Writing `bar.fa.fai'...
❯❯❯ cat bar.fa.fai
1 4 3 4 5
1 8 11 8 9
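For anyone reading the index above: the five columns of a `.fai` file are the sequence name, sequence length, byte offset of the first base, bases per line, and bytes per line. Both entries above are named `1`, which is the duplicate-name situation discussed in this issue. Below is a minimal Python sketch that parses an index and reports repeated names. The helper names (`FaiRecord`, `parse_fai`, `duplicate_names`) are made up for illustration and are not part of ABySS, samtools, or BioBloomTools.

```python
from collections import Counter
from typing import NamedTuple


class FaiRecord(NamedTuple):
    """One line of a FASTA .fai index (hypothetical helper type)."""
    name: str       # sequence name
    length: int     # number of bases in the sequence
    offset: int     # byte offset of the first base in the FASTA file
    linebases: int  # bases per line
    linewidth: int  # bytes per line, including the newline


def parse_fai(text):
    """Parse the text of a .fai file into a list of FaiRecord."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, length, offset, linebases, linewidth = line.split()[:5]
        records.append(
            FaiRecord(name, int(length), int(offset), int(linebases), int(linewidth))
        )
    return records


def duplicate_names(records):
    """Return the sorted list of names that occur more than once."""
    counts = Counter(r.name for r in records)
    return sorted(name for name, n in counts.items() if n > 1)


# The two index lines shown above, written with tab separators.
fai_text = "1\t4\t3\t4\t5\n1\t8\t11\t8\t9\n"
print(duplicate_names(parse_fai(fai_text)))  # ['1']
```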
@sjackman Cool, didn't know about that. Thanks!
`abyss-index` by default (without the `--fai` option) creates both the FAI and FM index.
FAI files are no longer required.
I uncovered an odd bug where the results seem to be affected by the FASTA headers in the reference sequences. Here is a minimal example:
Data
ref.fa:
ref.renamed.fa (same sequences as above, with headers renamed):
query.fa:
Results
Results of querying ref.fa (NO HIT!):
Results of querying ref.renamed.fa (HIT (the correct answer!)):
Explanation
I suspect that the problem is due to a quirk of `samtools faidx` and is not BioBloomTools' fault. For example, compare the following two files:
ref.fa.fai:
ref.renamed.fa.fai:
So it is important to make sure that all of the FASTA IDs for the reference sequences are unique. I think for most users that will be the case, but in my application I am using BioBloomTools to map read pairs to read pairs and this problem crops up.
If the issue can't be fixed, I recommend putting some kind of warning in the README about making sure the FASTA names are unique.
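In the meantime, a quick pre-flight check can catch the problem before indexing. Here is a minimal Python sketch, assuming the ID of a record is the header text after `>` up to the first whitespace. The function names and the example records are hypothetical and are not the files from this report.

```python
from collections import Counter


def fasta_ids(lines):
    """Yield each record's ID: header text after '>' up to the first whitespace."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""


def duplicated_ids(lines):
    """Return {id: count} for every FASTA ID that appears more than once."""
    counts = Counter(fasta_ids(lines))
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


# Hypothetical read-pair style records: both mates share the ID "read1".
example = [">read1 1", "ACGT", ">read1 2", "ACGTACGT", ">read2", "GG"]
print(duplicated_ids(example))  # {'read1': 2}
```

To check a real file, pass an open file handle instead of the list, for example `duplicated_ids(open("ref.fa"))`. An empty result means every ID is unique.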