Closed GoogleCodeExporter closed 8 years ago
BlastDB is deprecated. Use SequenceFileDB instead as the standard sequence
database class.
The delay that Namshin reported is due to BlastDB's support for NCBI ID mangling
(i.e. the fact that NCBI blastall reports back "fake IDs" that do not match the
original IDs in the FASTA file). BlastDB handles this by creating a lookup table
for translating the fake IDs to the correct IDs. The first request for an ID
that doesn't match triggers construction of this table, hence the delay. Note
also that this table takes up memory.
The solution is simple: switch to using the base class (SequenceFileDB, or
BlastDBbase which adds blast() etc. methods) unless you really need the NCBI ID
mangling support -- in which case this delay is unavoidable.
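To make the source of the delay concrete, here is a minimal sketch of a lazily
built ID translation table. It is not Pygr's actual code: the class name
LazyIDIndex, the resolve() method, and the toy_mangle() function are all
hypothetical, and the real mangling rules applied by blastall are not
reproduced here.

```python
class LazyIDIndex:
    """Hypothetical sketch: translate reported ("fake") IDs to original IDs."""

    def __init__(self, original_ids, mangle):
        self._ids = set(original_ids)
        self._mangle = mangle      # assumed stand-in for the ID mangling
        self._table = None         # built only on the first mismatch
        self.build_count = 0       # lets us observe when the table is built

    def resolve(self, reported_id):
        # Fast path: the reported ID is a real ID, so no table is needed.
        if reported_id in self._ids:
            return reported_id
        # Slow path: the first mismatch triggers one full pass over all IDs.
        # On a large database this pass is the delay, and the resulting
        # dictionary is the extra memory.
        if self._table is None:
            self._table = {self._mangle(i): i for i in self._ids}
            self.build_count += 1
        return self._table[reported_id]   # KeyError if still unknown


def toy_mangle(seq_id):
    # Toy rule for illustration only: keep the last "|"-separated field.
    return seq_id.split("|")[-1]


index = LazyIDIndex(["gi|123|ref|NM_0001", "plainID"], toy_mangle)
```

With this sketch, index.resolve("plainID") returns without building anything,
while the first call with a non-matching ID such as "NM_0001" pays the full
cost of building the table.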
Original comment by cjlee...@gmail.com
on 10 Dec 2008 at 9:37
OK, I have a better idea. We can simply restrict this reindexing behavior to the
specific operation of looking up IDs during a BLAST search. We only implemented
this behavior to deal with BLAST's buggy mangling of sequence IDs, so there's no
need to apply it in other situations. If it isn't applied at any other time,
looking up an ID that isn't in the database will simply fail (KeyError), with no
delay.
Questions:
- Should we do the initial reindexing at the same time as the formatdb step?
  This might reduce user annoyance, since users expect formatdb to take some
  time to reindex the database.
- Should we print out a warning message explaining that we're reindexing the
  BLAST database? This might also reduce user annoyance / confusion, by
  clearing up the mystery of "why is Pygr so slow?".
- Should we allow the user to turn off reindexing (which means that BLAST will
  not work on NCBI databases with "mangled blob" IDs)?
- Can we auto-detect whether reindexing is needed (i.e. detect whether the
  sequence IDs are blobs that blastall will mangle)? Then we could dispense
  with it completely on non-NCBI databases (or more specifically, databases
  whose IDs blastall won't mangle).
Original comment by cjlee...@gmail.com
on 11 Dec 2008 at 5:34
I renamed the reindexing class from BlastDB to BlastIDIndex. It is now only used
for looking up IDs while processing BLAST results in process_blast().
I renamed BlastDBbase to be the new BlastDB.
Reindexing will never happen in normal usage; only when actually processing
BLAST results. We may still want to consider some of the possible improvements
listed in the previous note.
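The separation described above can be sketched roughly as follows. This is an
illustration under assumptions, not the actual Pygr implementation: the names
PlainSequenceDB, ResultIDIndex and process_reported_ids() are hypothetical
stand-ins for the database class, the ID index and the result-processing step,
and the mangle argument stands in for whatever BLAST does to the IDs.

```python
class PlainSequenceDB:
    """Hypothetical database: an unknown ID fails at once with KeyError."""

    def __init__(self, seqs):
        self._seqs = dict(seqs)

    def __getitem__(self, seq_id):
        return self._seqs[seq_id]      # no reindexing, no delay

    def __iter__(self):
        return iter(self._seqs)


class ResultIDIndex:
    """Hypothetical index used only while processing search results."""

    def __init__(self, db, mangle):
        self._db = db
        self._mangle = mangle          # assumed stand-in for the mangling
        self._table = None

    def __getitem__(self, reported_id):
        try:
            return self._db[reported_id]
        except KeyError:
            # Only a reported ID that does not match triggers the reindexing.
            if self._table is None:
                self._table = {self._mangle(i): i for i in self._db}
            return self._db[self._table[reported_id]]


def process_reported_ids(reported_ids, db, mangle):
    # The index exists only for the duration of result processing.
    index = ResultIDIndex(db, mangle)
    return [index[r] for r in reported_ids]
```

In this arrangement an ordinary db["no_such_id"] lookup raises KeyError
immediately, and the translation table is only ever built from inside
process_reported_ids().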
Original comment by cjlee...@gmail.com
on 11 Dec 2008 at 11:03
On further thought, I'm not sure we need to do anything further with this.
Since this is now only applied during a BLAST search, reindexing will never be
triggered unless BLAST mangles an ID... in which case reindexing is necessary,
to rescue the mangled ID! In a way this *is* "auto-detection" of whether
reindexing is needed -- much better than always forcing re-indexing (quite
possibly unnecessary) during the formatdb step.
I can only think of one obvious improvement: print a warning message telling the
user that Pygr is re-indexing because BLAST mangled an ID, so that users don't
become too annoyed by the mysterious delay.
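One possible shape for such a warning, as a sketch only: it uses Python's
standard warnings module, and the class name WarningIDIndex, the message
wording, and the mangle argument are all hypothetical rather than anything
Pygr actually does.

```python
import warnings


class WarningIDIndex:
    """Hypothetical index that warns once, just before it re-indexes."""

    def __init__(self, original_ids, mangle):
        self._ids = list(original_ids)
        self._mangle = mangle          # assumed stand-in for the mangling
        self._table = None

    def resolve(self, reported_id):
        if self._table is None:
            # Tell the user why there is about to be a delay.
            warnings.warn(
                "re-indexing sequence IDs because BLAST reported an ID (%r) "
                "that is not in the database; this may take a while"
                % (reported_id,),
                RuntimeWarning,
                stacklevel=2,
            )
            self._table = {self._mangle(i): i for i in self._ids}
        return self._table[reported_id]
```

Because the table is built only once, the warning is issued at most once per
index, at the moment the delay actually starts.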
Original comment by cjlee...@gmail.com
on 12 Dec 2008 at 7:35
Original comment by mare...@gmail.com
on 21 Feb 2009 at 2:06
Hi Namshin,
please verify the fix to this bug that you reported, and then change its status
to Closed. We are now requiring that each fix be verified by someone other than
the developer who made the fix.
Thanks!
Chris
Original comment by cjlee...@gmail.com
on 5 Mar 2009 at 12:21
Original comment by deepr...@gmail.com
on 6 Mar 2009 at 1:56
Original issue reported on code.google.com by
deepr...@gmail.com
on 18 Nov 2008 at 10:42