eftsung / pygr

Automatically exported from code.google.com/p/pygr

Initial delay for membership checking in seqdb.BlastDB #49

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
I found another slowdown in seqdb.BlastDB. For previous discussion, you can
see postings on http://groups.google.co.kr/group/pygr-dev/t/48c4a3b6d0fec0e6

I made a seqdb.BlastDB, and it has 30 million sequences.

>>> from pygr import seqdb
>>> R1 = seqdb.BlastDB('R1')
>>> R1.has_key('1')  # an existing sequence ID: fast
True
>>> R1.has_key('3126554')  # another existing sequence ID: fast
True
>>> R1.has_key('A')  # first non-existing sequence ID: took about 5 minutes to return
False
>>> R1.has_key('B')  # a later non-existing sequence ID: fast
False
>>> R1.has_key('C')  # a later non-existing sequence ID: fast
False

I don't know what is happening here, but it means we have to wait a few
minutes the first time we pass a non-existing sequence ID to seqdb.BlastDB.

Original issue reported on code.google.com by deepr...@gmail.com on 18 Nov 2008 at 10:42

GoogleCodeExporter commented 8 years ago
BlastDB is deprecated.  Use SequenceFileDB instead as the standard sequence
database class.

The delay that Namshin reported is due to BlastDB's support for NCBI ID mangling
(i.e. the fact that NCBI blastall reports back "fake IDs" that do not match the
original IDs in the FASTA file).  BlastDB handles this by creating a lookup table
for translating the fake IDs to the correct IDs.  The first request for an ID
that doesn't match triggers construction of this table, hence the delay.  Note
also that this table takes up memory.
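
To make the timing pattern concrete, here is a minimal sketch of a lookup table
that is built lazily on the first miss.  It is not Pygr's actual code: the class
name LazyIDIndex, the builds counter, and the assumption that a fake ID is one
'|'-separated piece of the original ID are all made up for illustration.

```python
class LazyIDIndex(object):
    """Hypothetical sketch: wraps a dict of sequences and builds an
    alternate-ID table only when an exact lookup fails for the first time."""

    def __init__(self, seqs, delimiter='|'):
        self.seqs = seqs            # maps original FASTA ID -> sequence
        self.delimiter = delimiter  # assumed separator inside compound IDs
        self._alt = None            # alternate-ID table, built lazily
        self.builds = 0             # counts full scans, for illustration only

    def _build(self):
        # One pass over every ID in the database: this is the slow step.
        self.builds += 1
        alt = {}
        for real_id in self.seqs:
            for piece in real_id.split(self.delimiter):
                if piece and piece != real_id:
                    alt.setdefault(piece, real_id)
        self._alt = alt

    def __contains__(self, seq_id):
        if seq_id in self.seqs:     # exact match: no table needed
            return True
        if self._alt is None:       # the first miss pays for the full scan
            self._build()
        return seq_id in self._alt  # later misses reuse the table


db = LazyIDIndex({'gi|1|ref|NM_1': 'ACGT', '2': 'GGCC'})
'2' in db   # exact hit, no scan
'A' in db   # first miss, triggers the scan
'B' in db   # later miss, table already built
```

With 30 million IDs the single scan in _build() is where the minutes would go,
and the resulting table stays in memory afterwards.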

The solution is simple: switch to using the base class (SequenceFileDB, or
BlastDBbase which adds blast() etc. methods) unless you really need the NCBI ID
mangling support -- in which case this delay is unavoidable.

Original comment by cjlee...@gmail.com on 10 Dec 2008 at 9:37

GoogleCodeExporter commented 8 years ago
OK, I have a better idea.  We can simply restrict this reindexing behavior to
the specific operation of looking up IDs during a BLAST search.  We only
implemented this behavior to deal with BLAST's buggy mangling of sequence IDs,
so there's no need to apply it in other situations.  If it isn't applied at any
other time, looking up an ID that isn't in the database will simply fail
(KeyError), with no delay.

Questions:
- Should we do the initial reindexing at the same time as the formatdb step?
This might reduce user annoyance, since users expect formatdb to take some time
to reindex the database.

- Should we print out a warning message explaining that we're reindexing the
BLAST database?  This might also reduce user annoyance / confusion, by clearing
up the mystery of "why is Pygr so slow?".

- Should we allow the user to turn off reindexing (which means that BLAST will
not work on NCBI databases with "mangled blob" IDs)?

- Can we auto-detect whether reindexing is needed (i.e. detect whether the
sequence IDs are blobs that blastall will mangle)?  Then we could dispense with
it completely on non-NCBI databases (or more specifically, databases whose IDs
blastall won't mangle).

Original comment by cjlee...@gmail.com on 11 Dec 2008 at 5:34

GoogleCodeExporter commented 8 years ago
I renamed the reindexing class from BlastDB to BlastIDIndex.  It is now only
used for looking up IDs while processing BLAST results in process_blast().

I renamed BlastDBbase to be the new BlastDB.  

Reindexing will never happen in normal usage; only when actually processing
BLAST results.  We may still want to consider some of the possible improvements
listed in the previous note.
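
For readers following along, the split can be sketched roughly as below.  This
is an illustration under assumptions, not the code that was committed:
PlainSeqDB, IDRescueIndex and resolve_hits are hypothetical names, and splitting
IDs on '|' is assumed.

```python
class PlainSeqDB(object):
    """Hypothetical stand-in for a sequence database with exact-ID lookup."""

    def __init__(self, seqs):
        self.seqs = seqs            # maps original FASTA ID -> sequence

    def __getitem__(self, seq_id):
        return self.seqs[seq_id]    # unknown ID -> immediate KeyError

    def __contains__(self, seq_id):
        return seq_id in self.seqs  # no reindexing on a miss


class IDRescueIndex(object):
    """Hypothetical wrapper used only while reading search results."""

    def __init__(self, db, delimiter='|'):
        self.db = db
        self.delimiter = delimiter  # assumed separator inside compound IDs
        self._alt = None            # built on the first mangled ID

    def __getitem__(self, seq_id):
        try:
            return self.db[seq_id]  # default: treat it as a correct ID
        except KeyError:
            if self._alt is None:   # reindex only when a lookup fails
                self._alt = {}
                for real_id in self.db.seqs:
                    for piece in real_id.split(self.delimiter):
                        self._alt.setdefault(piece, real_id)
            return self.db[self._alt[seq_id]]


def resolve_hits(db, reported_ids):
    """Translate IDs reported by a search tool back to database entries."""
    index = IDRescueIndex(db)
    return [index[seq_id] for seq_id in reported_ids]
```

Ordinary membership checks and lookups go through PlainSeqDB and fail at once
for unknown IDs; only resolve_hits() can pay the reindexing cost.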

Original comment by cjlee...@gmail.com on 11 Dec 2008 at 11:03

GoogleCodeExporter commented 8 years ago
On further thought, I'm not sure we need to do anything further with this.
Since this is now only applied during a BLAST search, reindexing will never be
triggered unless BLAST mangles an ID... in which case reindexing is necessary,
to rescue the mangled ID!  In a way this *is* "auto-detection" of whether
reindexing is needed, which is much better than always forcing re-indexing
(quite possibly unnecessary) during the formatdb step.

I can only think of one obvious improvement: print a warning message telling
the user that Pygr is re-indexing because BLAST mangled an ID, so that users
don't become too annoyed by the mysterious delay.
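
Such a warning could look something like the sketch below, which uses the
standard warnings module.  The function name build_id_table, the message text
and the '|' delimiter are assumptions made for illustration; nothing here is
taken from Pygr itself.

```python
import warnings


def build_id_table(real_ids, delimiter='|'):
    """Hypothetical helper: warn the user, then build the fake-ID table."""
    warnings.warn('re-indexing %d sequence IDs because BLAST reported an ID '
                  'that is not in the database; this may take a while'
                  % len(real_ids))
    table = {}
    for real_id in real_ids:
        # Assumed: a mangled ID is one delimiter-separated piece of the real ID.
        for piece in real_id.split(delimiter):
            table.setdefault(piece, real_id)
    return table
```

Using warnings.warn rather than print lets callers silence or redirect the
message with the usual warning filters.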

Original comment by cjlee...@gmail.com on 12 Dec 2008 at 7:35

GoogleCodeExporter commented 8 years ago

Original comment by mare...@gmail.com on 21 Feb 2009 at 2:06

GoogleCodeExporter commented 8 years ago
Hi Namshin,
please verify the fix for the bug you reported, and then change its status to
Closed.  We now require that each fix be verified by someone other than the
developer who made it.

Thanks!

Chris

Original comment by cjlee...@gmail.com on 5 Mar 2009 at 12:21

GoogleCodeExporter commented 8 years ago

Original comment by deepr...@gmail.com on 6 Mar 2009 at 1:56