eftsung / pygr

Automatically exported from code.google.com/p/pygr

BlastDB has gotten slow due to cache #45

Closed (GoogleCodeExporter closed this issue 8 years ago)

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Create a FASTA file with 50 million sequences in it:

outfile = open('R1', 'w')
for icount in xrange(1, 50000001):  # xrange (Python 2) avoids a 50M-element list; range(1, 50000000) was one short of 50 million
    outfile.write('>' + str(icount) + '\n')
    outfile.write('ACGT\n')
outfile.close()

2. Open that FASTA file (this requires building the BlastDB index too):

from pygr import seqdb
R1 = seqdb.BlastDB('R1')

What is the expected output? What do you see instead?

Opening R1 should be fast, without preloading sequence IDs. But currently
BlastDB loads every sequence ID into memory, so it takes several minutes
just to open the BlastDB. That also hurts the performance of NLMSA.
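As a quick check, here is a minimal timing sketch (not from the original
report; it simply wraps the open call with time.time()):

import time
from pygr import seqdb

start = time.time()
R1 = seqdb.BlastDB('R1')  # open the (already-built) database
print 'open took %.1f sec' % (time.time() - start)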

Please use labels and text to provide additional information.

1. Version as of August 13.

>>> seqdb.BlastDB('R1')
{}

Takes less than 1 sec and returns an empty dict.

2. Version as of Today.

>>> seqdb.BlastDB('R1')
<BlastDBbase 'R1'>

Took several minutes and loaded all indices into memory.

Original issue reported on code.google.com by deepr...@gmail.com on 14 Oct 2008 at 11:22

GoogleCodeExporter commented 8 years ago
It has nothing to do with BlastDB building. In step 2 above, please build the
BlastDB first and then open it with seqdb.BlastDB.
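In other words, a sketch of the intended two-step test (same file name as
above):

from pygr import seqdb
R1 = seqdb.BlastDB('R1')  # first call builds the index files (slow, one-time)
del R1
R1 = seqdb.BlastDB('R1')  # second call just re-opens the index; this is the slow step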

Original comment by deepr...@gmail.com on 14 Oct 2008 at 11:28

GoogleCodeExporter commented 8 years ago
bsddb.btopen('R1.seqlen', 'r')

bsddb.btopen has the problem: whenever I try to open R1.seqlen, it preloads
all indexes into memory.
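A minimal sketch (assuming Python 2, the stdlib bsddb module, and resource
for a rough memory reading) showing where the preload actually happens:

import bsddb
import resource

def rss():
    # high-water memory mark: KB on Linux, bytes on OS X
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

db = bsddb.btopen('R1.seqlen', 'r')
print 'after open:', rss()
it = iter(db)
print 'after iter():', rss()
seqID = it.next()  # the entire index gets pulled into memory here
print 'after first key:', rss()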

Original comment by deepr...@gmail.com on 21 Oct 2008 at 2:26

GoogleCodeExporter commented 8 years ago
Hmm, in my initial test, the time for indexing a file is the same in the
current version and the August 8 version (git commit 11e3814). In both cases,
it took 30 sec (on my MacBook Pro) to index a file of 1 million sequences.

One difference I see between the older version and the new version is that at
the very end of the indexing process, memory usage expands rapidly (from
around 5 MB to at least 35 MB), then quickly drops back down to baseline
(5 MB). In the older version I didn't see any such memory usage surge. If we
extrapolate from 30 MB for 1 million sequences, your case of 50 million
sequences might take 1.5 GB, which could easily send the machine into swap
hell and make the process take much longer than it should. So this seems to
fit with what you reported...

OK. I now understand the problem. The bsddb module btree index is screwing us
over: when you simply ask for an iterator, it apparently loads the entire
index into memory. Just doing the following causes the 30 MB increase in
memory usage I mentioned above:

>>> from pygr import classutil
>>> s2 = classutil.open_shelve('R1.seqlen', 'r')
>>> it = iter(s2)
>>> seqID = it.next()

The memory increase happens when you ask the iterator for the first item, and
the memory isn't released until the iterator is garbage collected.
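A short sketch of the release side (assuming it above is the only remaining
reference to the iterator):

import gc
del it        # drop the last reference to the bsddb iterator
gc.collect()  # in CPython the del alone usually frees it; collect() makes it explicit
# memory now returns to the ~5 MB baseline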

The reason this problem was NOT present in earlier versions of Pygr is that
we used to have a function read_fasta_one_line() that just read the first
sequence line of the FASTA file. BlastDB.set_seqtype() used that function to
read a line of sequence and then infer whether the sequence is protein or
nucleotide.
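For reference, a hedged sketch of what such a helper might look like (an
illustration only, not the actual pygr implementation):

def read_fasta_one_line(filepath):
    'return the first sequence line of a FASTA file, skipping headers'
    ifile = open(filepath)
    try:
        for line in ifile:
            if not line.startswith('>'):  # skip '>' description lines
                return line.strip()
    finally:
        ifile.close()

set_seqtype() can then inspect that single line to guess protein vs.
nucleotide without ever touching the bsddb index.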

When we made seqdb more modular (created the SequenceDB class), I got rid of
read_fasta_one_line() as being too limited (it only works on FASTA format),
and switched to just getting the first sequence via an iterator on the
sequence database. Now we discover that bsddb iterators act more like keys()
(i.e., they read the entire index into memory) than like a true iterator...
They are NOT scalable!!!!
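For comparison, bsddb's btree objects do expose a cursor-style interface
(first()/next()) that fetches one record at a time; a sketch:

import bsddb
db = bsddb.btopen('R1.seqlen', 'r')
key, value = db.first()  # position a cursor at the first record only
# db.next() would advance one record at a time, never loading the rest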

You claim that the older version of Pygr can index a file of 50 million
sequences in 1 sec. I guess that might be possible, but it seems much faster
than I'd expect. Are you sure that you tested indexing of the file, as
opposed to just opening an index that had already been constructed?

Original comment by cjlee...@gmail.com on 21 Oct 2008 at 2:40

GoogleCodeExporter commented 8 years ago
I switched back to using read_fasta_one_line() to avoid using the bsddb
iterator for the initial set_seqtype().

Original comment by cjlee...@gmail.com on 21 Oct 2008 at 3:01

GoogleCodeExporter commented 8 years ago

Original comment by mare...@gmail.com on 21 Feb 2009 at 2:06

GoogleCodeExporter commented 8 years ago
Hi Namshin,
please verify the fix to this bug that you reported, and then change its
status to Closed. We are now requiring that each fix be verified by someone
other than the developer who made the fix.

Thanks!

Chris

Original comment by cjlee...@gmail.com on 5 Mar 2009 at 12:05

GoogleCodeExporter commented 8 years ago

Original comment by deepr...@gmail.com on 6 Mar 2009 at 1:55