eftsung / pygr

Automatically exported from code.google.com/p/pygr
0 stars 0 forks source link

Multi-sequence blast support #79

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
See Alex's feature request on the discussion group.
http://groups.google.com/group/pygr-dev/browse_thread/thread/f03601c835c6bc57/a6
18b951e00dd7e1#a618b951e00dd7e1

I implemented his multi-sequence blast feature on pygr.blast.BlastMapping.
 All you have to do is pass a queryDB argument (which should be a
dictionary of sequences with IDs, just as Alex proposed) when you call the
BlastMapping object (i.e. the BlastMapping interface is simply that you use
the object as a callable and get back the alignment object as the return
value).  Take a look at pygr/tests/blast_test.py's test_multiblast() for an
example of how to do this.  

Currently, this feature is only available for BlastMapping (which handles
blastn, blastp, and tblastn), but not yet for MegablastMapping or
BlastxMapping (which handles blastx and tblastx).

Original issue reported on code.google.com by cjlee...@gmail.com on 27 Mar 2009 at 4:51

GoogleCodeExporter commented 8 years ago
Debugging now. Having troubles frequent key errors:

Exception in thread Thread-1:
Traceback (most recent call last):
  File
"/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threadin
g.py", line
460, in __bootstrap
    self.run()
  File
"/Users/alexander/inc/pygr_test/pygr/build/lib.macosx-10.5-i386-2.5/pygr/blast.p
y",
line 56, in run
    for seqID, seq in self.seqDB.iteritems():
  File "seqdb.py", line 299, in iteritems
    yield seqID, self[seqID]
  File "seqdb.py", line 314, in __getitem__
    self._weakValueDict[seqID] = s # save in cache.
  File "classutil.py", line 392, in __setitem__
    self.keep_this(v)
  File "classutil.py", line 389, in keep_this
    del self._keepDict[vmin]

Is RecentValueDictionary thread safe?

Original comment by alexande...@gmail.com on 31 Mar 2009 at 12:51

GoogleCodeExporter commented 8 years ago
Hi Alex,
I have no idea whether weak value dictionaries are thread safe, but more 
importantly
I need to see the code you are running to generate this error message.  Please 
post
the script that generates this error.  And what is the error class?  I don't 
see a
normal error class (e.g. "ValueError") in this message.  

The tests in blast_test.py run fine with no threading problem as you reported, 
so
what are you doing differently?  Are you following that example code?  Does 
that test
(cd pygr/tests; python blast_test.py) run fine on your platform?

-- Chris

Original comment by cjlee...@gmail.com on 31 Mar 2009 at 5:38

GoogleCodeExporter commented 8 years ago
Code:

import blast,seqdb

seqDB = seqdb.BlastDB('1.TCA.454Reads.fna')
refDB = seqdb.BlastDB('pNL43.may13.fasta')

blastMap = blast.BlastMapping(refDB)

z=blastMap(None,al=None,queryDB=seqDB)

Where 1.TCA.454Reads.fna has LOTS of sequences and pNL43.may13.fasta has only 
one.

Full error:

Exception in thread Thread-1:
Traceback (most recent call last):
  File
"/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threadin
g.py", line
460, in __bootstrap
    self.run()
  File
"/Users/alexander/inc/pygr_test/pygr/build/lib.macosx-10.5-i386-2.5/pygr/blast.p
y",
line 56, in run
    for seqID, seq in self.seqDB.iteritems():
  File "seqdb.py", line 299, in iteritems
    yield seqID, self[seqID]
  File "seqdb.py", line 314, in __getitem__
    self._weakValueDict[seqID] = s # save in cache.
  File "classutil.py", line 392, in __setitem__
    self.keep_this(v)
  File "classutil.py", line 389, in keep_this
    del self._keepDict[vmin]
KeyError: 004294_0715_3536[0:94]

The exact sequence where the error occurs differs from run to run. 

blast_test.py doesn't run with a strange error:
$ python blast_test.py 
unable to import extension module: cannot import name cnestedlist

I presume that the main difference between blast_test.py and my test is that my
queryDB has many more sequences than the test.

Original comment by alexande...@gmail.com on 31 Mar 2009 at 6:33

GoogleCodeExporter commented 8 years ago
I added 

print threading.currentThread().getName(), repr(vmin) just before
del self._keepDict[vmin]

the result confirms that vmin is being deleted twice. Eg. the following two 
lines are
adjacent in the output
...
Thread-1 003988_0557_2026[0:104]
MainThread 003988_0557_2026[0:104]
...
The KeyError follows...

Original comment by alexande...@gmail.com on 31 Mar 2009 at 6:41

GoogleCodeExporter commented 8 years ago
Hi Alex,
I think you figured it out: RecentValueDictionary doesn't seem to be safe for
multiple threads.  This also explains why the problem only arose with your 
larger
dataset; it   maintains a default cache size of 50 sequences.  Unless you 
process at
least 50 sequences, you won't encounter its cache-clearing mechanism.

Original comment by cjlee...@gmail.com on 31 Mar 2009 at 4:09

GoogleCodeExporter commented 8 years ago
Now this is really strange. I checkout the latest version from git, then run 
python
setup.py build, but the tests won't run. Here's the error

%> python blast_test.py 
unable to import extension module: cannot import name cnestedlist

Any ideas?

This is on Mac OS X 10.5.6.

Original comment by alexande...@gmail.com on 31 Mar 2009 at 6:47

GoogleCodeExporter commented 8 years ago
UPDATE:
When I go to build/lib.macosx-10.5-i386-2.5/pygr directly and do 
import cnestedlist

everything is fine.

Original comment by alexande...@gmail.com on 31 Mar 2009 at 6:49

GoogleCodeExporter commented 8 years ago
UPDATE:

I was able to reproduce the error in the tests by making the test file even 
longer.
Specifically, I replicated the first sequence of sp_all_hbb 1000 times with 
different
ids. The errors occur while at id# 85--700.

Original comment by alexande...@gmail.com on 31 Mar 2009 at 7:11

GoogleCodeExporter commented 8 years ago
try:
  del self._keepDict[vmin]
except KeyError:
  pass

Seems to do the job...

Original comment by alexande...@gmail.com on 31 Mar 2009 at 8:01

GoogleCodeExporter commented 8 years ago
Hi Alex,
just trapping the KeyError doesn't address the broader issue of thread-safety.  
There
are all sorts of ways multiple threads could misbehave in any cache lookup / 
updating
of RecentValueDict or other cache classes.  We should either really make them
thread-safe with locking, or declare explicitly that Pygr 0.8 doesn't support
multi-threading.  See my discussion group posting about this:
http://groups.google.com/group/pygr-dev/msg/9edcfd676b255668

-- Chris

Original comment by cjlee...@gmail.com on 31 Mar 2009 at 8:46

GoogleCodeExporter commented 8 years ago
this clearly needs more work, starting with a decision about whether we want 
Pygr 0.8
to support multi-threading or not.

Original comment by cjlee...@gmail.com on 31 Mar 2009 at 8:48

GoogleCodeExporter commented 8 years ago
OK, I replaced threading with a solution that uses temporary files for 
communication
to / from the subprocess.  This eliminates all deadlock issues with no need for
threading.  For details, see
http://groups.google.com/group/pygr-dev/msg/e0b13465171a6d6d

Alex, please test this on your multi-sequence blast example that was crashing 
before...

Original comment by cjlee...@gmail.com on 2 Apr 2009 at 7:07

GoogleCodeExporter commented 8 years ago
Slower than the threading solution, but works without error.

Original comment by alexande...@gmail.com on 2 Apr 2009 at 7:59