eftsung / pygr

Automatically exported from code.google.com/p/pygr

Refactor BLAST support as a mapping #34

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Currently BLAST support is provided via two methods on BlastDB: blast() and megablast().

In the spirit of generalizing this to follow the "Pygr pattern", this
should instead be a mapping, specifically a Pygr graph interface that takes
sequences as source nodes and returns homologous sequences as destination nodes.

BlastDB can keep its blast() and megablast() methods, for purposes of
backwards compatibility.

Original issue reported on code.google.com by cjlee...@gmail.com on 11 Sep 2008 at 4:10

GoogleCodeExporter commented 8 years ago
Proposal: treat BLAST functionality as a mapping, just like any other Pygr mapping.
That means a graph-like interface, i.e.

for target in blastmap[myquery]:
    do something...

or 

for src,target,edge in blastmap[myquery].edges():
    do something...

Seems like this would be adequate for most uses...
This usage implies that blastdb.__getitem__() returns not an
NLMSA but an NLMSASlice.  If we go this route, we could
easily add a property to the NLMSASlice that gets the original
NLMSA, allowing the user to requery with subintervals of the original
query sequence.  Seems like a reasonable idea.
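
A minimal sketch of that interface in plain Python, not pygr itself. The names `ToyBlastMapping` and `ToySlice` are hypothetical, shared k-mers stand in for a real BLAST search, and `parent` is an assumed name for the proposed back-reference (here it points at the mapping rather than at an NLMSA).

```python
class ToySlice:
    """Hypothetical stand-in for an NLMSASlice: the hits for one query."""

    def __init__(self, parent, query, hits):
        self.parent = parent  # assumed back-reference, usable for requerying
        self.query = query
        self._hits = hits     # list of (target_id, edge_info) pairs

    def __iter__(self):
        # Mapping style: iterating yields the destination nodes.
        return (target for target, _ in self._hits)

    def edges(self):
        # Graph style: yield (source, target, edge) triples.
        for target, edge in self._hits:
            yield self.query, target, edge


class ToyBlastMapping:
    """Toy mapping of <any sequence> -> targets in seq_db (no real BLAST)."""

    def __init__(self, seq_db, k=4):
        self.seq_db = dict(seq_db)
        self.k = k

    def _kmers(self, seq):
        return {seq[i:i + self.k] for i in range(len(seq) - self.k + 1)}

    def __getitem__(self, query):
        query_kmers = self._kmers(query)
        hits = []
        for target_id, target_seq in sorted(self.seq_db.items()):
            shared = len(query_kmers & self._kmers(target_seq))
            if shared:
                hits.append((target_id, {"shared_kmers": shared}))
        return ToySlice(self, query, hits)


seq_db = {"a": "ACGTACGTAA", "b": "TTTTTTTT", "c": "GGACGTGG"}
blastmap = ToyBlastMapping(seq_db)
myquery = "ACGTAC"

for src, target, edge in blastmap[myquery].edges():
    print(src, target, edge)

# Requery with a subinterval of the original query via the back-reference.
sub_hits = list(blastmap[myquery].parent[myquery[2:6]])
```

The point of the back-reference is that a caller holding only the slice can still reach the object that produced it, so a narrower query does not need a separate handle.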

If you want to use BlastDB in the usual way, e.g. to store many
different blast results in one NLMSA, just use the blast() method
in the old way.  That preserves flexibility.

The major change to the BlastDB class is that it becomes a wrapper
instead of a subclass of SequenceFileDB.  Its __getitem__() stops
acting as a database id lookup, and turns into a mapping of a query
sequence to its homologs as outlined above.

This raises an interesting schema representation issue: this doesn't
conform to a strict sourceDB -> targetDB schema.  Instead it maps
<any sequence> -> targetDB.  How will we represent that in pygr.Data
schema?

Interface change: if you have a sequence DB you create a mapping
object by constructing the appropriate mapping, e.g.

blastmap = BlastMapping(seqDB)

Now you use it either in the mapping style:

for src,target,edge in blastmap[myquery].edges():
    do something...

or you use it as a callable...

for myquery in lotsaQueries:
    blastmap(myquery, al=myNLMSA)
myNLMSA.build()

Seems pretty straightforward.
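
To make the two calling styles concrete, here is a toy sketch in plain Python rather than pygr. `ToyAlignment` and `ToyBlastMapping` are hypothetical names, exact substring search stands in for BLAST, and building the alignment automatically when no `al` is passed is an assumption of this sketch.

```python
class ToyAlignment:
    """Hypothetical stand-in for an NLMSA: filled first, indexed by build()."""

    def __init__(self):
        self._pending = []  # (query, target_id, offset) rows awaiting build()
        self.index = None   # populated by build()

    def add(self, query, target_id, offset):
        self._pending.append((query, target_id, offset))

    def build(self):
        index = {}
        for query, target_id, offset in self._pending:
            index.setdefault(query, []).append((target_id, offset))
        self.index = index


class ToyBlastMapping:
    def __init__(self, seq_db):
        self.seq_db = dict(seq_db)

    def __call__(self, query, al=None):
        # Callable style: append hits to a caller-supplied alignment so that
        # many queries can share one container; the caller runs build().
        own = al is None
        if own:
            al = ToyAlignment()
        for target_id, seq in sorted(self.seq_db.items()):
            offset = seq.find(query)
            if offset >= 0:
                al.add(query, target_id, offset)
        if own:
            al.build()  # assumption: a private alignment is built right away
        return al

    def __getitem__(self, query):
        # Mapping style: one query in, its hits out.
        return self(query).index.get(query, [])


seq_db = {"t1": "AAACGTAA", "t2": "CCCC", "t3": "ACGT"}
blastmap = ToyBlastMapping(seq_db)

my_alignment = ToyAlignment()
for myquery in ["ACGT", "CC"]:
    blastmap(myquery, al=my_alignment)
my_alignment.build()
```

Deferring `build()` to the caller is what lets one alignment hold the results of many queries.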

Original comment by cjlee...@gmail.com on 15 Dec 2008 at 9:41

GoogleCodeExporter commented 8 years ago
Implemented this by pulling blast functionality out into a new module, pygr.blast.
Updated the docs to reflect this.  Key points:

- SequenceFileDB is the new standard database class to use
- BlastDB is deprecated; use SequenceFileDB instead.  It is still provided, for
backwards compatibility with old code and data stored in pygr.Data.  It is just a
subclass of SequenceFileDB that provides some convenience methods for the
old-style blast interface.
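
The backwards-compatibility arrangement described above can be sketched as follows. This is a generic illustration, not the pygr source: all class names are hypothetical, substring matching stands in for BLAST, and emitting a `DeprecationWarning` is an assumption of this sketch rather than something the comment states.

```python
import warnings


class ToySequenceDB:
    """Hypothetical stand-in for the standard database class (ID lookup only)."""

    def __init__(self, seqs):
        self.seqs = dict(seqs)

    def __getitem__(self, seq_id):
        return self.seqs[seq_id]


class ToyBlastMapping:
    """Hypothetical stand-in for the new mapping object built around a database."""

    def __init__(self, seq_db):
        self.seq_db = seq_db

    def __call__(self, query):
        # Substring search instead of BLAST, to keep the sketch runnable.
        return sorted(sid for sid, seq in self.seq_db.seqs.items() if query in seq)


class ToyLegacyBlastDB(ToySequenceDB):
    """Old-style class kept as a thin subclass with a convenience method."""

    def blast(self, query):
        warnings.warn(
            "blast() is the old-style interface; build a mapping instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return ToyBlastMapping(self)(query)


legacy_db = ToyLegacyBlastDB({"x": "ACGT", "y": "GGGG"})
```

Old code that calls `legacy_db.blast(...)` keeps working, while ID lookup via `legacy_db["y"]` is inherited unchanged from the base class.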

Added one new test to the test suite that tests the new BlastMapping interface.

Original comment by cjlee...@gmail.com on 17 Dec 2008 at 9:49

GoogleCodeExporter commented 8 years ago

Original comment by mare...@gmail.com on 21 Feb 2009 at 2:05

GoogleCodeExporter commented 8 years ago
Hi Titus,
could you verify the BlastMapping and BlastxMapping functionality?  We have tests for
both in blast_test.py.  (I wrote the original tests and Istvan rewrote them.)  If you
think these tests are adequate, please close this issue.

Thanks!

Chris

Original comment by cjlee...@gmail.com on 4 Mar 2009 at 11:49

GoogleCodeExporter commented 8 years ago

Original comment by mare...@gmail.com on 13 Mar 2009 at 12:57

GoogleCodeExporter commented 8 years ago

Original comment by mare...@gmail.com on 13 Mar 2009 at 12:58

GoogleCodeExporter commented 8 years ago
> If we go this route, we could
> easily add a property to the NLMSASlice that gets the original
> NLMSA, allowing the user to requery with subintervals of the original
> query sequence.  Seems like a reasonable idea.

I have been looking for this feature for quite some time.  It makes life easier.  However,
in my opinion the requerying should eventually happen at the level of NLMSASlice.  Also, one
thing I don't get is why NLMSA and NLMSASlice have different interfaces.  Intuitively
they are the same thing: collections of interval tuples.  This is especially true for
directional NLMSAs, but could also be envisaged for undirectional ones.

BTW, requerying the NLMSASlice is a kind of JOIN query, e.g.
sl1 = msa[s1[a:b]] 
sl2 = sl1[s1[c:b]]

could be the same as:
overlap(msa[s1[a:b]], msa[s1[c:b]])

One could also think of doing
sl1 = msa[s1[a:b]]
sl2 = sl1[s2[c:b]]

as
overlap(msa[s1[a:b]], msa[s2[c:b]])
Note that here we are trying to JOIN two NLMSASlices created via queries of totally
different things.
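
One possible reading of `overlap()`, sketched in plain Python. The comment does not define its semantics, so this is an assumption: each slice is modelled as a collection of `(target_id, start, stop)` interval tuples with half-open coordinates, and the join keeps the regions of the same target that both slices cover.

```python
def overlap(slice_a, slice_b):
    """Join two collections of (target_id, start, stop) interval tuples.

    Returns the pairwise intersections of intervals that lie on the same
    target. Coordinates are assumed half-open, i.e. [start, stop).
    """
    result = []
    for id_a, start_a, stop_a in slice_a:
        for id_b, start_b, stop_b in slice_b:
            if id_a != id_b:
                continue  # intervals on different targets never overlap
            start, stop = max(start_a, start_b), min(stop_a, stop_b)
            if start < stop:
                result.append((id_a, start, stop))
    return sorted(result)


# Two toy "slices", as if produced by two unrelated queries.
hits_1 = [("chr1", 0, 100), ("chr2", 50, 80)]
hits_2 = [("chr1", 60, 150), ("chr2", 90, 120)]
joined = overlap(hits_1, hits_2)
```

The nested loop is quadratic in the number of intervals, which is fine for illustration; an interval index would be the natural replacement for large slices.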

Original comment by alexande...@gmail.com on 31 Mar 2009 at 8:21

GoogleCodeExporter commented 8 years ago
I approve of this patch.  We should repost Alex's comments as another issue, though.

Original comment by the.good...@gmail.com on 7 Sep 2009 at 12:16

GoogleCodeExporter commented 8 years ago
Also see issues #44 and #40.

Original comment by the.good...@gmail.com on 7 Sep 2009 at 12:18