Closed GoogleCodeExporter closed 8 years ago
Another possibility:
- since tblastn and blastx return protein-protein alignments, it might be more
intuitive for users to get back this protein-protein alignment. One of the two
aligned protein sequences (dest for tblastn, and src for blastx) would be a
translation of the actual database sequence (nucleotide sequence). We could
model
this protein sequence as an annotation whose sequence attribute would
correspond to
the nucleotide sequence interval that encodes it.
- This is pretty easy to implement, because the process_blast() method takes a
database argument for retrieving each sequence ID. That database could be an
annotation db created specifically for translations.
- Question: how exactly to handle the coordinates returned by BLAST? Whereas it
returns a protein sequence, it gives nucleotide coordinates... That implies a
bit
more work.
Original comment by cjlee...@gmail.com
on 18 Dec 2008 at 10:32
tblastn case is simplest:
- protein query vs. nucleotide db, returned as protein-protein alignments
- for each hit, just consolidate all the subject intervals into a single ORF
annotation (TranslationAnnot), and save the alignment of the query protein to
slices
of the TranslationAnnot.
- user accesses the alignment results as usual, but gets one added feature: the
aligned protein intervals have a sequence attribute that yields the associated
nucleotide sequence interval from the nucleotide database.
Simple, clean.
By contrast, blastx poses some hard problems, because we no longer have *one*
query
sequence as our alignment interface naturally assumes. The query sequence
might be
represented by any number of different ORFs, some of which might show up in many
individual blast hits. It's not clear whether users will want these to be
reported
as one sequence or different sequences (this is especially tricky if you have
non-overlapping intervals of a common ORF). Maybe I should just give blastx a
different interface, one that just follows BLAST's interface: a list of hits.
That
could be quite simple: for each hit just follow the same recipe above of
consolidating into one annotation and reporting aligned slices vs. subject
database
sequence slices. Actually tblastx could also be done as a hybrid of the blastx
+
tblastn solutions I proposed above.
Original comment by cjlee...@gmail.com
on 7 Jan 2009 at 4:35
Titus requested some usage examples, so here goes:
# BASIC USAGE: creating a mapping object for any seqDB
# that you want to search using BLAST
blastmap = BlastMapping(seqDB)
# blastmap acts like a Pygr graph (i.e. dict)
# that takes any sequence object as a source node (key)
# and maps them to homologous sequence intervals from seqDB
# as target nodes, with edge information about each alignment,
# following the generic pattern blastmap[src][target] = edgeInfo.
# You can use it either in the mapping style:
# myquery can be any Pygr sequence interval object...
for src,target,edge in blastmap[myquery].edges():
# print the aligned intervals or whatever you want to do...
print '%s %s\n%s %s\n' %(src,repr(src),target,repr(target))
# or you use it as a callable, so you can pass
# search parameters like expmax etc.
# This also lets you save many search results into
# one NLMSA...
for myquery in lotsaQueries:
blastmap(myquery, al=myNLMSA)
myNLMSA.build()
Since this uses NLMSA to store the results, any of the NLMSASlice group-by
operators
can be applied as usual.
BlastMapping determines which flavor of BLAST (blastp or blastn) to run based
on the
sequence type (nucleotide vs. protein) of the query and database. Currently,
tblastn
and blastx won't work correctly, because of the translation issues outlined
above.
The new proposal would make tblastn work exactly the same as blastp and blastn
(i.e.
as in the example code above), but with one extra feature: the aligned target
intervals (which look like protein sequence intervals, even though the target
database is nucleotide) would have a "sequence" attribute which yields the
associated
nucleotide sequence interval from the target sequence database, e.g. to print
the
target *nucleotide sequence coordinates* next to the target *protein sequence
string*, you would just change the code trivially:
for src,target,edge in blastmap[myquery].edges():
# print the aligned intervals or whatever you want to do...
print '%s %s\n%s %s %2.1f\n' %(src,repr(src),target,
repr(target.sequence),
100.*edge.pIdentity())
This example also prints the percent identity based of the *protein* sequence of
target vs. src.
Explanation: the target sequence interval is not an interval of seqDB of course,
because tblastn translates the subject nucleotide sequence into a protein
sequence.
target represents that protein sequence interval; you can slice it, extract its
protein sequence string (by str()) etc. in all the usual ways. But it is
actually an
Annotation sequence object, bound to a real sequence object, namely its
associated
interval of nucleotide sequence from seqDB, which as always with an annotation,
you
can obtain by requesting the "sequence" attribute. This takes advantage of the
fact
that in Pygr an annotation is actually a subclass of sequence (SeqPath), so all
the
sequence operation methods come along for free...
Original comment by cjlee...@gmail.com
on 7 Jan 2009 at 9:49
implemented BlastxMapping and BlastxResults classes as well as
generate_tblastn_ivals() generator function.
Here's a simple example of blastx usage:
from pygr import seqdb, blast
dna = seqdb.SequenceFileDB('hbb1_mouse.fa')
prot = seqdb.SequenceFileDB('sp_hbb1')
blastmap = blast.BlastxMapping(prot)
results = blastmap[dna['gi|171854975|dbj|AB364477.1|']]
for result in results:
for src,dest,edge in result.edges():
print '%s %s\n%s %s\n' %(src,repr(src),dest,repr(dest))
BlastMapping will raise a ValueError if you try to use it for blastx;
BlastxMapping
will raise a ValueError if you try to use it for anything other than blastx or
tblastx.
Original comment by cjlee...@gmail.com
on 8 Jan 2009 at 8:21
Original comment by mare...@gmail.com
on 21 Feb 2009 at 2:06
Hi Titus,
would you be willing to be the verifier for this issue (the new tblastn and
blastx
support)? There are tests for both functionalities, in blast_test.py.
-- Chris
Original comment by cjlee...@gmail.com
on 5 Mar 2009 at 12:03
Original comment by mare...@gmail.com
on 13 Mar 2009 at 1:01
All good. Nice work by Chris.
Original comment by the.good...@gmail.com
on 2 Sep 2009 at 1:54
Original issue reported on code.google.com by
cjlee...@gmail.com
on 27 Sep 2008 at 1:16