eftsung / pygr

Automatically exported from code.google.com/p/pygr
0 stars 0 forks source link

tblastn and blastx support #44

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Right now pygr is restricted to 1:1 alignment relations, which works fine
for blastn and blastp, but not tblastn (protein query vs. nucleotide
database translated to protein sequence) or blastx (nucleotide query vs.
protein database).

tblastn and blastx are problematic for several reasons:
- the returned alignment is not of the actual query sequence and database
sequences, but instead of a *translation* (possibly after
reverse-complementing!) of one side or the other.  Thus the alignment
results are NOT in the coordinate system of the query and the database
seqs; instead they involve a new coordinate system (a translation) created
on the fly.

- this involve a 3:1 alignment relation between nucleotide vs. protein
sequence.  This is problematic in all sorts of ways, the most fundamental
of which is how to robustly represent the reading frame "phase" for any
given part of the alignment (i.e. the ability to represent alignment to a
"partial codon", which can easily occur when aligning protein against exons
which may split a single codon across an exon-exon junction.

- I think tblastn/blastx imply the need a separate coordinate system for
this nucleotide vs. protein alignment problem.  For example, what if the
query is a nucleotide sequence and finds a reverse-complement homology to a
protein sequence?  I.e. when the query is reverse-complemented, it has a
translated-homology to the protein sequence.  The result of any alignment
query must always be returned in the same orientation as the user-supplied
query, which means that the homologous protein interval must be returned in
"negative orientation" -- which of course does not exist for a true protein
sequence.

POSSIBLE SOLUTIONS:

I think this would be easy to resolve by using an annotation to represent
the open reading frame on the protein sequence. The key idea is that an
annotation is an independent coordinate system, but can be converted to the
corresponding sequence interval by requesting its sequence attribute.  So
we could have tblastn return 1:1 alignments of nucleotide sequence to an
ORF annotation (whose coordinate system would be expressed in bp, not aa).
 The user would request its sequence attribute to obtain the corresponding
protein sequence interval.  This would work well in both directions (i.e.
tblastn, and blastx).

The ORF annotation idea solves the "intermediate coordinate system" problem
nicely: it is a nucleotide coordinate system (which can correctly represent
either orientation).  But it is bound to the protein sequence that it
represents, and you can always convert a slice of an ORF annotation to the
corresponding slice of protein sequence by simply accessing its "sequence"
attribute.  We could even map such ORF annotations directly onto genomic
sequence.

Original issue reported on code.google.com by cjlee...@gmail.com on 27 Sep 2008 at 1:16

GoogleCodeExporter commented 8 years ago
Another possibility:
- since tblastn and blastx return protein-protein alignments, it might be more
intuitive for users to get back this protein-protein alignment.  One of the two
aligned protein sequences (dest for tblastn, and src for blastx) would be a
translation of the actual database sequence (nucleotide sequence).  We could 
model
this protein sequence as an annotation whose sequence attribute would 
correspond to
the nucleotide sequence interval that encodes it.

- This is pretty easy to implement, because the process_blast() method takes a
database argument for retrieving each sequence ID.  That database could be an
annotation db created specifically for translations.

- Question: how exactly to handle the coordinates returned by BLAST?  Whereas it
returns a protein sequence, it gives nucleotide coordinates...  That implies a 
bit
more work.

Original comment by cjlee...@gmail.com on 18 Dec 2008 at 10:32

GoogleCodeExporter commented 8 years ago
tblastn case is simplest:
- protein query vs. nucleotide db, returned as protein-protein alignments
- for each hit, just consolidate all the subject intervals into a single ORF
annotation (TranslationAnnot), and save the alignment of the query protein to 
slices
of the TranslationAnnot.
- user accesses the alignment results as usual, but gets one added feature: the
aligned protein intervals have a sequence attribute that yields the associated
nucleotide sequence interval from the nucleotide database.

Simple, clean.  

By contrast, blastx poses some hard problems, because we no longer have *one* 
query
sequence as our alignment interface naturally assumes.  The query sequence 
might be
represented by any number of different ORFs, some of which might show up in many
individual blast hits.  It's not clear whether users will want these to be 
reported
as one sequence or different sequences (this is especially tricky if you have
non-overlapping intervals of a common ORF).  Maybe I should just give blastx a
different interface, one that just follows BLAST's interface: a list of hits.  
That
could be quite simple: for each hit just follow the same recipe above of
consolidating into one annotation and reporting aligned slices vs. subject 
database
sequence slices.  Actually tblastx could also be done as a hybrid of the blastx 
+
tblastn solutions I proposed above.

Original comment by cjlee...@gmail.com on 7 Jan 2009 at 4:35

GoogleCodeExporter commented 8 years ago
Titus requested some usage examples, so here goes:

# BASIC USAGE: creating a mapping object for any seqDB
# that you want to search using BLAST
blastmap = BlastMapping(seqDB)
# blastmap acts like a Pygr graph (i.e. dict) 
# that takes any sequence object as a source node (key)
# and maps them to homologous sequence intervals from seqDB
# as target nodes, with edge information about each alignment,
# following the generic pattern blastmap[src][target] = edgeInfo.
# You can use it either in the mapping style:
# myquery can be any Pygr sequence interval object...
for src,target,edge in blastmap[myquery].edges():
    # print the aligned intervals or whatever you want to do...
    print '%s %s\n%s %s\n' %(src,repr(src),target,repr(target))

# or you use it as a callable, so you can pass
# search parameters like expmax etc.
# This also lets you save many search results into
# one NLMSA...
for myquery in lotsaQueries:
    blastmap(myquery, al=myNLMSA)
myNLMSA.build()

Since this uses NLMSA to store the results, any of the NLMSASlice group-by 
operators
can be applied as usual.

BlastMapping determines which flavor of BLAST (blastp or blastn) to run based 
on the
sequence type (nucleotide vs. protein) of the query and database.  Currently, 
tblastn
and blastx won't work correctly, because of the translation issues outlined 
above.  

The new proposal would make tblastn work exactly the same as blastp and blastn 
(i.e.
as in the example code above), but with one extra feature: the aligned target
intervals (which look like protein sequence intervals, even though the target
database is nucleotide) would have a "sequence" attribute which yields the 
associated
nucleotide sequence interval from the target sequence database, e.g. to print 
the
target *nucleotide sequence coordinates* next to the target *protein sequence
string*, you would just change the code trivially:

for src,target,edge in blastmap[myquery].edges():
    # print the aligned intervals or whatever you want to do...
    print '%s %s\n%s %s %2.1f\n' %(src,repr(src),target,
                                   repr(target.sequence),
                                   100.*edge.pIdentity())
This example also prints the percent identity based of the *protein* sequence of
target vs. src.

Explanation: the target sequence interval is not an interval of seqDB of course,
because tblastn translates the subject nucleotide sequence into a protein 
sequence. 
target represents that protein sequence interval; you can slice it, extract its
protein sequence string (by str()) etc. in all the usual ways.  But it is 
actually an
Annotation sequence object, bound to a real sequence object, namely its 
associated
interval of nucleotide sequence from seqDB, which as always with an annotation, 
you
can obtain by requesting the "sequence" attribute.  This takes advantage of the 
fact
that in Pygr an annotation is actually a subclass of sequence (SeqPath), so all 
the
sequence operation methods come along for free...

Original comment by cjlee...@gmail.com on 7 Jan 2009 at 9:49

GoogleCodeExporter commented 8 years ago
implemented BlastxMapping and BlastxResults classes as well as
generate_tblastn_ivals() generator function.

Here's a simple example of blastx usage:

from pygr import seqdb, blast
dna = seqdb.SequenceFileDB('hbb1_mouse.fa')
prot = seqdb.SequenceFileDB('sp_hbb1')
blastmap = blast.BlastxMapping(prot)
results = blastmap[dna['gi|171854975|dbj|AB364477.1|']]
for result in results:
    for src,dest,edge in result.edges():
        print '%s %s\n%s %s\n' %(src,repr(src),dest,repr(dest))

BlastMapping will raise a ValueError if you try to use it for blastx; 
BlastxMapping
will raise a ValueError if you try to use it for anything other than blastx or 
tblastx.

Original comment by cjlee...@gmail.com on 8 Jan 2009 at 8:21

GoogleCodeExporter commented 8 years ago

Original comment by mare...@gmail.com on 21 Feb 2009 at 2:06

GoogleCodeExporter commented 8 years ago
Hi Titus,
would you be willing to be the verifier for this issue (the new tblastn and 
blastx
support)?  There are tests for both functionalities, in blast_test.py.

-- Chris

Original comment by cjlee...@gmail.com on 5 Mar 2009 at 12:03

GoogleCodeExporter commented 8 years ago

Original comment by mare...@gmail.com on 13 Mar 2009 at 1:01

GoogleCodeExporter commented 8 years ago
All good.  Nice work by Chris.

Original comment by the.good...@gmail.com on 2 Sep 2009 at 1:54