TransDecoder / TransDecoder


Improve documentation describing --retain_blastp_hits #15

Closed mmokrejs closed 8 years ago

mmokrejs commented 8 years ago
$ TransDecoder.Predict 
No transcript file (-t)

NAME
    Transdecoder <http://transdecoder.sourceforge.net> - Transcriptome
    Protein Prediction

USAGE
    Required:

     -t <string>                            transcripts.fasta

    Common options:

     --retain_long_orfs <int>               retain all ORFs found that are equal or longer than these many nucleotides even if no other evidence 
                                             marks it as coding (default: 900 bp => 300aa)

     --retain_pfam_hits <string>                 /path/to/pfam_db.hmm to search 
                                            using hmmscan (which should be accessible via your PATH setting)

     --retain_blastp_hits <string>

    Advanced options

     --train <string>                       FASTA file with ORFs to train Markov Mod for protein identification; otherwise 
                                            longest non-redundant ORFs used

     -T <int>                               If no --train, top longest ORFs to train Markov Model (hexamer stats) (default: 500)

$

Please make it clear what --retain_blastp_hits really does. Per https://groups.google.com/d/msgid/transdecoder-users/86b76db3-3a4e-438f-9750-c9c0d45baa26%40googlegroups.com, it seems that --use-blastp hits would be a more appropriate name, since it would emphasize that not all hits from the blastp.myfile.outfmt6 file will be used. As you say, only those hits that would otherwise not be included make it into the results via this option. Still, the help text does not say how the blastp.myfile.outfmt6 file is filtered.

The page http://transdecoder.github.io is maybe clearer, but it does NOT say what happens if an ORF is found ab initio on one strand and a blastp hit falls on the opposite strand. I want a feature so that TransDecoder does not inject a bad blastp-predicted protein if there is a longer ORF on the opposite strand. In the past, some "partial proteins" or "unknown proteins" were clearly annotated on the minus strand of a transcript (let's say in the 3'-UTR region, see 'X' below), with some protein-like sequence derived from it. However, there is a clear protein-coding gene on the opposite strand.

                 true gene                                     mispredicted
5'-ATGCAGCGACTGTGCGTCAAACGCGACCTGTGANNNNNNNNNTCAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXCATNN-3'

I think TransDecoder should accept such blastp matches only if they are on the same strand as, and overlap with, the ab initio ORF found by TransDecoder.LongOrfs. That way it would inherit information for the true gene and ignore the false match to the mispredicted gene. In the diagram above I gave the mispredicted gene its own START and STOP codons, but let's assume TransDecoder will not find this complete ORF (TCAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXCAT) on its own; only the blastp match would claim that something could get translated.
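To make the proposed rule concrete, here is a minimal Python sketch of the filter I have in mind. This is not TransDecoder code: the `Orf` type, the function names, and the coordinates are all hypothetical and only loosely follow the diagram above. It assumes ORF coordinates are 1-based and inclusive on the transcript, with start <= end regardless of strand.

```python
from typing import List, NamedTuple


class Orf(NamedTuple):
    """Hypothetical ORF record; coordinates are 1-based, inclusive, start <= end."""
    transcript: str
    start: int
    end: int
    strand: str  # '+' or '-'


def overlaps(a: Orf, b: Orf) -> bool:
    """True if both ORFs lie on the same transcript and share at least one base."""
    return a.transcript == b.transcript and a.start <= b.end and b.start <= a.end


def accept_blastp_orf(hit_orf: Orf, ab_initio_orfs: List[Orf]) -> bool:
    """Proposed rule: keep a blastp-supported ORF only if some ab initio ORF
    is on the same strand and overlaps it."""
    return any(
        hit_orf.strand == orf.strand and overlaps(hit_orf, orf)
        for orf in ab_initio_orfs
    )


# Illustrative coordinates only (made up for this sketch).
true_gene = Orf("tx1", 1, 33, "+")         # ab initio ORF on the plus strand
same_strand_hit = Orf("tx1", 4, 30, "+")   # blastp-supported ORF inside the true gene
mispredicted = Orf("tx1", 43, 87, "-")     # blastp-supported ORF on the minus strand

print(accept_blastp_orf(same_strand_hit, [true_gene]))
print(accept_blastp_orf(mispredicted, [true_gene]))
```

With this rule the hit inside the true gene would be kept, while the minus-strand hit in the 3'-UTR would be dropped because no ab initio ORF on that strand overlaps it.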

http://transdecoder.github.io

The outputs generated above can be leveraged by TransDecoder to ensure that those peptides with blast hits or domain hits are retained in the set of reported likely coding regions. The final coding region predictions will now include both those regions that have sequence characteristics consistent with coding regions in addition to those that have demonstrated blast homology or pfam domain content.

brianjohnhaas commented 8 years ago

Documentation is now updated. In essence, any ORF with either a pfam or blastp match will be retained in the final output, regardless of its coding score.
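As a rough illustration of that retention rule (a sketch only, not the actual TransDecoder implementation; the function name, the ORF identifiers, and the set-based inputs are assumptions made for this example):

```python
from typing import Iterable, List, Set


def select_final_orfs(
    candidate_orfs: Iterable[str],
    passes_coding_score: Set[str],
    pfam_hit_ids: Set[str],
    blastp_hit_ids: Set[str],
) -> List[str]:
    """Keep an ORF if it passes the coding-score filter OR has a pfam or
    blastp match, regardless of its coding score."""
    return [
        orf_id
        for orf_id in candidate_orfs
        if orf_id in passes_coding_score
        or orf_id in pfam_hit_ids
        or orf_id in blastp_hit_ids
    ]


# Hypothetical ORF identifiers for illustration.
candidates = ["orf1", "orf2", "orf3", "orf4"]
final = select_final_orfs(
    candidates,
    passes_coding_score={"orf1"},
    pfam_hit_ids={"orf2"},
    blastp_hit_ids={"orf3"},
)
print(final)
```

Here orf2 and orf3 are retained only because of their homology evidence, and orf4 is dropped because it has neither a passing score nor a hit.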