$ TransDecoder.Predict
No transcript file (-t)
NAME
Transdecoder <http://transdecoder.sourceforge.net> - Transcriptome
Protein Prediction
USAGE
Required:
-t <string> transcripts.fasta
Common options:
--retain_long_orfs <int> retain all ORFs found that are equal or longer than these many nucleotides even if no other evidence
marks it as coding (default: 900 bp => 300aa)
--retain_pfam_hits <string> /path/to/pfam_db.hmm to search
using hmmscan (which should be accessible via your PATH setting)
--retain_blastp_hits <string>
Advanced options
--train <string> FASTA file with ORFs to train Markov Mod for protein identification; otherwise
longest non-redundant ORFs used
-T <int> If no --train, top longest ORFs to train Markov Model (hexamer stats) (default: 500)
$
Please make clear what --retain_blastp_hits actually does. Per https://groups.google.com/d/msgid/transdecoder-users/86b76db3-3a4e-438f-9750-c9c0d45baa26%40googlegroups.com it seems a name like --use-blastp hits would be more appropriate, to emphasize that not all hits from the blastp.myfile.outfmt6 file will be used. As you say, only hits for ORFs that would otherwise not be included make it into the results via this option. Still, the documentation does not say how the blastp.myfile.outfmt6 file is filtered.
The page http://transdecoder.github.io is perhaps clearer, but it does NOT say what happens if an ORF is found ab initio on one strand and a blastp hit falls on the opposite strand. I would like a feature so that TransDecoder does not inject a poor blastp-supported protein when there is a longer ORF on the opposite strand. In the past, some "partial proteins" or "unknown proteins" were clearly annotated on the minus strand of a transcript (let's say in the 3'-UTR region, see 'X' below), with some protein-like sequence derived from it. However, there is a clear protein-coding gene on the opposite strand.
I think TransDecoder should accept such blastp matches only if they are on the same strand as, and overlap with, the ab initio predicted ORF found by TransDecoder.LongOrfs. In this way it would inherit information for the true gene and ignore the false match to the mispredicted gene. Above, I crafted START and STOP codons for the mispredicted gene, but let's assume TransDecoder will not find this complete ORF (TCAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXCAT); only the blastp match would claim that something could get translated.
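To make the request concrete, here is a minimal Python sketch of the filter I have in mind. It is not TransDecoder code: the `Region` type, the `accept_blast_region` function, and the coordinates are all hypothetical, and it assumes each blastp-supported region and each ab initio ORF can be reduced to a transcript ID, a 1-based inclusive interval, and a strand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical record for an ORF or a blastp-supported region."""
    transcript: str
    start: int   # 1-based, inclusive, start <= end
    end: int
    strand: str  # "+" or "-"


def overlaps(a: Region, b: Region) -> bool:
    """True if both regions lie on the same transcript and share at least one base."""
    return a.transcript == b.transcript and a.start <= b.end and b.start <= a.end


def accept_blast_region(hit: Region, ab_initio_orfs: list) -> bool:
    """Proposed rule: keep a blastp-supported region only if it is on the same
    strand as, and overlaps, an ab initio ORF on the same transcript."""
    return any(hit.strand == orf.strand and overlaps(hit, orf)
               for orf in ab_initio_orfs)


# Made-up example: one long plus-strand ORF and two blastp-supported regions.
orfs = [Region("tx1", 100, 1300, "+")]
same_strand_hit = Region("tx1", 900, 1350, "+")  # overlaps the true gene
utr_minus_hit = Region("tx1", 1400, 1520, "-")   # minus strand, 3'-UTR region

print(accept_blast_region(same_strand_hit, orfs))
print(accept_blast_region(utr_minus_hit, orfs))
```

With this rule the plus-strand hit is kept and the minus-strand hit in the 3'-UTR region is rejected. An antisense hit that overlaps the true gene's coordinates is rejected as well, because the strands differ.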
The outputs generated above can be leveraged by TransDecoder to ensure that those peptides with blast hits or domain hits are retained in the set of reported likely coding regions. The final coding region predictions will now include both those regions that have sequence characteristics consistent with coding regions in addition to those that have demonstrated blast homology or pfam domain content.
Documentation is now updated. In essence, any ORF with either a pfam or blastp match will be retained in the final output, regardless of its coding score.
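As I read that reply, the retention rule could be sketched roughly as below. This is only an illustration in Python, not the actual TransDecoder implementation: the function names and ORF IDs are hypothetical, and it assumes that every query ID in the first column of the outfmt6 file counts as a hit, with no further filtering by e-value or identity, which the thread does not confirm.

```python
import io


def blastp_query_ids(handle):
    """Collect query IDs (column 1) from BLAST tabular (-outfmt 6) output."""
    ids = set()
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        ids.add(line.split("\t")[0])
    return ids


def is_retained(orf_id, passes_coding_score, pfam_ids, blastp_ids):
    """An ORF is kept if it passes the coding score OR has a pfam or blastp hit."""
    return passes_coding_score or orf_id in pfam_ids or orf_id in blastp_ids


# Made-up single-line outfmt6 record; "orf2" is a hypothetical ORF ID.
example = io.StringIO(
    "orf2\tsp|P12345|X\t45.0\t120\t60\t2\t1\t120\t5\t124\t1e-20\t95.0\n"
)
blastp_ids = blastp_query_ids(example)

print(is_retained("orf2", False, set(), blastp_ids))  # rescued by its blastp hit
print(is_retained("orf3", False, set(), blastp_ids))  # no evidence, dropped
```

Note that nothing in this rule looks at the strand of the ORF or at other ORFs on the same transcript, which is exactly the case the original question is about.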