Trinotate / Trinotate.github.io

web documentation for Trinotate
48 stars 17 forks source link

prot_id and related columns do not export #67

Open FarrahKhan1400 opened 2 years ago

FarrahKhan1400 commented 2 years ago

Hi Brian,

My annotation report is generated but it lacks protein id and related fields. Below are the fields that are being omitted:

prot_id prot_coords sprot_Top_BLASTP_hit uniref90_BLASTP nr_BLASTP Pfam gene_ontology_BLASTP gene_ontology_Pfam

All information related to transcripts and blastx searches have exported just fine.

I have used trinotate using non-Trinity data. The only thing I can think of is maybe the names of the transcripts and corresponding protein names in the transdecoder file is causing an issue because they are quite long but I'm not sure how.

Below is an example of the gene trans map file:

PB.1|10004_0|path0:0-565(+)|transcript/182163 PB.1.1|10004_0|path0:0-565(+)|transcript/182163 PB.2|10004_0|path2:6-578(+)|transcript/182279 PB.2.1|10004_0|path2:6-578(+)|transcript/182279 PB.3|10005_0|path2:0-1754(+)|transcript/129349 PB.3.1|10005_0|path2:0-1754(+)|transcript/129349 PB.3|10005_0|path2:1-1852(+)|transcript/114907 PB.3.2|10005_0|path2:1-1852(+)|transcript/114907 PB.3|10005_0|path2:6-1781(+)|transcript/124569 PB.3.5|10005_0|path2:6-1781(+)|transcript/124569

Below is an example of the headers from the transdecoder.pep file

PB.10.1|10011_0|path0:0-1195(+)|transcript/162142.p1 GENE.PB.10.1|10011_0|path0:0-1195(+)|transcript/162142~~PB.10.1|10011_0|path0:0-1195(+)|transcript/162142.p1 ORF type:complete len:301 (+),score=56.25 PB.10.1|10011_0|path0:0-1195(+)|transcript/162142:51-953(+) PB.10.2|10011_0|path0:10-1033(+)|transcript/169696.p1 GENE.PB.10.2|10011_0|path0:10-1033(+)|transcript/169696~~PB.10.2|10011_0|path0:10-1033(+)|transcript/169696.p1 ORF type:complete len:301 (+),score=58.37 PB.10.2|10011_0|path0:10-1033(+)|transcript/169696:41-943(+) PB.100.1|10106_0|path1:0-1851(+)|transcript/107029.p1 GENE.PB.100.1|10106_0|path1:0-1851(+)|transcript/107029~~PB.100.1|10106_0|path1:0-1851(+)|transcript/107029.p1 ORF type:complete len:467 (+),score=34.48 PB.100.1|10106_0|path1:0-1851(+)|transcript/107029:369-1769(+)

Please advise if possible

Kind regards, Farrah Khan

brianjohnhaas commented 2 years ago

hi,

I also think the complexity of the fasta headers is throwing off Trinotate when it tries to extract info it needs from the transdecoder protein fasta header.

The gene-trans-map file also looks like it's missing something to distinguish gene from transcript identifiers, but that might not be impacting the protein match reporting issue.

If you can privately share your Trinotate sqlite database and the protein-related input files, I can see if I can make it work. Otherwise, you could try again with simpler identifiers for the initial transcripts fasta file (eg. the fasta header could include just PB.1|10004_0|path0:0-565(+)|transcript/182163 modified to PB_1_10004_0_path0_0_565_transcript_182163 to deal with non-word characters.

You could start by taking your initial transcript file and doing the following:

cat transcripts.fasta | cut -f 1 -d ' ' | perl -lane 's/\W+/_/g; print;'

new.transcripts.fasta

and then do everything with the new.transcripts.fasta

~b

On Thu, Apr 21, 2022 at 8:02 AM FarrahKhan1400 @.***> wrote:

Hi Brian,

My annotation report is generated but it lacks protein id and related fields. Below are the fields that are being omitted:

prot_id prot_coords sprot_Top_BLASTP_hit uniref90_BLASTP nr_BLASTP Pfam gene_ontology_BLASTP gene_ontology_Pfam

All information related to transcripts and blastx searches have exported just fine.

I have used trinotate using non-Trinity data. The only thing I can think of is maybe the names of the transcripts and corresponding protein names in the transdecoder file is causing an issue because they are quite long but I'm not sure how.

Below is an example of the gene trans map file:

PB.1|10004_0|path0:0-565(+)|transcript/182163 PB.1.1|10004_0|path0:0-565(+)|transcript/182163 PB.2|10004_0|path2:6-578(+)|transcript/182279 PB.2.1|10004_0|path2:6-578(+)|transcript/182279 PB.3|10005_0|path2:0-1754(+)|transcript/129349 PB.3.1|10005_0|path2:0-1754(+)|transcript/129349 PB.3|10005_0|path2:1-1852(+)|transcript/114907 PB.3.2|10005_0|path2:1-1852(+)|transcript/114907 PB.3|10005_0|path2:6-1781(+)|transcript/124569 PB.3.5|10005_0|path2:6-1781(+)|transcript/124569

Below is an example of the headers from the transdecoder.pep file

PB.10.1|10011_0|path0:0-1195(+)|transcript/162142.p1 GENE.PB.10.1|10011_0|path0:0-1195(+)|transcript/162142PB.10.1|10011_0|path0:0-1195(+)|transcript/162142.p1 ORF type:complete len:301 (+),score=56.25 PB.10.1|10011_0|path0:0-1195(+)|transcript/162142:51-953(+) PB.10.2|10011_0|path0:10-1033(+)|transcript/169696.p1 GENE.PB.10.2|10011_0|path0:10-1033(+)|transcript/169696PB.10.2|10011_0|path0:10-1033(+)|transcript/169696.p1 ORF type:complete len:301 (+),score=58.37 PB.10.2|10011_0|path0:10-1033(+)|transcript/169696:41-943(+) PB.100.1|10106_0|path1:0-1851(+)|transcript/107029.p1 GENE.PB.100.1|10106_0|path1:0-1851(+)|transcript/107029~~PB.100.1|10106_0|path1:0-1851(+)|transcript/107029.p1 ORF type:complete len:467 (+),score=34.48 PB.100.1|10106_0|path1:0-1851(+)|transcript/107029:369-1769(+)

Please advise if possible

Kind regards, Farrah Khan

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are subscribed to this thread.Message ID: @.***>

--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas

FarrahKhan1400 commented 2 years ago

The gene-trans-map file also looks like it's missing something to distinguish gene from transcript identifiers, but that might not be impacting the protein match reporting issue.

Just a note on the gene-trans-map file in case it is important. See below for the examples. PB.3. is considered the gene and PB.3.1, PB.3.2 and PB.3.5 etc are the isoforms. I have generated the Trinotate.xls file using all the isoforms and will filter for the 'best gene' afterwards.

PB.1|10004_0|path0:0-565(+)|transcript/182163 PB.1.1|10004_0|path0:0-565(+)|transcript/182163 PB.2|10004_0|path2:6-578(+)|transcript/182279 PB.2.1|10004_0|path2:6-578(+)|transcript/182279 PB.3|10005_0|path2:0-1754(+)|transcript/129349 PB.3.1|10005_0|path2:0-1754(+)|transcript/129349 PB.3|10005_0|path2:1-1852(+)|transcript/114907 PB.3.2|10005_0|path2:1-1852(+)|transcript/114907 PB.3|10005_0|path2:6-1781(+)|transcript/124569 PB.3.5|10005_0|path2:6-1781(+)|transcript/124569

If you can privately share your Trinotate sqlite database and the protein-related input files, I can see if I can make it work. Otherwise, you could try again with simpler identifiers for the initial transcripts fasta file (eg. the fasta header could include just PB.1|10004_0|path0:0-565(+)|transcript/182163 modified to PB_1_10004_0_path0_0_565_transcript_182163 to deal with non-word characters.

You could start by taking your initial transcript file and doing the following:

cat transcripts.fasta | cut -f 1 -d ' ' | perl -lane 's/\W+/_/g; print;'

The above code did not work for me, the error came in when using perl. I did not investigate it much as I am not familiar with perl, but using my own version to modify PB.1|10004_0|path0:0-565(+)|transcript/182163 to PB_1_10004_0_path0_0_565_transcript_182163 as you instructed and re-running the pipeline worked just fine.

Thanks for quick response and assistance.

Kind regards, Farrah