Renaming protein headers with eggnog gff information.

eggnogdb / eggnog-mapper

Fast genome-wide functional annotation through orthology assignment

GNU Affero General Public License v3.0

Hello,

I am using eggNOG-mapper to functionally annotate the proteins of 276 species. After running it, the protein headers in my FASTA files still do not contain any information about the gene.

For example, this is what my protein header looks like for one species:

head Spodoptera_frugiperda.fa

>file_1_file_1_g22553.t1 gene=file_1_file_1_g22553
MNRLGMIVDLSHVGENTTRAAIKLSRAPVVFTHSSVYSLCNHKRNVPDDIIHSLKENGGIIMVNFFPDFVKCAPNATISDVAEHFHYIKRMVGADYVGIGGDFDGVNRVPRGLEDVSRYPELFAELLRSGQWTVQELKNLAGLNMLRVMRQVEKVRDEMRTNGVEPEEHPDSPNDNGNCTSNAFYTEYV

The annotation file produced by eggNOG-mapper contains the following:

head Spodoptera_frugiperda.softmasked.prot.fa.emapper.annotations

file_1_file_1_g22553.t1 13037.EHJ66618 2.39e-121 357.0 COG2355@1|root,KOG4127@2759|Eukaryota,38D9H@33154|Opisthokonta,3BCAM@33208|Metazoa,3CRIG@33213|Bilateria,41U16@6656|Arthropoda,3SJQR@50557|Insecta,4488J@7088|Lepidoptera 6656|Arthropoda O Membrane dipeptidase (Peptidase family M19) - - 3.4.13.19 ko:K01273 - - ko00000,ko00537,ko01000,ko01002,ko04147 - - - Peptidase_M19

How do I replace all the headers in the protein file with the gene names? For the example above, that would mean changing:

gene=file_1_file_1_g22553 to Peptidase_M19

Dear @nitinra,

There are many different ways to do what you want. However, you first need to decide which field you want to include in the header. From your example, it seems that you want the PFAM domain name; is that right?

As for how to do it: if you use Python, you can use Biopython to parse your FASTA file, then parse the annotations file, and then write a new FASTA file with the fields you need added to the headers. Another way, using bash, is to convert your FASTA file into a table with the sequence name as one column, the additional header fields as a second column, and the sequence as a third column. You can then use join to merge the annotations file with that table. The last step is the reverse of the first: convert the merged table back into a FASTA file that includes the annotations you need.
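The parse-then-rewrite idea can be sketched roughly as follows. This is only an illustration under several assumptions, not an official eggnog-mapper recipe: it uses plain standard-library Python instead of Biopython, it assumes the annotations file is tab-separated with a header line beginning with `#query` that names a `PFAMs` column, and it assumes the goal is to swap the `gene=...` token for the chosen annotation value. The function names (`read_fasta`, `load_annotations`, `rename_headers`) are made up for this example.

```python
import io
import sys


def read_fasta(handle):
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def load_annotations(handle, column="PFAMs"):
    """Map query ID -> value of `column`.

    Assumption: column names come from a header line starting with '#query';
    other lines starting with '#' are treated as comments.
    """
    names = None
    mapping = {}
    for line in handle:
        row = line.rstrip("\n").split("\t")
        if not row[0]:
            continue
        if row[0].startswith("#"):
            if row[0] == "#query":
                names = ["query"] + row[1:]
            continue
        if names is None:
            raise ValueError("no '#query' header line found before data rows")
        record = dict(zip(names, row))
        value = record.get(column, "-")
        if value and value != "-":  # '-' marks an empty field
            mapping[record["query"]] = value
    return mapping


def rename_headers(fasta_handle, mapping, out_handle, key="gene"):
    """Write a FASTA where the 'key=...' token is replaced by the annotation.

    Sequences without an annotation keep their original header.
    """
    for header, seq in read_fasta(fasta_handle):
        parts = header.split()
        label = mapping.get(parts[0]) if parts else None
        if label is not None:
            rest = [p for p in parts[1:] if not p.startswith(key + "=")]
            header = " ".join([parts[0], label] + rest)
        out_handle.write(">" + header + "\n" + seq + "\n")


if __name__ == "__main__":
    # Tiny in-memory demo; real use would open the .fa and .annotations files.
    fasta = io.StringIO(">g1.t1 gene=g1\nMNRLGMIV\n")
    annotations = io.StringIO("#query\tPFAMs\ng1.t1\tPeptidase_M19\n")
    rename_headers(fasta, load_annotations(annotations), sys.stdout)
```

Looking the column up by name rather than by position avoids depending on the exact column order, which may differ between eggnog-mapper versions. Note that a field such as `PFAMs` may hold several comma-separated values, which this sketch copies into the header unchanged.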

Just my 2 cents.

Best, Carlos

Renaming protein headers with eggnog gff information. #440