Extracting GO/KEGG (other functional annotations) using OGs

Hi there,

I'm trying to put together a functional database (for ease of use) for about 24K pangenomes (gene sets). I ran emapper and now have the annotation tables for each gene set. Now I want to turn them into a database (for starters, let's say a CSV file). Here are my questions/doubts:

I want to associate the eggNOG OGs with other functional annotations such as GO/KEGG terms. I use emapper version 2.0.1 with a DIAMOND-based translated search and, unfortunately, I do not have a best OG column in the file [#175]. I know that the final OG is related to the highest taxa level in the homology search [#42 #181 #146]. Can you help me sort the OGs column in order to pick the actual (final) annotated OG used for the gene?
After picking an OG, I want to fall back on a table to look up its GO/KEGG terms. In your latest database I do not find anything of the sort. Can you tell me how I can retrieve the GO/KEGG terms related to an OG?
I want to run an individual HMMER search against translated reads, but I also want to use the eggNOG HMMER database (so that I can use the OGs later). Here [http://eggnog5.embl.de/#/app/downloads] I see that I have to manually download HMM databases for each taxa level for each kingdom. That is a lot of files [#287 #261]. Is there an easy way to download all the HMM databases so that I can concatenate them for my search?

Hi @shubavarshini ,

I want to associate the eggNOG OGs with other functional annotations such as GO/KEGG terms. I use emapper version 2.0.1 with a DIAMOND-based translated search and, unfortunately, I do not have a best OG column in the file [#175]. I know that the final OG is related to the highest taxa level in the homology search [#42 #181 #146]. Can you help me sort the OGs column in order to pick the actual (final) annotated OG used for the gene?

I am afraid there is no easy way to extract the "best OG" in version 2.0.1. Note that it is not always the OG with the highest taxa level, but the one with the highest taxa level which matches one of the taxa levels in the taxonomic scope you are using (auto, probably). My suggestion would be that you pick OGs at a tax level that you consider relevant for your data (Bacteria, for example).
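A minimal sketch of the suggestion above (picking the OG at one chosen tax level), in Python. It assumes the OGs column holds comma-separated entries of the form `OG@taxid`, optionally followed by `|name`; check a few rows of your own annotation files before relying on this, since the layout may differ between emapper versions. The function name and the example string are hypothetical. Taxid 2 is Bacteria in the NCBI taxonomy.

```python
from typing import Optional


def pick_og_at_level(ogs_field: str, taxid: str) -> Optional[str]:
    """Return the first OG whose tax level equals `taxid`, or None if absent.

    Assumes entries look like "COG0583@2" or "COG0583@2|Bacteria".
    """
    for entry in ogs_field.split(","):
        entry = entry.strip()
        if "@" not in entry:
            continue  # skip empty or malformed entries
        og, level = entry.split("@", 1)
        # Tolerate an optional "|name" suffix after the taxid.
        level_id = level.split("|", 1)[0]
        if level_id == taxid:
            return og
    return None


# Hypothetical value of the OGs column for one gene.
example = "COG0583@1,COG0583@2,1MU6W@1224,1RMHE@1236"
print(pick_og_at_level(example, "2"))  # OG at the Bacteria level
```

Genes with no OG at the chosen level return `None`, so they can be counted or handled separately when building the CSV.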

After picking an OG, I want to fall back on a table to look up its GO/KEGG terms. In your latest database I do not find anything of the sort. Can you tell me how I can retrieve the GO/KEGG terms related to an OG?

Lists of GO and KEGG terms are reported in the annotation output for the "best OG". If you want to retrieve all the GO/KEGG terms within an OG, you may need to query the database yourself, which may not be simple, since you would first need to retrieve all the proteins (including orthologs, paralogs, etc.?) for each OG, and then the annotations for those proteins. Of course, you could just query your OG at http://eggnogdb.embl.de/, but it is not designed for large datasets, if I am not mistaken.

I want to run an individual HMMER search against translated reads, but I also want to use the eggNOG HMMER database (so that I can use the OGs later). Here [http://eggnog5.embl.de/#/app/downloads] I see that I have to manually download HMM databases for each taxa level for each kingdom. That is a lot of files [#287 #261]. Is there an easy way to download all the HMM databases so that I can concatenate them for my search?

You may use version 2.1.2 to download the HMMER databases with download_eggnog_data.py -H -d taxid (https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.2#Setup), but for all the kingdoms it is going to take time to download and set up the databases.

I hope this helps.

Best, Carlos

Thank you Carlos @Cantalapiedra for the reply. I still have more questions. I tried to look it up in your papers and online but I'm still clueless.

I want to associate the eggNOG OGs with other functional annotations such as GO/KEGG terms. I use emapper version 2.0.1 with a DIAMOND-based translated search and, unfortunately, I do not have a best OG column in the file [#175]. I know that the final OG is related to the highest taxa level in the homology search [#42 #181 #146]. Can you help me sort the OGs column in order to pick the actual (final) annotated OG used for the gene?

I am afraid there is no easy way to extract the "best OG" in version 2.0.1. Note that it is not always the OG with the highest taxa level, but the one with the highest taxa level which matches one of the taxa levels in the taxonomic scope you are using (auto, probably). My suggestion would be that you pick OGs at a tax level that you consider relevant for your data (Bacteria, for example).

So will I get the "best OG" using mmseqs? Do I get it in later versions of emapper? I tried to re-run a sample set of genes on the online emapper and I still did not get the best OG column. It was all NA|NA|NA. I also tried running version 2.1.2 (latest developed) with mmseqs and I kept getting the error "Error running 'mmseqs search': - Aminoacid". It looks like it is unable to process "-" as an amino acid. Here are the commands I use:

inFasta=pan_genome_reference.fa
dbPath=./databases/emapper/emapperdb-5.0.1
mmseq_db=.//databases/emapper/emapperdb-5.0.1/mmseqs

emapper.py --cpu 10 -i $inFasta --itype CDS --translate --data_dir $dbPath --output ${outPrefix}_mmseq --output_dir $outDir -m mmseqs --no_annot --no_file_comments --mmseqs_db $mmseq_db
emapper.py --annotate_hits_table ${outDir}/${outPrefix}_mmseq.emapper.seed_orthologs --data_dir $dbPath -m no_search --no_file_comments -o ${outPrefix}_mmseq --output_dir $outDir --cpu 10 --dbmem

After picking an OG, I want to fall back on a table to look up its GO/KEGG terms. In your latest database I do not find anything of the sort. Can you tell me how I can retrieve the GO/KEGG terms related to an OG?

Lists of GO and KEGG terms are reported in the annotation output for the "best OG". If you want to retrieve all the GO/KEGG terms within an OG, you may need to query the database yourself, which may not be simple, since you would first need to retrieve all the proteins (including orthologs, paralogs, etc.?) for each OG, and then the annotations for those proteins. Of course, you could just query your OG at http://eggnogdb.embl.de/, but it is not designed for large datasets, if I am not mistaken.

How do I retrieve all the proteins (including orthologs, paralogs, etc?) for each OG, and then the annotation for those proteins? From the eggnog.db? How does the emapper retrieve it?

So will I get the "best OG" using mmseqs? Do I get it in later versions of emapper? I tried to re-run a sample set of genes on the online emapper and I still did not get the best OG column. It was all NA|NA|NA. I also tried running version 2.1.2 (latest developed) with mmseqs and I kept getting the error "Error running 'mmseqs search': - Aminoacid". It looks like it is unable to process "-" as an amino acid. Here are the commands I use:

inFasta=pan_genome_reference.fa
dbPath=./databases/emapper/emapperdb-5.0.1
mmseq_db=.//databases/emapper/emapperdb-5.0.1/mmseqs

emapper.py --cpu 10 -i $inFasta --itype CDS --translate --data_dir $dbPath --output ${outPrefix}_mmseq --output_dir $outDir -m mmseqs --no_annot --no_file_comments --mmseqs_db $mmseq_db
emapper.py --annotate_hits_table ${outDir}/${outPrefix}_mmseq.emapper.seed_orthologs --data_dir $dbPath -m no_search --no_file_comments -o ${outPrefix}_mmseq --output_dir $outDir --cpu 10 --dbmem

There is no need to use mmseqs specifically; any emapper version >= 2.1.2 will do. The web version was based on 2.0.1 until a few days ago, so at present you should get the best OG from the web. Regarding the MMseqs error, you should probably remove the "-" characters from the protein sequences. I guess those sequences come from an alignment? If you doubt whether mmseqs can process "-", you could run the mmseqs command reported by emapper and see the error message in more detail.
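One way to remove the "-" characters before running emapper is sketched below in Python. It assumes a plain FASTA file in which header lines start with ">" and every other line is sequence; the function name and the output file name in the usage note are hypothetical.

```python
def strip_gaps(in_path: str, out_path: str, gap_chars: str = "-") -> int:
    """Copy a FASTA file, deleting gap characters from sequence lines.

    Returns the number of sequence lines that contained at least one gap.
    """
    table = str.maketrans("", "", gap_chars)  # maps each gap char to "deleted"
    changed = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                dst.write(line)  # header lines are copied untouched
                continue
            cleaned = line.translate(table)
            if cleaned != line:
                changed += 1
            if cleaned.strip():  # skip lines that are empty after cleaning
                dst.write(cleaned)
    return changed
```

For example, `strip_gaps("pan_genome_reference.fa", "pan_genome_reference.nogaps.fa")` would write a gap-free copy and report how many lines were affected. If the input is CDS that emapper translates, it is worth checking that removing gaps does not shift the reading frame of the affected sequences.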

After picking an OG, I want to fall back on a table to look up its GO/KEGG terms. In your latest database I do not find anything of the sort. Can you tell me how I can retrieve the GO/KEGG terms related to an OG?

Lists of GO and KEGG terms are reported in the annotation output for the "best OG". If you want to retrieve all the GO/KEGG terms within an OG, you may need to query the database yourself, which may not be simple, since you would first need to retrieve all the proteins (including orthologs, paralogs, etc.?) for each OG, and then the annotations for those proteins. Of course, you could just query your OG at http://eggnogdb.embl.de/, but it is not designed for large datasets, if I am not mistaken.

How do I retrieve all the proteins (including orthologs, paralogs, etc?) for each OG, and then the annotation for those proteins? From the eggnog.db? How does the emapper retrieve it?

It is not exactly simple. If you really want to see this in detail, the best option is to look at the code. In brief, emapper looks for evolutionary events which are identified as speciation events, and retrieves the orthologous proteins from those events.
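As a starting point for exploring the database alongside the code, the Python sketch below lists the tables and columns of a SQLite file. It assumes eggnog.db can be opened with Python's sqlite3 module and makes no claim about which tables it contains; the function name is hypothetical. The file is opened read-only so that it cannot be modified by accident.

```python
import sqlite3
from typing import Dict, List


def list_schema(db_path: str) -> Dict[str, List[str]]:
    """Map each table name in a SQLite file to the list of its column names."""
    schema: Dict[str, List[str]] = {}
    # mode=ro opens the file read-only and fails if it does not exist.
    con = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        tables = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        for (table,) in tables:
            # PRAGMA table_info returns one row per column; index 1 is the name.
            cols = con.execute(f'PRAGMA table_info("{table}")').fetchall()
            schema[table] = [col[1] for col in cols]
    finally:
        con.close()
    return schema
```

Calling `list_schema` on the path to your eggnog.db and printing the result shows which tables and columns are available before writing any OG-to-protein queries.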

Best, Carlos

Closing this. Please reopen or open a new issue if needed.

eggnogdb / eggnog-mapper

Extracting GO/KEGG (other functional annotations) using OGs #300