eggnogdb / eggnog-mapper

Fast genome-wide functional annotation through orthology assignment
http://eggnog-mapper.embl.de
GNU Affero General Public License v3.0

Confusing GO annotations #200

Closed jamesabbott closed 4 years ago

jamesabbott commented 4 years ago

I've been using eggnog_mapper to carry out functional annotation of predicted metagenomic proteins, which are primarily prokaryotic in origin. However, a number of the terms which my analysis identifies as being enriched in particular conditions are clearly from higher eukaryotes.

As an example, a protein which is identified as fliC (Flagellin) has the GO terms from the eggnog entry http://eggnog5.embl.de/#/app/results#COG1344_datamenu, all of which make sense. It is also annotated as GO:0035681 (toll-like receptor 15 signaling pathway), which is decidedly eukaryotic and not referenced in the above eggnog entry. As far as I can see, this term is being introduced because there is evidence of an interaction between bacterial flagella and the TOLL signalling pathway, but the term does not apply directly to the annotated protein. As a result of this kind of annotation, I am finding 'enrichment' of various eukaryotic receptor pathways (TOLL, WNT, ERBB) in my metagenomic samples.

Is there any way to restrict the terms reported to those applied directly to the protein function (i.e. in this case, those from COG1344), and not include those which seem to be inferred due to interaction?

Many thanks, James
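As a stop-gap while the source of the term is tracked down, unwanted GO IDs can be stripped from the annotation output before running the enrichment analysis. The following is a minimal sketch, not part of eggnog-mapper: the helper name `filter_go_terms` and the `EXCLUDED` set are hypothetical, and it assumes the GO column is a comma-separated list of IDs as in the output quoted in this thread.

```python
# Hypothetical post-processing helper; eggnog-mapper itself does not provide it.
def filter_go_terms(go_field, excluded):
    """Return the comma-separated GO field with the excluded IDs removed."""
    kept = [term for term in go_field.split(",") if term and term not in excluded]
    return ",".join(kept)


# Which terms to drop is the analyst's decision; GO:0035681 is the
# example term discussed in this issue.
EXCLUDED = {"GO:0035681"}

# Shortened example field, not a full annotation line.
example_field = "GO:0009288,GO:0035681,GO:0044464"
print(filter_go_terms(example_field, EXCLUDED))
```

This only hides individual terms that have already been identified by hand; it does not address where the terms come from.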

jhcepas commented 4 years ago

Hi James, could you share a couple of query sequences that lead to this problem? That would help us debug it.


jamesabbott commented 4 years ago

Hi Jaime,

Many thanks for the quick response. I’ve done some further tests comparing the output of my standalone installation with the online version, and it seems the online version is behaving differently.

As an example, the following sequence is annotated as fliC:

2006_NODE_1409982_length_265_cov_0.904762_1 DIDLKKIDSTSLKLNSLTVSSNALNVSGTIDTVVAASAGSGSQVVSFAAAEVTKLNTANGTSLTASDLSLHEVQNASGAGTGTFVVKA

The output from our local installation with this sequence is as follows:

emapper version: emapper-1.0.3 emapper DB: 4.5.1

command: ./emapper.py -i /cluster/db/jabbott/GO_enrichment/barley_filter/test.fa -o test --output_dir eggnog_annotations --scratch_dir /tmp/418.1.all.q --data_dir /media/ramdisk/418.1.all.q --database bact -m diamond --cpu 12 --go_evidence non-electronic

2006_NODE_1409982_length_265_cov_0.904762_1 381666.H16_B2360 7.8e-16 88.2 FLIC GO:0002218,GO:0002221,GO:0002224,GO:0002253,GO:0002376,GO:0002682,GO:0002684,GO:0002755,GO:0002757,GO:0002758,GO:0002764,GO:0005575,GO:0005576,GO:0005618,GO:0005622,GO:0005623,GO:0005886,GO:0006950,GO:0006952,GO:0006955,GO:0007154,GO:0007165,GO:0008150,GO:0009288,GO:0009987,GO:0016020,GO:0023052,GO:0030312,GO:0031347,GO:0031349,GO:0034134,GO:0034142,GO:0034146,GO:0035681,GO:0042597,GO:0043226,GO:0043228,GO:0043229,GO:0043232,GO:0044424,GO:0044464,GO:0044699,GO:0044700,GO:0044763,GO:0045087,GO:0045088,GO:0045089,GO:0048518,GO:0048583,GO:0048584,GO:0050776,GO:0050778,GO:0050789,GO:0050794,GO:0050896,GO:0051716,GO:0055040,GO:0065007,GO:0071944,GO:0080134 K02406 bactNOG[38] 05I03@bactNOG,0BBTQ@bproNOG,16PW1@proNOG,COG1344@NOG NA|NA|NA N Flagellin domain-containing protein

Whereas having run the sequence through the online version I get:

emapper version: emapper-1.0.3-35-g63c274b emapper DB: 2.0

command: ./emapper.py --cpu 10 -i /data/shared/emapper_jobs/user_data/MM_lfrf1v9s/query_seqs.fa --output query_seqs.fa --output_dir /data/shared/emapper_jobs/user_data/MM_lfrf1v9s -m diamond -d none --tax_scope auto --go_evidence non-electronic --target_orthologs all --seed_ortholog_evalue 0.001 --seed_ortholog_score 60 --query-cover 20 --subject-cover 0 --override --temp_dir /data/shared/emapper_jobs/user_data/MM_lfrf1v9s

2006_NODE_1409982_length_265_cov_0.904762_1 381666.H16_B2360 3.7e-12 77.4 Burkholderiaceae fliC GO:0005575,GO:0005576,GO:0005623,GO:0009288,GO:0042995,GO:0043226,GO:0043228,GO:0044464 ko:K02406 ko02020,ko02040,ko04621,ko04626,ko05132,ko05134,map02020,map02040,map04621,map04626,map05132,map05134 ko00000,ko00001,ko02035 Bacteria 1K01X@119060,1MV1N@1224,2VJTA@28216,COG1344@1,COG1344@2 NA|NA|NA N Flagellin is the subunit protein which polymerizes to form the filaments of bacterial flagella

So the problem GO term (GO:0035681) is not present using the online version. I’ve added the additional supported arguments used by the online version to try to replicate the analysis as closely as possible (although --query-cover and --subject-cover do not seem to be supported):

emapper version: emapper-1.0.3 emapper DB: 4.5.1

command: ./emapper.py -i /cluster/db/jabbott/GO_enrichment/barley_filter/test.fa -o test --output_dir eggnog_annotations --scratch_dir /tmp/420.1.all.q --data_dir /media/ramdisk/420.1.all.q --database none -m diamond --cpu 12 --go_evidence non-electronic --target_orthologs all --seed_ortholog_evalue 0.001 --seed_ortholog_score 60 --override

2006_NODE_1409982_length_265_cov_0.904762_1 381666.H16_B2360 7.8e-16 88.2 FLIC GO:0002218,GO:0002221,GO:0002224,GO:0002253,GO:0002376,GO:0002682,GO:0002684,GO:0002755,GO:0002757,GO:0002758,GO:0002764,GO:0005575,GO:0005576,GO:0005618,GO:0005622,GO:0005623,GO:0005886,GO:0006950,GO:0006952,GO:0006955,GO:0007154,GO:0007165,GO:0008150,GO:0009288,GO:0009987,GO:0016020,GO:0023052,GO:0030312,GO:0031347,GO:0031349,GO:0034134,GO:0034142,GO:0034146,GO:0035681,GO:0042597,GO:0043226,GO:0043228,GO:0043229,GO:0043232,GO:0044424,GO:0044464,GO:0044699,GO:0044700,GO:0044763,GO:0045087,GO:0045088,GO:0045089,GO:0048518,GO:0048583,GO:0048584,GO:0050776,GO:0050778,GO:0050789,GO:0050794,GO:0050896,GO:0051716,GO:0055040,GO:0065007,GO:0071944,GO:0080134 K02406 bactNOG[38] 05I03@bactNOG,0BBTQ@bproNOG,16PW1@proNOG,COG1344@NOG NA|NA|NA N Flagellin domain-containing protein

This still reports the additional GO terms. The most significant difference I can see in the outputs is the database version.

Any suggestions on how to resolve this would be greatly appreciated.

Best Regards James
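The comparison between the two runs can be made explicit by treating each GO column as a set and taking the differences. This is a minimal sketch under stated assumptions: the function names `go_set` and `compare_go` are hypothetical, and the two example fields are shortened excerpts of the local and online GO columns quoted above, not the full lists.

```python
def go_set(go_field):
    """Parse a comma-separated GO field into a set of GO IDs."""
    return {term for term in go_field.split(",") if term.startswith("GO:")}


def compare_go(local_field, online_field):
    """Report which GO IDs are unique to each run and which are shared."""
    local, online = go_set(local_field), go_set(online_field)
    return {
        "local_only": sorted(local - online),
        "online_only": sorted(online - local),
        "shared": sorted(local & online),
    }


# Shortened excerpts of the two GO columns from this thread.
local_excerpt = "GO:0005575,GO:0009288,GO:0035681"
online_excerpt = "GO:0005575,GO:0009288,GO:0042995"

for key, terms in compare_go(local_excerpt, online_excerpt).items():
    print(key, terms)
```

Running the same comparison over a full annotation file would show whether the extra terms are limited to a few proteins or affect the dataset more widely.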


The University of Dundee is a registered Scottish Charity, No: SC015096

Cantalapiedra commented 4 years ago

Dear James,

I have tested your sequence with a local current version and I got:

emapper-2.0.1-96-g92f1e39

...

./emapper.py -i input.fa -o test --output_dir tmp --data_dir . -m diamond --cpu 12 --go_evidence non-electronic --target_orthologs all --seed_ortholog_evalue 0.001 --seed_ortholog_score 60

...

2006_NODE_1409982_length_265_cov_0.904762_1 381666.H16_B2360 3.7e-12 77.4 Bacteria Burkholderiaceae N Flagellin is the subunit protein which polymerizes to form the filaments of bacterial flagella COG1344@1|root,COG1344@2|Bacteria,1MV1N@1224|Proteobacteria,2VJTA@28216|Betaproteobacteria,1K01X@119060|Burkholderiaceae fliC GO:0005575,GO:0005576,GO:0005623,GO:0009288,GO:0042995,GO:0043226,GO:0043228,GO:0044464 ko:K02406 ko02020,ko02040,ko04621,ko04626,ko05132,ko05134,map02020,map02040,map04621,map04626,map05132,map05134 ko00000,ko00001,ko02035

It looks equivalent to the one you got with the online version:

2006_NODE_1409982_length_265_cov_0.904762_1 381666.H16_B2360 3.7e-12 77.4 Burkholderiaceae fliC GO:0005575,GO:0005576,GO:0005623,GO:0009288,GO:0042995,GO:0043226,GO:0043228,GO:0044464 ko:K02406 ko02020,ko02040,ko04621,ko04626,ko05132,ko05134,map02020,map02040,map04621,map04626,map05132,map05134 ko00000,ko00001,ko02035 Bacteria 1K01X@119060,1MV1N@1224,2VJTA@28216,COG1344@1,COG1344@2 NA|NA|NA N Flagellin is the subunit protein which polymerizes to form the filaments of bacterial flagella

I guess the difference is, as you said, due to the DB being used.

Also, --query-cov and --subject-cov seem to be working for me with this example. I got results with --query-cov 10 --subject-cov 10 but no results with --query-cov 100 --subject-cov 100

Best, Carlos


-- Carlos P. Cantalapiedra Post-doctoral researcher Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA) Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA) Campus de Montegancedo-UPM 28223-Pozuelo de Alarcón (Madrid) Spain

jhcepas commented 4 years ago

Thanks @Cantalapiedra for looking into this!

Besides DB updates, I don't think we can easily solve this kind of misannotation, as they seem to be inherited from sequence-based annotations in other DBs.

Note that, unless specifically requested, eggnog-mapper restricts the annotations so there are no cross-domain transfers. That is, a bacterial gene will never get annotations from eukaryotic orthologs. The problem you found is that a eukaryotic GO term was bound to a bacterial sequence, so eggnog-mapper cannot tell the difference.

jamesabbott commented 4 years ago

As far as I can see the earliest version of the database available is 4.5.0 - would you be able to make available an earlier database which hasn't pulled in these associated terms? The version 2.0 used by the online version doesn't have the problem annotation for this particular sequence, so may be a good starting point, and would make the output equivalent to having used the online version. Whether this would solve the wider problem or just this particular annotation I guess we have no way of knowing without running and comparing the full dataset. My dataset is pretty big, so would require >200 submissions of 100k sequences to actually run online, which is not really practical.

Thanks, James

Cantalapiedra commented 4 years ago

There may be some confusion here regarding database versions. The online version, which uses eggnog-mapper v2.0, uses the eggnog v5.0 database.


jamesabbott commented 4 years ago

I’m happy to admit I’m confused! I was going by what appears to be the database version reported by eggnogmapper:

Standalone:

emapper version: emapper-1.0.3 emapper DB: 4.5.1

Online:

emapper version: emapper-1.0.3-35-g63c274b emapper DB: 2.0

So they both look to be emapper 1.0.3 (without the git hash for the standalone version), but the DB versions reported are different, and neither of them reports 5.0. I used the conda build, which reports itself as ‘emapper-dd753fc’ with emapper.py --version. I’m not sure all this versioning is correct, however. If I run

emapper.py --query-cov 10 --subject-cov 10

it reports:

emapper.py: error: unrecognized arguments: --query-cov 10 --subject-cov 10

so it doesn’t look like we have equivalent software. My version was installed from bioconda:

(eggnog) login1:GO_enrichment $ which emapper.py
/cluster/gjb_lab/jabbott/miniconda3/envs/eggnog/bin/emapper.py
(eggnog) login1:GO_enrichment $ conda list|grep eggnog
# packages in environment at /cluster/gjb_lab/jabbott/miniconda3/envs/eggnog:
eggnog-mapper 1.0.3 py_3 bioconda

I’ll try doing a ‘native’ installation and see how that compares….

If I can replicate the setup which produces the output of the standalone version I think I should have something I can work with.

Many thanks James
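When comparing installations like this, the version information can be read from the output header rather than from the package manager. The sketch below assumes the header line has the form quoted in this thread ("emapper version: ... emapper DB: ..."); the function name `parse_versions` is hypothetical, and the format may differ in other releases.

```python
import re

# Pattern for the header line as it appears in the outputs quoted in this thread.
HEADER_RE = re.compile(r"emapper version:\s*(\S+)\s+emapper DB:\s*(\S+)")


def parse_versions(line):
    """Return (emapper_version, db_version), or None if the line is not a version header."""
    match = HEADER_RE.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


standalone = parse_versions("emapper version: emapper-1.0.3 emapper DB: 4.5.1")
online = parse_versions("emapper version: emapper-1.0.3-35-g63c274b emapper DB: 2.0")

print("standalone:", standalone)
print("online:", online)
print("same DB version:", standalone[1] == online[1])
```

Checking the header of every annotation file before merging results would catch a mix of installations early.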


jamesabbott commented 4 years ago

OK...so installing directly from github seems to give me a version which produces the correct output:

`# emapper version: emapper-2.0.1 emapper DB: 2.0

command: ./emapper.py -i /cluster/db/jabbott/GO_enrichment/barley_filter/test.fa -o test --output_dir eggnog_annotations --scratch_dir /tmp/457.1.all.q --data_dir /media/ramdisk/457.1.all.q --database none -m diamond --cpu 12 --go_evidence non-electronic --target_orthologs all --seed_ortholog_evalue 0.001 --seed_ortholog_score 60 --override

time: Wed May 13 10:06:44 2020

query_name seed_eggNOG_ortholog seed_ortholog_evalue seed_ortholog_score best_tax_level Preferred_name GOs EC KEGG_ko KEGG_Pathway KEGG_Module KEGG_Reaction KEGG_rclass BRITE KEGG_TC CAZy BiGG_Reaction taxonomic scope eggNOG OGs best eggNOG OG COG Functional cat. eggNOG free text desc.

2006_NODE_1409982_length_265_cov_0.904762_1 381666.H16_B2360 3.7e-12 77.4 Burkholderiaceae fliC GO:0005575,GO:0005576,GO:0005623,GO:0009288,GO:0042995,GO:0043226,GO:0043228,GO:0044464 ko:K02406 ko02020,ko02040,ko04621,ko04626,ko05132,ko05134,map02020,map02040,map04621,map04626,map05132,map05134 ko00000,ko00001,ko02035 Bacteria 1K01X@119060,1MV1N@1224,2VJTA@28216,COG1344@1,COG1344@2 NA|NA|NA N Flagellin is the subunit protein which polymerizes to form the filaments of bacterial flagella`

The DB is showing as 2.0, and the list of GO terms reported is much shorter, so the bioconda-installed version looks to be the problem. For some reason it seems to have given me 1.0.3, even though I installed into a new, clean environment, so there shouldn't have been any dependency issues leading to an old version being used. I'll do some more investigating to try to work out what has caused this, but I am able to produce the correct output now.

Many thanks James