eggnogdb / eggnog-mapper

Fast genome-wide functional annotation through orthology assignment
http://eggnog-mapper.embl.de
GNU Affero General Public License v3.0

Inconsistent result between EggNOG database 5.0 and eggnog-mapper #441

Open ucabuk opened 1 year ago

ucabuk commented 1 year ago

Hello,

I noticed that some of my genes were assigned to eukaryotes by diamond blastp, but eggnog-mapper assigned them to prokaryotic orthologous groups.

When I dug deeper to understand the source of the problem, I found some inconsistent results.

I selected a gene that eggnog-mapper assigned to a prokaryotic orthologous group, then submitted the same gene to the eggNOG 5 website. On the website, the first two results assigned the gene to Bacteria, while the third assigned it to a Eukaryota orthologous group.

Next, I ran eggnog-mapper on the same gene, this time with the option --tax_scope Eukaryota. However, the output contained no results for the query.

So my question is: how can I retrieve any eukaryotic results that exist besides the prokaryotic one but are not on the first line?

Thank you. Ugur

Cantalapiedra commented 1 year ago

Hi @ucabuk ,

Could you provide a specific example, or the sequence that produces these differences, please? That would make it easier to understand what is happening and why.

Thank you.

Best, Carlos

ucabuk commented 1 year ago

Hi @Cantalapiedra

Thanks for your answer. You can find them below.

When I submit that sequence to the eggNOG 5 website, the results are listed as root,bacteria,root,bacteria,eukaryota.... However, when I run eggnog-mapper in the terminal, I cannot see the eukaryota result, probably because it is not the top result. That is why I set the parameter --tax_scope Eukaryota. Even so, I got no results.

What is the reason? How can I take eukaryotes into account in this kind of situation? The query might still be a eukaryotic protein of prokaryotic origin. Thanks. Ugur

sequence_1 INFGPQHPAAHGVLRLVLELDGEIVERVDPHIGLLHRGTEKLIEHKTYLQAVPYFDRLDYVAPMNQEHAFCLAVEKLLGISVPKRAQLIRVLYSEIGRLLSHLLNVTTQAMDVGALTPPLWGFEEREKLMVFYERVSGSRMHAAYFRPGGVHQDMPDKLVDDIYAFCDPFLKVVDDLDSLLTGNRIFKQRNVDIAIVKLEDAWNWGFSGVLVRSAGAAWDLRKSQPYECYNELDFDIPIGKNGDCYDRYLIRMEEMRQSTKIMKQCLEKLRSAEGRGPVAVDDNKIVPPKRSEMKRSMEALIHHFKLYTEGYHVPAGEAYAAVEAPKGEFGVYLVADGTNKPYKCKIRAPGFAHLSAMDFLCKGHLLADVSAILGPKEFAFNKANLEWAQKQVTHYPEGRQQSAIIPLLWRAQEQHGGWLPEAAIRYVAEFLGMAHIRALEVATFYTMFVLQPCGTRAHVQVCGTTPCRLRGADALFEVCHNRIGHEPFVPSADGKLSWEEVECLGACVNAPMVLIWSDTYEDLTAETFEKVLDGFAKGKPVKPGPQADKDRIFKNLYGLHDWSLKGARARGAWDNTKTILEKGRDAVIEEVKSSGLRGRGGAGFPTGLKWSFMPKKNDGRPHYLVVNADESEPGTCKDREIMRHDPHLLVEGCLIAGFCMGANTGYIYVRGEFIREREHLQAAIDQAYEAKLIGKGNVHGWDFDLYVHHGAGAYICGEETALLESLEGKKGQPRLKPPFPANVGLYGCPTTVNNVESIAAVPDILRRGPSWFAGMGRPNNTGTKLFCISGHVEKPCNVEEVMGIPLRELLETHAGGVRGGWDNLLAVIPGGSSMPLVPAGADQADTLLMDFDGCRDKKSALGTAAVIVMDKSTDIIRAMARISYFYKHESCGQCTPCREGMGWMWRVLTRMAEGRAQKREIGMLMEVTQQIEGHTICAFGDGAAWPVQGLLRHFRPEIERRIVDGKEVDVPPEYTLLQACEAAGAEIPRFCYHERLSIAGNCRMCLIEVAGIPKPQASCAIGVKDLQPNKDGSPKVLNTKTPMVKKAREGVMEFLLINHPLDCPICDQGGECDLQDQAMAYGIDSSRYQENKRAVEEKYIGPLVKTIMTRCIHCTRCIRFSTEVAGVSELGAIGRGEDMEITTYLEHAMSSELQSNVVDLCPVGALTSKPYAFAARPWELNKTQSVDVMDAVGSAIRIDTRGREVMRILPRINDDVNEEWISDKTRHVVDGLRTQRLDQPYIRSNGRLRPATWAEAFAAIAEKVKAAGKNVGAIAGDLAGVEEMFALKDLMTRLGSANIDARQDGAAFDPAWGRASYLFNSTIAGIERADALLIIGANPRREAAVLNARIRKRWRAGNFPVALIGEKADLTYTYDYLGAGAETLAGLAKTKFAETLKAAANPLIILGAGAVARKDGAALASLAAKAALEYGAIAEGWNGFSVLHTAASRVGALDIGFVPGKGGKTAAEMAAGGADVLFLLGADEIDVAPGSFVVYIGTHGDKGAHRADVILPGAAYPEKSAIYVNTEGRVQMAQRAAFPPGDAREDWAILRALSDVLGHKLPYDSLGALRQAVFAAHPHMMRIDQIAPGDAANIGTLANLGGSFDKAPLRATVTDFYLTNPIARASATMAECSAILLVGIAYVLLADRKIWAAVQMRRGPNVVGPWGLFQSFADLLKFVLKEPVIPSGSNKGVFLLAPLVTCVLALAAWAVIPVNVGWAIADINVGVLYIFAISSLMVYGIIMAGWASNSKYAFLAAVRSAAQMVSYEVSIGFVIITVLLCAGSLNLTAIVNAQDGPYGLLGWYWLPLFPMFIVFFISALAETNRPPFDLVEAESELVAGFMVEYGSSPYMMFMLGEYVAIVTMCAMATIMFLGGWLPPVPYAPFTWVPGVIWFTLKVLFMFFMFAMVKAIVPRYRYDQLMRLGWKVFLPLSLAMVAIVAAVIGEHAQRRYSNGEERCIACKLCEAICPAQAITIEAGPRRNDGTRRT
TRYDIDMVKCIYCGLCQEACPVDAIVEGPNFEFATETREELYYDKERLLANGDRWEREIAKNIALDGVQPMTIGLGHYLSVAAILFTLGIFGIFLNRKNVIIILMSIELILLAVNINLVAFSAHLGDIVGQIYALFVLTVAAAEAAIGLAILVVYFRNRGSIAAAGSRTAELITTTLLMISMILSWIAFVQVGFGHADVRVPIFTWIASGDLKIEWALRIDTLTAVMLVVVNTVSAFVHLYSIGYMNEDPYRPRFFAYLSIFTFFMLMLVTSDNLVQMFFGWEGVGLASYLLIGFWYHKPEANAAAIKAFVVNRVGDFGFALGIFALFAMVGAVDLDTVFAQAPSLTGKTMWFFGYHPDALTIICLLLFMGAMGKSAQFLLHTWLPDAMEGPTPVSALIHAATMVTAGVFMVARLSPLFELAPNAQTFVTFIGATTAIFAATIGLVQNDIKRIVAYSTCSQLGYMFVAMGCGAYSVGMFHLFTHAFFKALLFLGSGSVIHAMHHEQDIRHMGGLKDRIPFTYIVMIVGTLALTGFPLTAGYFSKDAIIEAAYVGKNPMALYAFVCTVAAALLTSFYSWRLIFKTFHGEPHDRKHWKEAHESPMTMLIPLGFLAAGSVLAGLPFKEVFAGHGVEGFFREALVFAKTNTVLDDMHHVPLHIALLPTVMMAIGFAIAWHFYIRRPDIPVELARQHDFLYRFLLNKWYFDELYEIIFVKPAKWIGRELWKKGDGWLIDGFGPDGVSARVLDVTRNVVRLQTGYLYHYAFAMLIGAAAFITWFMAKGTARWVAMWTTLVTFAISLVMVVRFDPTSADFQFSENHPWLGVANYHMGVDGISLPFVILTTALMPICILASWTSIQKRVKEYMIAFLVLETLMIGTFSALDLVLFYLFFEGGLIPMFLIIGVWGGQRRVYASFKFFLYTLLGSVLMLLAIMAMYWEAGTTDIPALMKHGFPLGMQKWAWLAFFASFAVKMPMWPVHTWLPDAHVEAPTAGSVILAAILLKMGGYGFLRFSIPMFPVASHDFAPLIFTLSVVAIIYTSLVALMQEDVKKLIAYSSVAHMGFVTMGIFAGTAQGIAGGVFQMISHGIVSGALFLCVGVVYDRMHTREIAFYGGLVNRMPVYAAIFMIFTMANVGLPGTSGFVGEFLVLIGTFKNNIAVAFFATFGVILSACYALWLYRKMIFGPLKPALAGINDIGWREAVIFAPLVILTILFGVAPKPVLDMSAASVTQLLDGYNKAIKTAESERSASIVNAWCIALLVLVAVTLLYVPGGRTELFGGSFVVDDYARFLKLLAITGSAGALMLSLDYLSMDKQQRFEYGVLFLLSTLGMMMLISANDLIALYLGLELMSLPLYVVAASNRDSLRSTEAGLKYFVLGALSSGMLLYGASLIYGFTGTVNFAGIAKATSSGAGIGLIFGLVFLFVGFCFKISAVPFHMWTPDVYEGSPTPVTAFFAAAPKVAGIAIFVRATVVAFPSITHEWQQIVVFVSIASMVLGAFAAIGQKNIKRLMAYSSIGHMGFALVGLAAGTQEGVQGVLVYMSIYVVMTLGTFACILAMRRDGMLVENISDLAGLSRTQPAMAFFLAMLLFSLAGIPPLAGFFAKFYVFLAAIKAGLYVLAVIGVLASVVGAYYYLTIVKIMYFDEPAKSFQAMPGLLKLVLAVAGLINILFFAYPGPLLGAATDLAGVRHIAYETLGSTNAEALARARAGERGPLWITAATQSAGRGRRGSTWVSGPGNLFATLLLTEPSPPEAAPQLSFVSGLALHDALAECAPQLGPLLKLKWPNDLLLGGAKLAGILIEGESDPAFAVAIGIGVNCAAHPNDTPYPAADLATSGALVSPTQVLDVLSRAMNRRLEQWQRGQGFASVRVDWLKRAAGLGQDIRVRLPERELSGRFQGLDDMGRLLLQAANGVTTVTAGEVF

Cantalapiedra commented 1 year ago

Hi @ucabuk ,

Thank you for your answer. Maybe the reason is that in eggNOG-mapper the best hit is to a protein from a Bacteria OG. In that case, even if you try to limit the scope to Eukaryota, the intersection will be empty.

Could you share the annotation results that you obtain from eggnog-mapper for the sequence you shared above, to see whether this is the case?

Ideally, if the protein were from Eukaryota, even if it is of Bacteria origin, the best hit should be from Eukaryota. Of course, this may not hold because of limitations of the eggNOG 5 database, or other technical issues.

If you want to target only Eukaryota hits, then you should maybe focus on the first eggnog-mapper stage (the "search" stage). Why? Because the annotation stage uses the best hit from the search stage to determine the candidate OGs to be used for annotation. If the hit protein is from Bacteria, then the chosen OGs will be from Bacteria. Therefore, to tune the search stage, you may need to modify the reference DB used by diamond, mmseqs or HMMER. With HMMER, you may use the HMM models from Eukaryota only. When using diamond or mmseqs, you may need to create your own DB with eggNOG 5 proteins from Eukaryota only. You may check the use of create_dbs.py here: https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.10#setup
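To make the reasoning above concrete, here is a toy sketch in Python. It is only a simplified model of the behaviour described in this comment, not emapper's actual code; the hit names, scores, OG names and the `annotate` function are all hypothetical.

```python
# Toy model of the search -> annotation logic described above.
# NOT the real emapper implementation; every name and value here is hypothetical.

# Hypothetical search-stage hits for one query: (hit protein, bit score).
hits = [("bact_prot_1", 950.0), ("bact_prot_2", 900.0), ("euk_prot_1", 870.0)]

# Hypothetical mapping from each hit protein to its OGs as (og_name, taxonomic level).
protein_ogs = {
    "bact_prot_1": [("OG_bact", "Bacteria")],
    "bact_prot_2": [("OG_bact", "Bacteria")],
    "euk_prot_1": [("OG_euk", "Eukaryota")],
}


def annotate(hits, protein_ogs, tax_scope=None):
    """Return the candidate OGs of the best hit only, optionally filtered by tax_scope."""
    if not hits:
        return []
    # Only the single best-scoring hit decides which OGs are considered.
    best_protein, _score = max(hits, key=lambda hit: hit[1])
    candidates = protein_ogs[best_protein]
    if tax_scope is None:
        return candidates
    # The scope is intersected with the best hit's OGs, not with all hits.
    return [og for og in candidates if og[1] in tax_scope]


no_scope = annotate(hits, protein_ogs)
euk_scope = annotate(hits, protein_ogs, tax_scope={"Eukaryota"})
print(no_scope)   # OGs taken from the Bacteria best hit
print(euk_scope)  # empty: the Eukaryota hit was never the best hit
```

In this model the Eukaryota hit is never consulted because it is not the best hit, so restricting the scope to Eukaryota leaves nothing to report.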

I hope any of this makes sense.

Best, Carlos

ucabuk commented 1 year ago

Hi @Cantalapiedra

Thank you for your answer. You are right. I get a Bacteria result for the same protein in the eggNOG-mapper output, and if I limit the scope to Eukaryota, the result is empty, as you said. Does that mean the eukaryotic protein is of bacterial origin? Can I still use the result for a eukaryote even though the OG is from Bacteria? I think the function should be similar, right?

Regarding your second suggestion about the "search" stage: I would rather not do that, as the database would be biased. However, I would like to make a small suggestion. I do not know the algorithm behind the best-hit selection, but would it make sense to add an LCA (lowest common ancestor) approach as an option, similar to taxonomic classification, to choose the best hit, or to report the top 4-5 results at once?

Best, Ugur

Cantalapiedra commented 1 year ago

Hi @ucabuk ,

From what you said, I understand that you know the protein is from Eukaryota? I am not sure you could say, from an emapper result, that the protein is of Bacteria origin. You could say that there are homologs (maybe orthologs?) in Bacteria. If they are orthologs, maybe the function is similar. But I am not an expert on this and have no means to really say whether the functions will be similar or not. Sorry.

You say you don't want to change the "scope" during the search stage, but you do want to change it during the annotation stage. In that case, the emapper algorithm currently has no means to give priority to the Eukaryota result over the Bacteria one, if I am not mistaken.

Regarding the LCA approach, thank you very much for the suggestion. I guess that in your example it would lead to reporting annotations at the root level, if any. Of course, there are always difficult questions to address, such as which e-value/score thresholds to use when selecting hits for the LCA algorithm, or the fact that the scope of annotations would be much broader in many cases, since the OGs would only be the ones at higher levels, which I am not sure is the best approach in all cases...
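A minimal sketch of what such an LCA step could look like, in Python. This is not an existing emapper feature; the function name and the lineages below are hypothetical, and it assumes each retained hit comes with a root-first list of taxonomic levels.

```python
# Hypothetical LCA over the taxonomic lineages of the top hits.
# Illustration only: emapper does not implement this, and the lineages are made up.

def lowest_common_level(lineages):
    """Return the deepest level shared by all lineages (each listed root first)."""
    common = None
    # Walk the lineages in parallel, one taxonomic depth at a time.
    for levels in zip(*lineages):
        if all(level == levels[0] for level in levels):
            common = levels[0]
        else:
            break
    return common


# Assumed lineages for three top hits: two from Bacteria, one from Eukaryota.
top_hit_lineages = [
    ["root", "Bacteria", "Proteobacteria"],
    ["root", "Bacteria", "Firmicutes"],
    ["root", "Eukaryota", "Fungi"],
]

lca_all = lowest_common_level(top_hit_lineages)
lca_bacteria_only = lowest_common_level(top_hit_lineages[:2])
print(lca_all)            # root
print(lca_bacteria_only)  # Bacteria
```

With mixed Bacteria and Eukaryota hits the shared level collapses to root, which is the broadening effect mentioned above. How many hits to include, and under which e-value/score threshold, is left open here.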

We are always trying to improve the emapper algorithm with ideas like yours. However, to be honest, our current main limitation is the size of the database (e.g. eggNOG 6), since for eggnog-mapper we usually prioritize speed and the ability to cope with large datasets in a reasonable time. Also, if I am not mistaken, there are other tools that use an approach similar to the one you suggest, like Mantis.

Best, Carlos