missing gene in annotation file compare to seed_ortholog

mariabernard commented 1 year ago

Hi,

I am using eggnog-mapper (version 2.1.8) for the first time to annotate 13M genes (protein sequences of CDSs predicted by Prodigal) from metagenomic data. I used default parameters except for --target_orthologs, which I set to one2one.

Of the 13M genes, 9.3M get an alignment and are written to the seed_orthologs file. I guess that the other sequences have no DIAMOND hit, or only hits with too high an e-value (> 0.001).

My question is about the 107 genes that are present in the seed_orthologs file but missing from the annotation file. Some of them have quite good alignments against the reference database, for example:

#qseqid sseqid  evalue  bitscore    qstart  qend    sstart  send    pident  qcov    scov
seq1    1147152.K7PKJ2_9CAUD    0   2234    1   1137    1   1137    97.6    99.9    100

The corresponding sequence is:

>seq1
MGKGSSKGHTPREAKDNLKSTQLLSVIDAISEGPVEGPVDGLKSVLLNSTPVLDSEGNTN
ISGVTVVFRAGEQDQTPPEGFESSGSETVLGTEVKYDTPITRTITSANIDRLRFTFGVQA
LVETTSKGDRNPSEVRLLVQIQRNGGWVTEKDITIKGKTTSQYLASVVVDNLPPRPFNIR
MRRMTPDSTTDQLQNKTLWSSYTEIIDVKQCYPNTALVGVQVDSEQFGSQQVSRNYHLRG
RILQVPSNYNPQTRQYSGIWDGTFKPAYSNNMAWCLWDMLTHPRYGMGKRLGAADVDKWA
LYVIGQYCDQSVPDGFGGTEPRITCNAYLTTQRKAWDVLSDFCSAMRCMPVWNGQTLTFV
QDRPSDKVWTYNRSNVVMPDDGAPFRYSFSALKDRHNAVEVNWIDPDNGWETATELVEDT
QAIARYGRNVTKMDAFGCTSRGQAHRAGLWLIKTELLETQTVDFSVGAEGLRHVPGDVIE
ICDDDYAGISTGGRVLAVNSQTRTLTLDREITLPSSGTTLISLVDGEGNPVSVEVQSVTD
GVKVKVSRVPDGVAEYSVWGLKLPTLRQRLFRCVSIRENDDGTYAITAVQHVPEKEAIVD
NGAHFDGDQSGTVNGVTPPAVQHLTAEVTADSGEYQVLARWDTPKVVKGVSFLLRLTVAA
DDGSERLVSTARTTETTYRFTQLALGRYMLTVRAVNAWGLQGDPASVSFRIAAPAAPSRI
ELTPGYFQITATPHLAVYDPTVQFEFWFSEKRIADIRQVETTARYLGTALYWIAASINIR
PGHDYYFYVRSVNTVGKSAFVEAVGRASDDAEGYLDFFKGQITESHLGKELLEKVELTED
NASRLEEFSKEWKDASDKWNAMWGVKIEQTKDGKHYVAGIGLSMEDTEEGKLSQFLVAAN
RIAFIDPANGNETPMFVAQGNQIFMNDVFLKRLTAPTITSGGSPPVFSLTSDGKLTAKNA
DISGSVNANSGTLNNVTINENCTIKGMLEATQVRGDFVKAVSKAFPKKVGTWGNTETPNG
TVTVTISDDHNFDRQIIIPPIIFNGIAYDDPGSGNNPGGTRYTGYGFEVRKNGVLIASRE
TKGAIPGSYSAVIDMPSGRGSVTLEFKIFQKGNQGAGNITDCTVIVTKKAASGISIR*

This is a very small number compared to the number of output genes, but I would like to understand why they are missing. Moreover, how can I access the reference 1147152.K7PKJ2_9CAUD? When I submitted the example sequence on the eggNOG website I could not find this reference, but maybe it is a question of version?

Curiously, most of the missing genes had alignments with references ending in _9CAUD. Does this mean something?
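
For readers who want to reproduce this check, here is a minimal Python sketch that lists the queries with a seed ortholog but no annotation row, and tallies the suffix of their seed IDs (such as 9CAUD). It assumes both files are tab-separated, that comment and header lines start with `#`, and that the query ID is the first column and the seed ortholog ID the second, as in the excerpt above. The function names and the file paths passed to `report` are hypothetical.

```python
from collections import Counter


def data_rows(lines):
    """Yield tab-split fields for every non-empty, non-comment line."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield line.rstrip("\n").split("\t")


def missing_hits(seed_lines, annotation_lines):
    """Map query ID -> seed ortholog ID for queries absent from the annotations."""
    annotated = {fields[0] for fields in data_rows(annotation_lines)}
    return {
        fields[0]: fields[1]
        for fields in data_rows(seed_lines)
        if fields[0] not in annotated
    }


def suffix_counts(missing):
    """Count the text after the last underscore of each seed ortholog ID."""
    return Counter(sseqid.rsplit("_", 1)[-1] for sseqid in missing.values())


def report(seed_path, annotation_path):
    """Print the number of missing queries and the most common seed ID suffixes."""
    with open(seed_path) as seeds, open(annotation_path) as annotations:
        missing = missing_hits(seeds, annotations)
    print(len(missing), "queries have a seed ortholog but no annotation row")
    for suffix, count in suffix_counts(missing).most_common(5):
        print(suffix, count)
```

Calling `report` with the paths of your own seed_orthologs and annotations files prints the count and the five most frequent suffixes. Seed IDs without an underscore are counted under their full ID.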

Finally, I have a question about acceptable filtering thresholds. As I used default parameters, results are only filtered on the e-value threshold and on the top 3% hits from DIAMOND. As a result, I also have alignments of very poor quality. What are the recommended thresholds in terms of % identity and % coverage (or some other metric) to accept a match and the associated functional annotation?

Thank you for any help you can give.

Maria

Cantalapiedra commented 1 year ago

Hi @mariabernard ,

You are right! 1147152.K7PKJ2_9CAUD seems to be missing from the eggNOG 5 website. Either:

It is a bug, and it should be present in the DB.
Or, it didn't make it to any of the OGs, and that is why it was left outside.
Or, both!

I guess that is also why most of these hits are to 9CAUD proteins. If the protein is missing, then all its annotations are missing. In any case, I am not able to change, fix or update the eggNOG 5 database myself. My apologies for this.

Regarding parameters, I am not sure I can help. The defaults are what we thought would work in most situations, but of course every project and analysis is different. There are some DIAMOND parameters that you can adjust through eggnog-mapper; I sometimes use the ultra-sensitive mode, for instance. If you need further options, you can even run DIAMOND yourself to obtain the hits you want, and then use them as input for eggnog-mapper. That said, I am not sure that the e-value threshold or the top 3% hits setting is the reason for the poor alignment quality. Why do you believe so? If you wish, you can share an example in which you would expect a better alignment against eggNOG 5 proteins.

I hope any of this helps.

Best, Carlos

mariabernard commented 1 year ago

Hi Carlos,

Thanks for your reply.

As I am completely new to functional annotation, I do not have any expectations in terms of % identity and coverage. But I am surprised to see eggNOG matches with only 15.9% identity (average = 56.95%, median = 55.1%), or with only 0.9% qcov (average = 87.33%, median = 96.4%).
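
Summary figures like these can be computed directly from the seed_orthologs file. The sketch below is one way to do it, assuming a tab-separated file whose columns follow the header shown earlier in this thread (pident in the ninth column, qcov in the tenth). The constant and function names are illustrative only.

```python
import statistics

# Zero-based column positions, taken from the header shown in this thread:
# qseqid sseqid evalue bitscore qstart qend sstart send pident qcov scov
PIDENT, QCOV = 8, 9


def column_values(lines, column):
    """Collect one numeric column, skipping comment and blank lines."""
    values = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        values.append(float(line.rstrip("\n").split("\t")[column]))
    return values


def summarize(values):
    """Return the minimum, mean and median of a non-empty list of numbers."""
    return {
        "min": min(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }
```

For example, `summarize(column_values(open(path), PIDENT))` gives the identity statistics for the file at `path`. An empty file raises `ValueError`, because `min` needs at least one value.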

By default, eggnog-mapper filters alignments only on e-value (0.001), but with such low minimum identity or coverage I guess there is some doubt about the annotations returned, no?

For now, I have just made general checks and observations. So my question was more general: can we trust results that come from alignments with 40% or 50% identity?

Have a nice day

Maria

Cantalapiedra commented 1 year ago

Hi @mariabernard ,

Actually, if you look at http://eggnog-mapper.embl.de/ , under the "Search filters" options you will see that the defaults are different from those of the command line tool. Maybe those thresholds could be a hint.

I don't know what others do (many have more expertise than me), and I guess there could be a lot of discussion about this, but I change the thresholds depending on the aim of my analysis. For example, if you are looking for close, within-species homologs, you may look for identity around 70-80% and coverage around 30-50%. If you only want within-species orthologs, maybe you should use 92-95% identity and around 50-75% coverage. For inter-species homology you may relax identity to 50%, for instance, or down to 20-30% for more remote hits. So, answering your last question, my bet is that you may trust 40-50% identity alignments, if you trust the interpretation you give to those hits.

But these are only some ideas. Please don't take them as advice. Maybe it is better to look for studies similar to yours and check the thresholds they use.
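
Whatever thresholds are chosen, applying them to an existing seed_orthologs file is straightforward. The following is a minimal sketch, not part of eggnog-mapper itself, and again assumes a tab-separated file with the column order shown in the header earlier in this thread. The function name is hypothetical, and the cutoffs used in the usage note are arbitrary illustrations rather than recommendations.

```python
# Zero-based column positions, taken from the header shown in this thread:
# qseqid sseqid evalue bitscore qstart qend sstart send pident qcov scov
PIDENT, QCOV, SCOV = 8, 9, 10


def filter_hits(lines, min_pident=0.0, min_qcov=0.0, min_scov=0.0):
    """Yield header lines unchanged and only the hits passing every cutoff."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep comment/header lines as they are
            continue
        if not line.strip():
            continue  # drop blank lines
        fields = line.rstrip("\n").split("\t")
        if (
            float(fields[PIDENT]) >= min_pident
            and float(fields[QCOV]) >= min_qcov
            and float(fields[SCOV]) >= min_scov
        ):
            yield line
```

For instance, `filter_hits(open(path), min_pident=50, min_qcov=50)` keeps only hits with at least 50% identity and 50% query coverage. The surviving lines can be written to a new file and compared against the annotations obtained with the defaults.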

I hope this is of help.

Best, Carlos

mariabernard commented 1 year ago

This helps a lot. Don't worry, I was looking for clues about which annotations are probably bad.

Thanks again

eggnogdb / eggnog-mapper

missing gene in annotation file compare to seed_ortholog #446