Question: choosing the right value of 'target_orthologs'

alimayy commented 4 years ago

Hi, could you give some insight into when one would choose different values of 'target_orthologs'?

--target_orthologs {one2one,many2one,one2many,many2many,all} defines what type of orthologs should be used for functional transfer

For instance, when the input protein sequences are from

a well-studied organism/genus
a complex (soil, ocean) metagenome sample

or when one wants to

minimise the number of wrong annotation(s)/false positives
maximise the number of sequences annotated e.g. with EC numbers, KEGG, COG, etc.

Thanks in advance

Cantalapiedra commented 4 years ago

Hi,

I would say in general that:

To minimise the number of wrong annotations/false positives: one2one
To maximise the number of annotations: all

Regarding the kind of sample, if you can narrow the tax_scope, because you are working with a well characterized genus, then maybe you can trust the co-orthology relationships and the existing functional annotations more than if you work with a broader tax_scope, in which case you may wish to test what you obtain with one2one. However, I would say that the quality of the annotations you get will vary with the specific protein family: how well known it is and how it evolves (the topology of the family in terms of paralogs, orthologs and horizontal gene transfer, the rate of gain and loss of family members, and so on). All of this is my guess.

Another option could be running eggnog-mapper with one2one, and then performing another run with all for those queries left without annotation by the first run.
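
A minimal sketch of the bookkeeping between those two runs, for illustration only. It assumes the first run produced a tab-separated annotations table whose first column is the query ID and whose comment and header lines start with '#'; check this against the output of your eggnog-mapper version. The function names are hypothetical, and any query with a row in the table is counted as annotated, which may be too lenient if you care about specific fields such as EC numbers.

```python
from pathlib import Path


def annotated_queries(annotations_path):
    """Return the set of query IDs that have a row in the annotations table."""
    ids = set()
    for line in Path(annotations_path).read_text().splitlines():
        # Assumption: comment and header lines start with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        # Assumption: the query ID is the first tab-separated column.
        ids.add(line.split("\t", 1)[0])
    return ids


def write_unannotated_fasta(fasta_path, annotations_path, out_path):
    """Copy FASTA records whose ID has no row in the annotations table.

    Returns the list of IDs written, in input order.
    """
    done = annotated_queries(annotations_path)
    kept = []
    keep = False
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # The record ID is the first word after '>'.
                header = line[1:].split()
                seq_id = header[0] if header else ""
                keep = seq_id not in done
                if keep:
                    kept.append(seq_id)
            if keep:
                dst.write(line)
    return kept
```

The FASTA written by write_unannotated_fasta would then be the input of the second, more permissive run.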

Best, Carlos

alimayy commented 4 years ago

Thanks @Cantalapiedra for the elaborate answer!

"if you can narrow the tax_scope, because you are working with a well characterized genus, then maybe you can trust the co-orthology relationships and the existing functional annotations more than if you work with a broader tax_scope"

Do you mean that if the tax_scope of the well-characterised organism/genus is provided as input, then the possibly negative impact of 'all' in terms of higher false positives would be less/minimal?

Cantalapiedra commented 4 years ago

Hi,

The narrower the tax_scope, the fewer positives I would expect, and thus the fewer false positives. If you are confident that you are getting annotations of sufficient quantity and quality with a narrower tax_scope, I would use that and the default --target_orthologs.

However, if you observe in your analysis that the factor increasing false positives is the 'all' orthologs parameter, then you could give 'one2one' a try. I guess it is difficult to say in general, though, since the effect should also differ between protein families.

As I said above, I would try (maybe with a subset of queries) first a more stringent analysis (tax_scope = whatever_fits_your_data, orthologs 'one2one'), and maybe another analysis with more sensitivity (tax_scope = auto, orthologs 'all'), and take a look at the results. Or run both steps: first for all queries, second for only those without (enough) annotation from the previous step.
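
The two configurations above could be written down as follows. This is a sketch under assumptions: --target_orthologs and --tax_scope come from this thread, while the executable name emapper.py and the -i / -o flags are assumed here and should be checked against emapper.py --help for your version. The function name, file name and output prefixes are hypothetical, and the tax_scope placeholder is kept as-is.

```python
# Values listed for --target_orthologs in the question above.
TARGET_ORTHOLOGS = ("one2one", "many2one", "one2many", "many2many", "all")


def build_emapper_cmd(fasta, prefix, target_orthologs, tax_scope):
    """Build an argv list for one run; nothing is executed here."""
    if target_orthologs not in TARGET_ORTHOLOGS:
        raise ValueError(f"unknown target_orthologs: {target_orthologs!r}")
    return [
        "emapper.py",  # assumed executable name
        "-i", fasta,   # assumed input flag
        "-o", prefix,  # assumed output-prefix flag
        "--target_orthologs", target_orthologs,
        "--tax_scope", tax_scope,
    ]


# Hypothetical file name and prefixes, run on a subset of queries.
strict = build_emapper_cmd(
    "subset.fasta", "strict", "one2one", "whatever_fits_your_data"
)
sensitive = build_emapper_cmd("subset.fasta", "sensitive", "all", "auto")

for cmd in (strict, sensitive):
    print(" ".join(cmd))
```

Once the flags are confirmed, each list can be passed to subprocess.run(cmd, check=True), and the two annotation tables compared side by side.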

Still, the impact of many2one, many2many and one2many relationships is something that we could try to assess in the future.

Glad to try to help, and thank you for your questions.

Best, Carlos

alimayy commented 4 years ago

Thanks @Cantalapiedra. To give some background on my question, I work with well-studied genera like Lactococcus. I haven't checked the emapper results for false positives with target_orthologs='all', but assuming that target_orthologs='all' would be overkill for annotating these (presumably) well-studied proteins, I used 'one2one' (without specifying a tax_scope). But from your explanation I understand that specifying the tax_scope (e.g. =Streptococcus) and using target_orthologs='all' is also a reasonable option, which I will test as you suggested.

Many thanks again, and good luck with the release of the refactor version!

Cantalapiedra commented 4 years ago

Thank you very much @alimayy. I think we are close to merging refactor into the master branch. Yes, I guess I would pay more attention to one2one vs many2many when working with a protein family that I suspect could introduce false positives due to paralogs having different functions. I encourage you to try with tax_scope (sorry, I am not familiar with Lactococcus) and check how good your annotations are.

I will close the issue. Feel free to open a new issue or re-open this one in case you need further help or explanations.

Best, Carlos

eggnogdb / eggnog-mapper, issue #238