Adding more information to NOT annotations

cmungall commented 9 years ago

Currently NOT annotations use a broad taxonomic grouping, but don't actually indicate what groups the negative statement pertains to. For example:

HOM:0000007     historical homology    UBERON:0002470   autopod region  NOT     7742    Vertebrata      RAW     ECO:0000033     traceable author statement      CIO:0000004     medium confidence from single evidence  PMID:23598338   Amemiya CT, Alfoeldi .... Lindblad-Toh K, The African coelacanth genome provides insights into tetrapod evolution. Nature (2013) There are two major hypotheses about the origins of the autopod; that it was a novel feature of tetrapods, and that it has antecedents in the fins of fish.      bgee    ANN     2015-02-03

Naively we may assume that the autopod is never homologous in vertebrates, but this would be wrong. What the authors are saying IMO is that they disbelieve there is homology between the tetrapod autopod and the autopods of other Sarcopterygii.

Similarly:

HOM:0000007     historical homology    UBERON:0002165   endocardium     NOT     7711    Chordata        RAW     ECO:0000071     morphological similarity evidence       CIO:0000004     medium confidence from single evidence  PMID:17223594   Davidson B, Ciona intestinalis as a model for cardiac development. Ciona intestinalis as a model for cardiac development. Semin Cell Dev Biol (2007)  There is no discernible endocardium in the Ciona heart. bgee    ANN     2013-07-19

The crucial information here is the taxon of the studied organism, Ciona, vs (presumably) vertebrates. This could be broadened to all tunicates.

I think all NOTs should be pairs of taxa. E.g. Coelacanthimorpha|Tetrapoda for the first, and Tunicata|Vertebrata for the second.

Summing these up to a taxon that subsumes the pair is not wrong, it's just less useful. For example, if we have expression data about coelacanths and mammals we can test the hypothesis. But it would be wrong to test the hypothesis by comparing mouse and human.

A compromise would be to narrow the bracket. E.g. for the first example, using Sarcopterygii would be better, as we can test the negative hypothesis by examining pairs of taxa immediately under this taxon. But it would still be better to reflect the statement of the authors more directly.
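To make the pair proposal concrete, here is a minimal sketch of how a pair of taxa attached to a NOT could decide which species comparisons actually test the negative hypothesis. The `PARENT` map is a simplified toy taxonomy, and `belongs_to` and `pair_tests_not` are hypothetical names used only for illustration; none of this is an existing Bgee file format or API.

```python
# Toy taxonomy, child -> parent. Simplified on purpose (assumption):
# intermediate ranks are omitted.
PARENT = {
    "Latimeria chalumnae": "Coelacanthimorpha",
    "Coelacanthimorpha": "Sarcopterygii",
    "Homo sapiens": "Mammalia",
    "Mus musculus": "Mammalia",
    "Mammalia": "Tetrapoda",
    "Tetrapoda": "Sarcopterygii",
    "Sarcopterygii": "Vertebrata",
}


def belongs_to(taxon, clade):
    """Return True if `taxon` is `clade` itself or one of its descendants."""
    while taxon is not None:
        if taxon == clade:
            return True
        taxon = PARENT.get(taxon)
    return False


def pair_tests_not(species_a, species_b, clade_pair):
    """Return True if the two species fall on opposite sides of the pair."""
    left, right = clade_pair
    return (belongs_to(species_a, left) and belongs_to(species_b, right)) or (
        belongs_to(species_a, right) and belongs_to(species_b, left)
    )


AUTOPOD_PAIR = ("Coelacanthimorpha", "Tetrapoda")
print(pair_tests_not("Latimeria chalumnae", "Mus musculus", AUTOPOD_PAIR))
print(pair_tests_not("Mus musculus", "Homo sapiens", AUTOPOD_PAIR))
```

With this toy data the coelacanth-mouse comparison qualifies as a test of the NOT, while mouse-human does not, since both species sit inside Tetrapoda.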

fbastian commented 9 years ago

cc @ANiknejad

> I think all NOTs should be pairs of taxa.

That would be incorrect. Consider for instance the tapetum lucidum, or the nictitating membrane, which appeared independently in more than two lineages.

> The crucial information here is the taxon of the studied organism, Ciona, vs (presumably) vertebrates

If you want to determine in which taxa a structure independently appeared (and so, between which taxa the homology hypothesis is rejected), as you said you must look at the positive annotations mapped to a sub-taxon of the taxon of the NOT annotation. There can be more than two.
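As a rough illustration of this lookup, the sketch below collects the taxa of positive annotations that sit strictly below the taxon of a NOT annotation. The taxonomy, the `LineageA`/`LineageB`/`LineageC` placeholders, "structure X", and the function names are all assumptions made up for the example; only the autopod entries mirror what is discussed in this thread.

```python
# Toy taxonomy, child -> parent (simplified; the Lineage* names are placeholders).
PARENT = {
    "Tetrapoda": "Sarcopterygii",
    "Sarcopterygii": "Vertebrata",
    "LineageA": "Vertebrata",
    "LineageB": "Vertebrata",
    "LineageC": "Vertebrata",
    "Vertebrata": "Chordata",
}

# Positive annotations as (structure, taxon) pairs. "structure X" is invented
# to show a case with more than two independent origins.
POSITIVES = [
    ("autopod region", "Tetrapoda"),
    ("structure X", "LineageA"),
    ("structure X", "LineageB"),
    ("structure X", "LineageC"),
]


def is_strict_descendant(taxon, ancestor):
    """Return True if `ancestor` lies strictly above `taxon` in PARENT."""
    taxon = PARENT.get(taxon)
    while taxon is not None:
        if taxon == ancestor:
            return True
        taxon = PARENT.get(taxon)
    return False


def independent_origins(structure, not_taxon, positives=POSITIVES):
    """Taxa with a positive annotation for `structure` below `not_taxon`."""
    return sorted(
        taxon
        for annotated, taxon in positives
        if annotated == structure and is_strict_descendant(taxon, not_taxon)
    )


print(independent_origins("autopod region", "Vertebrata"))
print(independent_origins("structure X", "Vertebrata"))
```

An empty result is also possible: a NOT annotation may have no positive annotation beneath it at all.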

It is true that in your autopod example, we only captured the positive annotation at the Tetrapoda level. This should be corrected, see issue #5.

> Summing these up to a taxon that subsumes the pair is not wrong, it's just less useful.

Well, that is the intent of the NOT annotation: to reject a hypothesis that could otherwise seem plausible. By naively looking at the phylogenetic distribution, you would infer a homology hypothesis for the taxon that subsumes all taxa with the structure.

> For example, if we have expression data about coelacanths and mammals we can test the hypothesis. But it would be wrong to test the hypothesis by comparing mouse and human.

When comparing species, you need:

- to retrieve their least common ancestor (LCA);
- to retrieve the annotations mapped to this LCA and to all of its ancestors, keeping only positive annotations;
- to filter annotations that contain the same structure several times, keeping only the one mapped to the most recent taxon (e.g., when comparing human-mouse, you want to retrieve the homology at the 'lung' level, not at the 'lung-swim bladder' level).
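The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the taxonomy is a simplified toy, the taxa assigned to the lung and lung-swim bladder annotations are placeholders rather than curated values, and `Annotation`, `lineage`, `least_common_ancestor`, and `homology_annotations` are hypothetical names, not part of any Bgee tool.

```python
from typing import NamedTuple

# Toy taxonomy, child -> parent (simplified; intermediate ranks omitted).
PARENT = {
    "Homo sapiens": "Mammalia",
    "Mus musculus": "Mammalia",
    "Mammalia": "Tetrapoda",
    "Tetrapoda": "Sarcopterygii",
    "Sarcopterygii": "Euteleostomi",
    "Danio rerio": "Actinopterygii",
    "Actinopterygii": "Euteleostomi",
    "Euteleostomi": "Vertebrata",
}


class Annotation(NamedTuple):
    structures: frozenset
    taxon: str
    negated: bool  # True for a NOT annotation


# Illustrative annotations; the taxon levels are assumptions for the example.
ANNOTATIONS = [
    Annotation(frozenset({"lung"}), "Sarcopterygii", False),
    Annotation(frozenset({"lung", "swim bladder"}), "Euteleostomi", False),
    Annotation(frozenset({"autopod region"}), "Tetrapoda", False),
    Annotation(frozenset({"autopod region"}), "Vertebrata", True),
]


def lineage(taxon):
    """Return the taxon followed by all of its ancestors, nearest first."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT.get(taxon)
    return path


def least_common_ancestor(a, b):
    """Step 1: the first taxon in a's lineage that also contains b."""
    ancestors_of_b = set(lineage(b))
    for taxon in lineage(a):
        if taxon in ancestors_of_b:
            return taxon
    raise ValueError(f"no common ancestor for {a!r} and {b!r}")


def homology_annotations(species_a, species_b, annotations=ANNOTATIONS):
    # Step 2: positive annotations mapped to the LCA or one of its ancestors.
    candidates = lineage(least_common_ancestor(species_a, species_b))
    rank = {taxon: i for i, taxon in enumerate(candidates)}
    positives = [a for a in annotations if not a.negated and a.taxon in rank]
    # Step 3: visit the most recent taxa first and skip any annotation that
    # repeats a structure already covered by a more recent one.
    positives.sort(key=lambda a: rank[a.taxon])
    kept, seen = [], set()
    for ann in positives:
        if seen.isdisjoint(ann.structures):
            kept.append(ann)
            seen |= ann.structures
    return kept


for ann in homology_annotations("Homo sapiens", "Mus musculus"):
    print(sorted(ann.structures), ann.taxon)
```

With this toy data, the human-mouse comparison keeps the 'lung' annotation and drops the 'lung-swim bladder' one, while a human-zebrafish comparison retrieves only the 'lung-swim bladder' annotation. The NOT annotation is never returned, because only positive annotations are retained in step 2.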

> A compromise would be to narrow the bracket. E.g. for the first example, using Sarcopterygii would be better

Fixing issue #5 will solve this problem.

So, I think the NOT annotations are correctly designed, unless you see another problem. Maybe I could generate some derived files that could help you?

fbastian commented 9 years ago

I also realize that sometimes there is no sub-taxon positive annotation. For instance, a structure is originally thought to have originated in Vertebrata, but is then shown to have originated in Tetrapoda: we will add a NOT annotation at the Vertebrata level. There is no pair of taxa, nor any sub-taxon, to consider.

There can be several sub-taxon annotations only in cases of independent evolution. Otherwise, the NOT annotation is only used to capture the rejection of a previous hypothesis.

I think you're using NOT annotations in a way they're not meant to be used. Isn't it simply the common ancestral taxon/taxa of each structure that you want to retrieve?

BgeeDB / anatomical-similarity-annotations

Adding more information to NOT annotations #4