Document modeling patterns for species-generic proteins

cmungall commented 4 years ago

(quick notes from Translator E/P call)

Ideally all assertions would be made on species-specific (ss) proteins. However, this is not always possible, e.g. with NLP. Here PRO classes are used that are superclasses of ss classes (which are in turn mapped to UniProt GCRPs).

What should the modeling patterns be when assertions can only be made at the sn (species-neutral) level?

1. Simply treat these sn proteins as type bl:Protein, and use bl:subClassOf to link between them.
2. Have a new class such as bl:ProteinGrouping, and keep the same relation.

Option 2 gives us a more explicit indication that this is something different from a UniProt-level entity.
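To make the difference between the two options concrete, here is a minimal Python sketch. The node identifiers are placeholders, bl:ProteinGrouping is only the class proposed above (not an existing one), and the dict layout is an assumption for illustration rather than any Translator format.

```python
# Placeholder identifiers, not real PRO classes.
SN = "PR:sn_example"      # species-neutral class
SS = "PR:human_example"   # species-specific class

def build_graph(option: int) -> dict:
    """Return nodes and edges for modeling option 1 or option 2."""
    if option not in (1, 2):
        raise ValueError("option must be 1 or 2")
    # Option 1 types the sn node as a plain protein; option 2 uses the
    # proposed (hypothetical) grouping class instead.
    sn_category = "bl:Protein" if option == 1 else "bl:ProteinGrouping"
    nodes = {
        SN: {"category": sn_category},
        SS: {"category": "bl:Protein"},
    }
    # The linking relation is identical under both options.
    edges = [(SS, "bl:subClassOf", SN)]
    return {"nodes": nodes, "edges": edges}

def species_neutral_nodes(graph: dict) -> list:
    """Find sn nodes using the node category alone."""
    return [node for node, props in graph["nodes"].items()
            if props["category"] == "bl:ProteinGrouping"]
```

Under option 1 a consumer cannot pick out the sn node from its category and would need some extra tag or property; under option 2 the category is enough.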

Aside: Readers familiar with OBO will note that these are all proteins, but recall that in bl, human ssh etc are instances not classes, so having ProteinGrouping as a class is valid. We have a precedent in isoforms #230. In OBO terms this can be thought of as metaclasses, which are modeled using subset tags in PRO.

Regardless of 1 vs 2, we need to think about how this would fit in with a more explicit representation of protein phylogeny, e.g. if we were to import PANTHER trees, Ensembl Compara, etc. It could get very confusing having an entity for both species-neutral integrin and ancestral integrin. They would have the same 'children', just by a different relationship type.

IMHO the phylogenetic view is more useful and biologically valid for inference etc, but going back to the NLP use case it would be

Additionally, is there an expectation of horizontal homology relationships being asserted as well as vertical subClassOf/descended-from? I would say yes. Again, this is more straightforward from a phylogenetic perspective.

Or do we expect that consumers of KGs have essentially two modes of inference: one going up and down subClassOf hierarchies (what are the rules?) vs the more traditional and evolutionarily justified propagation over orthology?

cartmanbeck commented 4 years ago

There are a few different things going on here that I feel are important to point out. First off, I would definitely be interested in learning more about why NLP algorithms are unable to assign species to protein names when they're found in text... if nothing else, there are very specific gene and protein name conventions that could be used to partially assign... for example, if a gene is in italics with the first letter capitalized, it's from a mouse... whereas if it's in all caps in italics, it's from a human OR other higher-order species like primates. However, if we assume that assignments of species like those above are not easy or even impossible at the NLP level, are there things that we could do to post-compute them? Some sort of cross-referencing? I would be VERY interested in expanding on this concept.

LEHunter commented 4 years ago

The text mining KP simply cannot resolve the species for all protein references in text; authors are often ambiguous, even intentionally. The non-species-specific Protein Ontology terms we use in these cases are defined to be homologous families, not an ancestral protein (see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5393632/ "Family: refers to the class of proteins translated from a specific set of ancestrally related genes. Proteins in this class can be traced back to a common ancestor showing homology over the entire length of the protein."), so of the above options, #2 is most closely related to the PR approach.

cmungall commented 4 years ago

Option 1 is also consistent with PRO, with the addition of some kind of tag that corresponds to the PRO level/subset (https://proconsortium.org/PRO_QA.pdf). In #230, when discussing protein isoforms, @cbizon suggested just using protein and dealing with the level distinction using a node property.

I favor 2, but have not yet fully articulated the arguments

cmungall commented 4 years ago

I should also note that it's hard to discuss the various issues here without an agreement on semantics.

For example, it may seem that abstracting a statement to a sn level (e.g. A interacts with B) is making a weaker statement, but this depends on a particular interpretation. But if we have humanA subClassOf A, then someone might be justified in inferring that humanA interacts with some B (consistent with an interpretation of A interacts-with B as a TBox axiom A SubClassOf interacts-with some B).

This is covered in #230; TL;DR we should interpret an unadorned edge about two classes (e.g. PRO classes) as having some-some interpretation in owlstar, and if someone wants to make a stronger statement they explicitly add edge properties. This means that Larry's abstraction to sn classes is safe and valid.
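The difference between the two readings can be illustrated with a toy, set-based sketch in Python. This is not owlstar itself; the instance names and the `some_some` / `all_some` helpers are assumptions made up for the example.

```python
# Toy world: each class maps to its instances (a class includes the
# instances of its subclasses). All names are hypothetical.
instances = {
    "A": {"humanA_1", "mouseA_1"},
    "humanA": {"humanA_1"},
    "mouseA": {"mouseA_1"},
    "B": {"b_1"},
}
# The only observed interaction involves the mouse protein.
interacts = {("mouseA_1", "b_1")}

def some_some(subj: str, obj: str) -> bool:
    """Some instance of subj interacts with some instance of obj."""
    return any((s, o) in interacts
               for s in instances[subj] for o in instances[obj])

def all_some(subj: str, obj: str) -> bool:
    """Every instance of subj interacts with some instance of obj."""
    return all(any((s, o) in interacts for o in instances[obj])
               for s in instances[subj])
```

In this toy world `some_some("A", "B")` holds while `all_some("A", "B")` does not, and nothing follows about humanA. That is why an unadorned sn-level edge read as some-some does not overcommit.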

LEHunter commented 4 years ago

I agree @cartmanbeck, species is often going to be a post-NLP inference. The primary reason that NLP can't resolve the species is that language is often ambiguous, even taking into account naming conventions (which are not always followed!). There are also ambiguities (in texts) about isoforms, PTMs, etc. NLP identifies all the species that are mentioned in a document, so some kind of disjunctive restriction (e.g. this protein is either the human or the mouse one) could fairly straightforwardly be inferred.
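A minimal Python sketch of such a disjunctive restriction follows. The PRO identifiers and the `SPECIES_SPECIFIC` lookup table are hypothetical placeholders; only the NCBITaxon IDs are real (9606 = Homo sapiens, 10090 = Mus musculus, 7955 = Danio rerio).

```python
# Hypothetical lookup: species-neutral class -> {taxon: species-specific class}.
SPECIES_SPECIFIC = {
    "PR:sn_p53": {
        "NCBITaxon:9606": "PR:human_p53",
        "NCBITaxon:10090": "PR:mouse_p53",
        "NCBITaxon:7955": "PR:zebrafish_p53",
    },
}

def candidate_classes(sn_class: str, document_taxa: set) -> list:
    """Return the disjunction of species-specific classes whose taxon
    is mentioned somewhere in the document."""
    by_taxon = SPECIES_SPECIFIC.get(sn_class, {})
    return sorted(cls for taxon, cls in by_taxon.items()
                  if taxon in document_taxa)
```

For a document mentioning only human and mouse, the result is the pair of human and mouse classes, read as "one of these", not as an assertion about both.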

Our NLP approach is to hew closely to what is expressed directly and locally in a text (because even that is very hard to get right!). While human beings would easily infer that a protein reference has to be from one of the species mentioned elsewhere in a document, we consider that beyond the scope (and ability) of NLP approaches. I also agree with @cmungall that what we are reporting is safe and valid. Further inference has to happen elsewhere.

cartmanbeck commented 4 years ago

I think we should have a discussion on how to do the post-NLP inferences sooner rather than later, then. It will make a huge difference to our users, I think.

cmungall commented 4 years ago

@LEHunter - sounds like we need an ARA for doing these kinds of inferences. This could support traditional inference based on phylogeny, as well as a kind of probabilistic ontological approach (e.g. pushing edges down from a generic superclass to a species-specific one, potentially aided by information from the NLP provider, e.g. PMID-mentions-taxon).
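One way an ARA might push an edge down is sketched below in Python. The uniform confidence split, the fallback to all children, and the output field names are assumptions chosen for illustration; no existing ARA is known to work this way.

```python
def push_down(edge: tuple, children: dict, mentioned_taxa: set) -> list:
    """Derive species-specific edges from a species-neutral one.

    edge: (subject, predicate, object) with a species-neutral subject.
    children: {taxon: species-specific subclass of the subject}.
    mentioned_taxa: taxa reported for the source document
        (e.g. from PMID-mentions-taxon).
    """
    subject, predicate, obj = edge
    supported = {t: c for t, c in children.items() if t in mentioned_taxa}
    # Assumption: if no child taxon is mentioned, fall back to all children.
    pool = supported or children
    # Assumption: split confidence uniformly across the candidates.
    weight = 1.0 / len(pool) if pool else 0.0
    return [
        {"subject": child, "predicate": predicate, "object": obj,
         "confidence": weight, "derived_from": edge}
        for _, child in sorted(pool.items())
    ]
```

Keeping `derived_from` on each derived edge lets a consumer trace every guess back to the safe species-neutral statement.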

Going beyond modeling and the scope of this repo, but in the interest of keeping discussions together, what do you think of an experiment of making potentially less accurate but more granular statements at the NLP provider level? The NLP provider is best placed to make the best inference here. Of course there will be false positives, but this is true for any mining of gene/protein names, no? If the edge has sufficient provenance/evidence/confidence attached, ARAs can handle this appropriately and ultimately yield more useful answers for Translator? Do we have any way of doing Translator-wide experiments like this?

LEHunter commented 4 years ago

I don't love the idea of making intentionally unsafe guesses as part of the NLP (and I am sure Mike Bada would hate it). The painful path that got us here is based on the fact that human annotators can't tell (or can't agree on) which species is correct. It's not just a matter of NLP being unable to do it, people can't do it. If annotators agree on the species of a protein mention, the NLP is already penalized for failing to capture that. It's only in the case of genuinely ambiguous mentions that we intend to return non-specific classes. The high proportion of protein mentions in texts that are ambiguous (to people!) was initially surprising to me.

I am fine with an ARA making guesses. A good heuristic might be: for an ambiguous protein mention in an article that mentions two species, human and some other, the protein is probably from the non-human species. :-)

cartmanbeck commented 4 years ago

@LEHunter I think an important question for me here is: how many of these assertions are actually species-ambiguous? Is it 25%? 50%? or is it so rampant that the core assumption is that any reference to a protein name is species-ambiguous?

bill-baumgartner commented 4 years ago

@cartmanbeck I think that's a great question. We'll have to look into the answer however. We don't specifically target the unambiguous mentions currently, but I think we can come up with a high-precision pipeline that starts to identify the protein mentions that are clearly species-specific, e.g. "human p53".

bill-baumgartner commented 4 years ago

Looking at Protein Ontology concept annotations in the CRAFT Corpus we see the following distribution:

species-specific: 805 (3.4%)
NOT species-specific: 22,642 (96.6%)

mikebada commented 4 years ago

In the CRAFT Corpus, which is the basis for our training, nearly all of the gene/gene-product (GGP) mentions are annotated with species-nonspecific PRO classes. (The one minor exception is tokens incorporating the species as attached prefixes, e.g., hABC, where h = human, as we generally don't break up tokens.) This is definitely purposeful. Actually, for GGP annotation in CRAFT we initially used a species-specific vocabulary (Entrez Gene), which was really difficult for the human annotators (including me), which is why we switched to using the species-nonspecific PRO classes almost exclusively for GGP mentions.

@cartmanbeck This sounds like a riddle, but the proportion of GGP mentions that are species-ambiguous is itself ambiguous; that is, in our experience it often wasn't clear if a given mention is entirely unambiguous, and one person may consider the mention ambiguous and another not. So, the easy solution is to just always use the species-nonspecific form for the GGP itself.

However, for mentions such as "human p53", we would annotate "p53" with the species-nonspecific PRO class and "human" with NCBITaxon:'Homo sapiens'. These two concept annotations would then be relationally linked. (We are currently creating gold-standard CRAFT assertion annotations, and we'll train our systems on these.) So, in cases like these, we could compose the species-specific, or at least more generally taxon-specific, form of the GGP.
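The composition step could look roughly like the following Python sketch. The `Annotation` type, the `in_taxon` key, and the PRO identifier are hypothetical; NCBITaxon:9606 is Homo sapiens.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    text: str      # the annotated span, e.g. "p53"
    concept: str   # the ontology class assigned to it

def compose(ggp: Annotation, taxon: Annotation) -> dict:
    """Compose a taxon-specific form from two relationally linked
    concept annotations, without minting a new identifier."""
    return {
        "base": ggp.concept,         # species-nonspecific PRO class
        "in_taxon": taxon.concept,   # NCBITaxon class linked to it
        "label": f"{taxon.text} {ggp.text}",
    }

human_p53 = compose(
    Annotation("p53", "PR:sn_p53"),          # placeholder PRO ID
    Annotation("human", "NCBITaxon:9606"),
)
```

The composed form stays a pairing of the two original annotations, so the species-nonspecific annotation remains intact if the link later turns out to be wrong.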

mikebada commented 4 years ago

I forgot that the one other type of mention for which species-specific PRO classes are used are those for which the GGP name is specific to a species or relatively specific taxon, e.g., doublesex (fruit fly), Flp (yeast), BamHI (B. amyloliquefaciens) (all real CRAFT examples).

mikebada commented 4 years ago

Finally, the CRAFT GGP annotations are not only almost entirely species-nonspecific but also entirely nonspecific as to gene/transcript/protein. For CRAFT we've actually created a parallel hierarchy of the PRO classes to represent corresponding classes that are GGP-nonspecific. @cmungall So, I think it'd be a good idea for there to be some kind of class/construct in Biolink that would allow us to easily create a GGP-nonspecific class defined in terms of an arbitrary PRO class.

bill-baumgartner commented 4 years ago

To follow up on @mikebada's comments regarding the CRAFT annotation guidelines, it looks like there are 604 PRO annotations in CRAFT that have an immediately preceding NCBITaxon annotation, e.g. human CLN2.

mikebada commented 4 years ago

@bill-baumgartner Is this from checking via the assertion annotations? There could be more that would be relationally linked but not directly preceding, e.g., CLN2 from control mice. Additionally, there are very likely even more that would be reasonably safe to transitively infer from two or more of these positional/derivational relations, e.g., CLN2 from mouse livers, via CLN2 -> liver -> mouse.
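The transitive case can be sketched in Python as a bounded walk over the relational links. The `links` and `taxon_of` structures and the hop limit are assumptions for illustration, not the format of the CRAFT assertion annotations; NCBITaxon:10088 is the genus Mus.

```python
def infer_taxon(mention: str, links: dict, taxon_of: dict, max_hops: int = 3):
    """Follow positional/derivational links from a GGP mention until a
    taxon-annotated mention is reached, or give up after max_hops."""
    current = mention
    seen = set()
    for _ in range(max_hops + 1):
        if current in taxon_of:
            return taxon_of[current]
        if current in seen or current not in links:
            return None  # cycle or dead end
        seen.add(current)
        current = links[current]
    return None  # taxon is further away than max_hops

# "CLN2 from mouse livers": CLN2 -> livers -> mouse
links = {"CLN2": "livers", "livers": "mouse"}
taxon_of = {"mouse": "NCBITaxon:10088"}
```

Capping the number of hops is one simple way to keep such inferences "reasonably safe": each extra link adds another chance for the chain to be wrong.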

cmungall commented 4 years ago

@mikebada:

Finally, the CRAFT GGP annotations are not only almost entirely species-nonspecific but also entirely nonspecific as to gene/transcript/protein. For CRAFT we've actually created a parallel hierarchy of the PRO classes to represent corresponding classes that are GGP-nonspecific. @cmungall So, I think it'd be a good idea for there to be some kind of class/construct in Biolink that would allow us to easily create a GGP-nonspecific class defined in terms of an arbitrary PRO class.

We allow you to conflate GP but not T:

https://biolink.github.io/biolink-model/docs/GeneOrGeneProduct

An ongoing theme in multiple projects is whether we want to have a stricter ontological separation between Gs and Ps, with edges at the appropriate place (e.g. physical interactions between Ps and genetic interactions between Gs) vs deliberate conflation, with IDs potentially interpreted as proxies.

Your approach of creating IDs that explicitly represent the union is formally correct, and seems analogous to Stefan Schulz's proposal for disease representation: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3102895/

I assume you don't intend to expose your own union IDs and instead are using the PROs as proxies (even where the statement in the paper can only refer to genes). This seems practical as introducing a new set of IDs and using UnionOf semantics would likely confuse a lot of people, but it seems as a community we need a more satisfactory approach.

One thing we can explore is allowing post-composition of this kind of conflation. E.g. have an edge property that says we are hedging that the edge may pertain to the explicit subject/object, OR the subject/object may act as a 1:1 cognate over a particular predicate e.g. encoded_by.

E.g. in RDF*

A type Protein
B type Protein
<<A :interacts-with B>> os:subject_and_object_potentially_conflates_over bl:encoded-by

This seems abstruse but a key point is this additional edge property could be dropped and the graph would look exactly the same way all KGs currently look, where the proxying is ad-hoc and implicit rather than explicit.
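A small Python sketch of that point: the hedging lives in one edge property, and removing it leaves the ordinary triple untouched. The dict layout, the `encoded_by` lookup, and the choice to read the hedge as "either the explicit pair or the cognate pair" are assumptions made for illustration.

```python
CONFLATION_KEY = "os:subject_and_object_potentially_conflates_over"

edge = {
    "subject": "A",
    "predicate": "interacts-with",
    "object": "B",
    "properties": {CONFLATION_KEY: "bl:encoded-by"},
}

def core_triple(e: dict) -> tuple:
    """The plain triple, as every KG already represents it."""
    return (e["subject"], e["predicate"], e["object"])

def strip_conflation(e: dict) -> dict:
    """Drop the hedging property; nothing else changes."""
    props = {k: v for k, v in e["properties"].items() if k != CONFLATION_KEY}
    return {**e, "properties": props}

def possible_readings(e: dict, encoded_by: dict) -> list:
    """List the subject/object pairs the edge may pertain to."""
    explicit = (e["subject"], e["object"])
    if e["properties"].get(CONFLATION_KEY) != "bl:encoded-by":
        return [explicit]
    # Assumed reading: either the explicit pair or the 1:1 cognates.
    cognate = (encoded_by.get(e["subject"], e["subject"]),
               encoded_by.get(e["object"], e["object"]))
    return [explicit, cognate]

encoded_by = {"A": "geneA", "B": "geneB"}  # hypothetical 1:1 mapping
```

With the property present the edge has two readings (protein pair or gene pair); with it stripped, only the explicit pair remains and the triple is identical.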

A minor point:

forgot that the one other type of mention for which species-specific PRO classes are used are those for which the GGP name is specific to a species or relatively specific taxon, e.g., doublesex (fruit fly), Flp (yeast), BamHI (B. amyloliquefaciens) (all real CRAFT examples).

Are those names truly specific to individual species? When you say yeast, do you mean Saccharomyces cerevisiae and not Schizosaccharomyces pombe (both have flippases AFAICT)? Of course this is often more obvious in papers where authors are not incentivized to conflate two Dipterans or Fungi in the same way they are incentivized to make their mouse paper seem human relevant. Still, your 96.6% seems highly conservative.

mikebada commented 4 years ago

@cmungall:

We allow you to conflate GP but not T: https://biolink.github.io/biolink-model/docs/GeneOrGeneProduct

It looks like GeneProduct includes transcripts (which I think is correct):

GeneProduct - The functional molecular product of a single gene. Gene products are either proteins or functional RNA molecules

So it looks like we'd just need a way to dynamically define a corresponding GeneOrGeneProduct class defined in terms of a given PRO class. For Translator, we're currently just using the PRO classes directly, at least for now, and trying to communicate that these are proxies for the proteins or their corresponding genes or transcripts.

However, I think this is only part of a bigger issue. For example, we've also created a parallel hierarchy for all the ChEBI roles and another for all the GO MFs to represent the material bearers rather than roles and functions, respectively. (I'm not sure if we're currently using those or just the base ChEBI/GO_MF classes for Translator, though.) To facilitate annotation we also have a bunch of extension classes that are various kinds of conflations of concepts within ontologies and a bunch more that are conflations and/or semantic unifications of concepts across ontologies, but we can discuss those at some later time.

Regarding the species-specific GGP annotation examples above, we used those because they exist as species-specific PRO classes for which more generic species-nonspecific classes didn't exist (and I thought they were contextually correct). I should point out that with regard to annotation of organisms (with the NCBI Taxonomy), I've been very conservative. For example, we always use Mus (NCBITaxon:10088) to annotate mentions of "mouse" rather than Mus musculus (NCBITaxon:10090); analogously, the exact NCBITaxon:7227 (Drosophila melanogaster) is only used to mark up explicit mentions of the species, while "fruit fly" is annotated with NCBITaxon:7215 (Drosophila = fruit flies) and "fly" with NCBITaxon:7147 (Diptera = flies).

As for the 96.6% species-nonspecific GGP annotation proportion, keep in mind that these refer to the direct annotations of the GGP mentions; that is, it doesn't make use of context, even in cases such as "human p53", for which "human" and "p53" are separately marked up with an NCBITaxon class and a species-nonspecific PRO class, respectively. However, as mentioned previously, once we finish the manual assertion annotations and train on them, we should be able to compose these to deduce or abduce more taxonomically specific forms.

sierra-moxon commented 3 years ago

@sierra-moxon will see if there are any tickets to be made from this apart from the flexible-conflation discussion and close this discussion ticket in favor of that work.

sierra-moxon commented 2 years ago

I think our decision so far has been to conflate reference proteins with genes in Translator (and this is implemented in the node normalizer with a switch allowing for conflation or not). I'm going to close this for now as "done", but please reopen if it needs further work with regard to PRO assignments, etc., for text mining.

biolink / biolink-model

Document modeling patterns for species-generic proteins #458