geneontology / go-shapes

Schema for Gene Ontology Causal Activity Models defined using RDF Shapes

Gocamgen model with has_input should pass #148

Closed dustine32 closed 3 years ago

dustine32 commented 4 years ago

In branch: https://github.com/geneontology/go-shapes/tree/dustine32-test-has_input

I have this gocamgen-generated test TTL file with a single assertion individual. This model fails both the Java and Python validators despite appearing to follow the ShEx spec:

Protein <-enabled_by- MolecularFunction -has_input-> MolecularEntity

Here's the python validator output:

File: ../test_ttl/go_cams/should_pass/WB_WBGene00000903_partial.ttl Success: False PASS: 4 FAIL: 1
  FAIL: http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408 SHAPE: http://purl.obolibrary.org/obo/go/shapes/MolecularFunction REASON:   Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> against shape N1b900799bad94646b0f0bdc18dea6b82
    Triples:
      <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> rdf:type obo:GO_0005160 .
      <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> rdf:type owl:NamedIndividual .
   2 triples exceeds max {1,1}
  Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> against shape N1b900799bad94646b0f0bdc18dea6b82
    Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa> against shape N6baf29f3024240789b58b9a33f3380f4
      Triples:
      <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa> rdf:type UniProtKB:P04202 .
      <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa> rdf:type owl:NamedIndividual .
   2 triples exceeds max {1,1}
  Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> against shape N1b900799bad94646b0f0bdc18dea6b82
    Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa> against shape N6baf29f3024240789b58b9a33f3380f4
      Testing UniProtKB:P04202 against shape http://purl.obolibrary.org/obo/go/shapes/OwlClass
           No matching triples found for predicate rdf:type
  Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> against shape N1b900799bad94646b0f0bdc18dea6b82
    Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa> against shape N6baf29f3024240789b58b9a33f3380f4
      Triples:
      <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa> rdf:type UniProtKB:P04202 .
      <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa> rdf:type owl:NamedIndividual .
   2 triples exceeds max {1,1}
  Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> against shape N1b900799bad94646b0f0bdc18dea6b82
    Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa> against shape N6baf29f3024240789b58b9a33f3380f4
         No matching triples found for predicate rdf:type
  Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> against shape N1b900799bad94646b0f0bdc18dea6b82
    Node kind mismatch have: URIRef expected: bnode
  Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> against shape N1b900799bad94646b0f0bdc18dea6b82
    Triples:
      <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> rdf:type obo:GO_0005160 .
      <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> rdf:type owl:NamedIndividual .
   2 triples exceeds max {1,1}
  Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> against shape N1b900799bad94646b0f0bdc18dea6b82
       No matching triples found for predicate rdf:type
Final report >> all files successful: False

What's strange here is that I get a "2 triples exceeds max {1,1}" cardinality violation for predicate rdf:type, even though I'm using this predicate for something that seems fundamental to our models: "X is an Individual" and "X is of class Y".

Running the validator against the rest of the WB:WBGene00000903 model with this assertion individual removed (so that only simple GP->term assertions remain), I get a PASS result. So I think this would indicate that my general OWL syntax in these models is OK; I'm guessing it's this has_input relation that's causing problems.

@balhoff @goodb Are you able to spot anything here that I can change to get it to pass?

Thanks!

goodb commented 4 years ago

The problem with this one was that UniProtKB:P04202 doesn't appear to exist in neo in rdf.geneontology.org. Hence the constraint: has_input: @ *;

is violated by obo:RO_0002233 http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa ;

and

http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa a UniProtKB:P04202, owl:NamedIndividual .

Changing to a different identifier for the protein, which neo knows, allows the file to validate.

Possibly relevant to https://github.com/geneontology/neo/issues/24
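
The check described above can be approximated outside the validators. The following is a minimal sketch, not the go-shapes implementation: it assumes triples are available as plain (subject, predicate, object) tuples, uses shortened hypothetical node names (model:09f78549, model:0b95360f) in place of the full model IRIs, and treats known_classes as a stand-in for whatever set of class identifiers NEO actually provides.

```python
RDF_TYPE = "rdf:type"
OWL_NAMED_INDIVIDUAL = "owl:NamedIndividual"


def unknown_types(triples, known_classes):
    """Map each individual to the asserted classes that are not in known_classes."""
    missing = {}
    for subject, predicate, obj in triples:
        # owl:NamedIndividual is a declaration, not a class that needs a NEO lookup.
        if predicate != RDF_TYPE or obj == OWL_NAMED_INDIVIDUAL:
            continue
        if obj not in known_classes:
            missing.setdefault(subject, []).append(obj)
    return missing


# Hypothetical, shortened version of the failing assertion individual.
triples = [
    ("model:09f78549", RDF_TYPE, "obo:GO_0005160"),
    ("model:09f78549", RDF_TYPE, OWL_NAMED_INDIVIDUAL),
    ("model:09f78549", "obo:RO_0002233", "model:0b95360f"),  # has_input
    ("model:0b95360f", RDF_TYPE, "UniProtKB:P04202"),
    ("model:0b95360f", RDF_TYPE, OWL_NAMED_INDIVIDUAL),
]

# Assumption: a tiny stand-in for the classes known to NEO / the ontology.
known_classes = {"obo:GO_0005160"}

problems = unknown_types(triples, known_classes)
print(problems)
```

Running a pre-check like this before validation would point directly at the has_input target whose class is unknown, rather than leaving it to be inferred from the nested cardinality messages.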

dustine32 commented 4 years ago

@vanaukenk @ukemi This is the example line containing the "invalid" UniProtKB:

WB  WBGene00000903  enables GO:0005160  PMID:8910282|WB_REF:WBPaper00002600 ECO:0000250 UniProtKB:P04202    20141020    WB

Something that seems odd to me is that this UniProtKB:P04202 is a mouse protein, though this may be a legit example of a multi-species annotation? It is to a binding descendant term.
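
To make the fields of that annotation line explicit, here is a rough sketch that splits it and reports with/from identifiers that are not in a known set. The field names are assumptions chosen for readability, the line is split on whitespace because the tabs (and any empty columns) did not survive in the text above, and known_ids is a hypothetical stand-in for the identifiers NEO knows.

```python
# Assumed names for the nine whitespace-separated tokens in the line above.
FIELDS = [
    "db", "db_object_id", "qualifier", "go_id", "references",
    "evidence", "with_from", "date", "assigned_by",
]
PIPE_SEPARATED = {"references", "with_from"}

LINE = (
    "WB  WBGene00000903  enables GO:0005160  "
    "PMID:8910282|WB_REF:WBPaper00002600 ECO:0000250 "
    "UniProtKB:P04202    20141020    WB"
)


def parse_annotation_line(line):
    """Split one annotation line into a dict keyed by the assumed field names."""
    tokens = line.split()
    if len(tokens) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(tokens)}")
    record = dict(zip(FIELDS, tokens))
    for name in PIPE_SEPARATED:
        record[name] = record[name].split("|")
    return record


def unknown_with_from(record, known_ids):
    """Return with/from identifiers that are absent from known_ids."""
    return [ident for ident in record["with_from"] if ident not in known_ids]


record = parse_annotation_line(LINE)
known_ids = {"MGI:98725"}  # hypothetical stand-in for identifiers known to NEO
print(unknown_with_from(record, known_ids))
```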

Regardless of the differing species issue, as @goodb suggests, we can fix most of these errors by switching to using MOD identifiers known to NEO in the with/from and extensions columns. Some ways to actually move forward:

  1. "Fix" the with/from, extensions UniProt usage upstream in the MOD GPADs by switching to MOD identifiers or other identifiers known to NEO.
  2. Load all of SwissProt into NEO

@vanaukenk @ukemi What do you think about this?

There's also the issue of variable prefix usage (http://identifiers.org/wormbase/WBGene00000903 [how go_context.jsonld and thus gocamgen resolves WB:] vs http://identifiers.org/WB:WBGene00000903 [resolvable by identifiers.org]), which could also bring this error back. Since the GPADs use CURIEs, though, I think this is a gocamgen->NEO problem.
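
The prefix mismatch can be illustrated with a small sketch. The two prefix maps below are assumptions built only from the two IRIs quoted above; they are not copies of go_context.jsonld or of any identifiers.org configuration, and expand_curie is a hypothetical helper.

```python
def expand_curie(curie, prefix_map):
    """Expand 'PREFIX:LOCAL' to an IRI using prefix_map; raise if the prefix is unknown."""
    prefix, _, local_id = curie.partition(":")
    if not local_id or prefix not in prefix_map:
        raise KeyError(f"cannot expand {curie!r}")
    return prefix_map[prefix] + local_id


# Assumed expansion matching the first IRI quoted above (the gocamgen side).
context_style = {"WB": "http://identifiers.org/wormbase/"}
# Assumed expansion matching the second IRI quoted above (resolvable form).
resolvable_style = {"WB": "http://identifiers.org/WB:"}

curie = "WB:WBGene00000903"
iri_a = expand_curie(curie, context_style)
iri_b = expand_curie(curie, resolvable_style)

# The same CURIE yields two different IRIs, so a lookup keyed on one form
# will not find an entity stored under the other.
print(iri_a)
print(iri_b)
print(iri_a == iri_b)
```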

ukemi commented 4 years ago

The evidence for this annotation is sequence similarity to the mouse protein. Normally we don't make binding annotations by ISS, but it seems reasonable in this case because the worm gene is a homolog of the mouse gene. It would be inferred that the worm gene might have a similar function. I'm not sure what to make about the prefix issue.

@vanaukenk should have a look when she gets back tomorrow.

ukemi commented 4 years ago

As for switching to identifiers known to NEO in the with/from and extensions columns: I am always in favor of verifiable entities in any field. However, I'm not sure that MOD gene identifiers will suffice. NEO contains more than just MOD gene identifiers. The GPI files also contain entities that are cross-referenced to MOD gene identifiers. I'd be more in favor of option 1 above than option 2. If we load all of SwissProt, then I suspect there will be a lot of orphan identifiers that will not associate with any MOD gene, for a number of reasons. This will open up even more room for curation errors. We can discuss this on a call.

ukemi commented 4 years ago

A separate issue is why P04202 is not known to NEO. When I search MGI, it is an xref to MGI:98725 (Tgfb1):

MGI MGI:98725 Tgfb1 transforming growth factor, beta 1 TGF-beta 1|Tgfb|Tgfb-1|TGFbeta1|TGF-beta1 gene taxon:10090 UniProtKB:P04202

Shouldn't the UniProt xrefs be loaded into NEO?

ukemi commented 4 years ago

Perhaps the issue is that only column 2 is loaded into NEO. If that is the case, we have this identifier as a PRO identifier, but not as a UniProtKB identifier. We could solve this problem by adding all of the UniProtKB identifiers associated with MGI genes as entries in column 2. @mdolanme?

ukemi commented 4 years ago

But this is an interesting catch-22. We want people to annotate to mouse genes using the MGI identifier rather than a UniProtKB gene-centric representative protein. The exception to this is when we annotate using proteoforms. In those cases we use PRO. Perhaps it would be better if somehow the tools recognized the xrefs in the GPI file as stand-ins for the gene identifier?

mdolanme commented 4 years ago

"We could solve this problem by adding all of the UniProtKB identifiers associated with MGI genes as entries in column 2." Do you mean duplicate the UniProt ids in column 2 with UniProtKB in column 1? We could run it by Lori.

goodb commented 4 years ago

@dustine32 is this still a problem for you? It looks like that uniprot id is still not a part of our universe: e.g. this is a 404 http://noctua-amigo.berkeleybop.org/amigo/term/UniProtKB:P04202

vanaukenk commented 3 years ago

@ukemi @dustine32

Can we close this ticket?

ukemi commented 3 years ago

Good question! The only mouse identifiers that are valid are: MGI identifiers for genes, Protein Ontology identifiers for proteins, and RNA sequence identifiers for RNAs. If we want curators to be able to enter the UniProt IDs, then they need to be translated to either an MGI identifier or a Protein Ontology identifier. This could be done using the xref column in the GPI file and picking a default as to whether we want them to correspond to a gene or a protein. I think a better approach would be to add an autocomplete restriction to any field that requires a valid entity from NEO, as is done in the "Add individual" or 'enabled_by' data entry fields. We should teach curators not to enter the prefix when adding entities. I like that when I enter P04202 into the field, I am given a choice of the gene, the parent protein or an isoform.
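
The xref-based translation could look roughly like the sketch below. It is only an illustration under stated assumptions: GPI rows are represented as simple dicts with hypothetical "id" and "xrefs" keys rather than parsed from a real GPI file, the single row is taken from the Tgfb1 entry quoted earlier in this thread, and the "default to the gene" choice is hard-coded.

```python
def build_xref_index(gpi_rows):
    """Map each cross-referenced identifier to the primary identifier of its GPI row."""
    index = {}
    for row in gpi_rows:
        for xref in row["xrefs"]:
            # First row wins if an xref appears more than once (assumed policy).
            index.setdefault(xref, row["id"])
    return index


def translate(identifier, known_ids, xref_index):
    """Return a valid identifier, translating via xrefs when needed; None if unresolved."""
    if identifier in known_ids:
        return identifier
    return xref_index.get(identifier)


# Hypothetical in-memory form of the Tgfb1 entry quoted earlier in the thread.
gpi_rows = [
    {"id": "MGI:98725", "symbol": "Tgfb1", "xrefs": ["UniProtKB:P04202"]},
]
known_ids = {row["id"] for row in gpi_rows}
xref_index = build_xref_index(gpi_rows)

print(translate("UniProtKB:P04202", known_ids, xref_index))
print(translate("MGI:98725", known_ids, xref_index))
print(translate("UniProtKB:Q00000", known_ids, xref_index))  # made-up ID, unresolved
```

An identifier that resolves to None here is the kind of entry an autocomplete restriction would prevent from being typed in at all.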

vanaukenk commented 3 years ago

Okay, I've actually updated the original C. elegans annotation that used a mouse UniProtKB accession in the With/From field to now use the MGI gene identifier.

There is still an issue of what to add in the 'has input' field, but I think the original issue here should now be dealt with.

We still want to do work on constraining the With/From field and eventually using an autocomplete for that, but that's an issue for the Noctua tracker.

Thx.

ukemi commented 3 years ago

I think the 'has input' field should be constrained to a valid ontology term. UniProt identifiers for mouse genes are not valid ontology terms, but they are 'synonyms'.

vanaukenk commented 3 years ago

@ukemi

This is a cross-species binding expt where I'll need to decide what the physiologically relevant C. elegans gene should be in the AE.

ukemi commented 3 years ago

Got it! I still think the rule should be that it has to be an ontology term though.