Closed by dustine32 3 years ago
The problem with this one was that UniProtKB:P04202 doesn't appear to exist in neo in rdf.geneontology.org. Hence the constraint:
has_input: @
is violated by obo:RO_0002233 http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa ;
and
http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa a UniProtKB:P04202, owl:NamedIndividual .
Changing to a different identifier for the protein, which neo knows, allows the file to validate.
Possibly relevant to https://github.com/geneontology/neo/issues/24
@vanaukenk @ukemi This is the example line containing the "invalid" UniProtKB:
WB WBGene00000903 enables GO:0005160 PMID:8910282|WB_REF:WBPaper00002600 ECO:0000250 UniProtKB:P04202 20141020 WB
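To make the failing field easier to spot, here is a minimal sketch that labels the whitespace-separated values of the line above and flags With/From identifiers missing from a set of known IDs. The field names, the `parse_flat_gpad` and `unknown_with_from` helpers, and the `known` set are hypothetical stand-ins; a real GPAD file is tab-separated and can carry empty columns, which this sketch does not handle.

```python
# Field names follow the order of values in the example line above.
# Assumption: no empty columns, values separated by whitespace.
FIELDS = [
    "db", "db_object_id", "qualifier", "go_id", "references",
    "evidence", "with_from", "date", "assigned_by",
]


def parse_flat_gpad(line):
    """Label the values of one flattened GPAD-style line."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}")
    record = dict(zip(FIELDS, values))
    record["references"] = record["references"].split("|")
    return record


def unknown_with_from(record, known_ids):
    """Return the With/From identifiers that are not in known_ids."""
    ids = record["with_from"].replace("|", ",").split(",")
    return [i for i in ids if i not in known_ids]


line = ("WB WBGene00000903 enables GO:0005160 "
        "PMID:8910282|WB_REF:WBPaper00002600 ECO:0000250 "
        "UniProtKB:P04202 20141020 WB")
record = parse_flat_gpad(line)

# Hypothetical stand-in for the set of identifiers NEO knows about.
known = {"MGI:98725"}
print(unknown_with_from(record, known))
```

Run against a real identifier list, a check like this could flag such lines before gocamgen turns them into models.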
Something that seems odd to me is that this UniProtKB:P04202 is a mouse protein, though this may be a legit example of a multi-species annotation? It is to a binding descendant term.
Regardless of the differing species issue, as @goodb suggests, we can fix most of these errors by switching to using MOD identifiers known to NEO in the with/from and extensions columns. Some ways to actually move forward:
@vanaukenk @ukemi What do you think about this?
There's also the issue of variable prefix usage (http://identifiers.org/wormbase/WBGene00000903 [how go_context.jsonld and thus gocamgen resolves WB:] vs http://identifiers.org/WB:WBGene00000903 [resolvable by identifiers.org]), which could also bring this error back. But since the GPADs use CURIEs, I think this is a gocamgen->NEO problem.
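A small sketch of why the prefix difference matters: the same CURIE expands to two different IRIs depending on which prefix map is used, so a lookup keyed on one form will miss the other. Both prefix maps and the `expand`/`contract` helpers are hypothetical; they only mirror the two URL forms quoted above and are not the actual contents of go_context.jsonld.

```python
# Hypothetical prefix maps mirroring the two IRI forms quoted above.
CONTEXT_STYLE = {"WB": "http://identifiers.org/wormbase/"}
IDENTIFIERS_ORG_STYLE = {"WB": "http://identifiers.org/WB:"}


def expand(curie, prefix_map):
    """Expand a CURIE such as 'WB:WBGene00000903' to a full IRI."""
    prefix, local_id = curie.split(":", 1)
    return prefix_map[prefix] + local_id


def contract(iri, prefix_map):
    """Turn an IRI back into a CURIE, or return None if no prefix matches."""
    for prefix, base in prefix_map.items():
        if iri.startswith(base):
            return f"{prefix}:{iri[len(base):]}"
    return None


curie = "WB:WBGene00000903"
iri_a = expand(curie, CONTEXT_STYLE)
iri_b = expand(curie, IDENTIFIERS_ORG_STYLE)

print(iri_a)
print(iri_b)
# An IRI produced with one map cannot be contracted with the other.
print(contract(iri_a, IDENTIFIERS_ORG_STYLE))
```

If the producer and the consumer disagree on the map, the entity looks unknown even though the CURIE is the same.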
The evidence for this annotation is sequence similarity to the mouse protein. Normally we don't make binding annotations by ISS, but it seems reasonable in this case because the worm gene is a homolog of the mouse gene, so it would be inferred that the worm gene might have a similar function. I'm not sure what to make of the prefix issue.
@vanaukenk should have a look when she gets back tomorrow.
As for switching to identifiers known to NEO in the with/from and extensions columns: I am always in favor of verifiable entities in any field. However, I'm not sure that MOD gene identifiers will suffice. Neo contains more than just MOD gene identifiers. The GPI files also contain entities that are cross-referenced to MOD gene identifiers. I'd be more in favor of option 1 above than option 2. If we load all of SwissProt, then I suspect there will be a lot of orphan identifiers that will not associate with any MOD gene, for a number of reasons. This will open up even more room for curation errors. We can discuss this on a call.
A separate issue is why P04202 is not known to NEO. When I search MGI it is an xref to MGI:98725 (Tgfb1). MGI MGI:98725 Tgfb1 transforming growth factor, beta 1 TGF-beta 1|Tgfb|Tgfb-1|TGFbeta1|TGF-beta1 gene taxon:10090 UniProtKB:P04202
Shouldn't the UniProt xrefs be loaded into NEO?
Perhaps the issue is that only column 2 is loaded into NEO. If that is the case, we have this identifier as a PRO identifier, but not as a UniProtKB identifier. We could solve this problem by adding all of the UniProtKB identifiers associated with MGI genes as entries in column 2. @mdolanme?
But this is an interesting catch-22. We want people to annotate to mouse genes using the MGI identifier rather than a UniProtKB gene-centric representative protein. The exception to this is when we annotate using proteoforms. In those cases we use PRO. Perhaps it would be better if somehow the tools recognized the xrefs in the GPI file as stand-ins for the gene identifier?
"We could solve this problem by adding all of the UniProtKB identifiers associated with MGI genes as entries in column 2." Do you mean duplicate the UniProt ids in column 2 with UniProtKB in column 1? We could run it by Lori.
@dustine32 is this still a problem for you? It looks like that uniprot id is still not a part of our universe: e.g. this is a 404 http://noctua-amigo.berkeleybop.org/amigo/term/UniProtKB:P04202
@ukemi @dustine32
Can we close this ticket?
Good question! The only mouse identifiers that are valid are: MGI identifiers for genes, Protein Ontology identifiers for proteins, and RNA sequence identifiers for RNAs. If we want curators to be able to enter the UniProt ids, then they need to be translated to either an MGI identifier or a Protein Ontology identifier. This could be done using the xref column in the GPI file and picking a default as to whether we want them to correspond to a gene or a protein. I think a better approach would be to add an autocomplete restriction to any field that requires a valid entity from NEO, as is done in the "Add individual" or 'enabled_by' data entry fields. We should teach curators not to enter the prefix when adding entities. I like that when I enter P04202 into the field, I am given a choice of the gene, the parent protein, or an isoform.
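As a rough illustration of the xref-translation idea, the sketch below builds an index from xrefs to primary identifiers and resolves an entered ID through it. The row structure, `build_xref_index`, and `resolve` are hypothetical; the single row reuses the MGI:98725 / UniProtKB:P04202 pairing quoted earlier in this thread, and real GPI parsing is left out.

```python
# Hypothetical, already-parsed GPI rows; only the fields needed here.
GPI_ROWS = [
    {"id": "MGI:98725", "symbol": "Tgfb1", "type": "gene",
     "xrefs": ["UniProtKB:P04202"]},
]


def build_xref_index(rows):
    """Map each xref to the list of primary identifiers that cite it."""
    index = {}
    for row in rows:
        for xref in row["xrefs"]:
            index.setdefault(xref, []).append(row["id"])
    return index


def resolve(identifier, primary_ids, xref_index):
    """Return a primary identifier, or None if unknown or ambiguous."""
    if identifier in primary_ids:
        return identifier
    candidates = xref_index.get(identifier, [])
    if len(candidates) == 1:
        return candidates[0]
    return None  # leave unknown or ambiguous cases for a curator


primary_ids = {row["id"] for row in GPI_ROWS}
xref_index = build_xref_index(GPI_ROWS)
print(resolve("UniProtKB:P04202", primary_ids, xref_index))
```

Returning None on ambiguity, rather than picking a default, keeps the gene-versus-protein decision with the curator, which is closer in spirit to the autocomplete approach.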
Okay, I've actually updated the original C. elegans annotation that used a mouse UniProtKB accession in the With/From field to now use the MGI gene identifier.
There is still an issue of what to add in the 'has input' field, but I think the original issue here should now be dealt with.
We still want to do work on constraining the With/From field and eventually using an autocomplete for that, but that's an issue for the Noctua tracker.
Thx.
I think the 'has input' field should be constrained to a valid ontology term. UniProt identifiers for mouse genes are not valid ontology terms, but they are 'synonyms'.
@ukemi
This is a cross-species binding expt where I'll need to decide what the physiologically relevant C. elegans gene should be in the AE.
Got it! I still think the rule should be that it has to be an ontology term though.
In branch: https://github.com/geneontology/go-shapes/tree/dustine32-test-has_input
I have this gocamgen-generated test TTL file with a single assertion individual:
This model fails both java and python validators despite appearing to follow the ShEx spec:
Here's the python validator output:
Strangely, I get a "2 triples exceeds max {1,1}" cardinality violation for predicate rdf:type when I'm using this predicate for something that seems so fundamental to our models: "X is an Individual" and "X is of class Y".
Running the validator against the rest of the WB:WBGene00000903 model with this assertion individual removed (so that only simple GP->term assertions remain), I get a PASS result. So I think this indicates that my general OWL syntax in these models is OK; I'm guessing it's this has_input relation that's causing problems.
@balhoff @goodb Are you able to spot anything here that I can change to get it to pass?
Thanks!
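For reference, a minimal sketch of where a count of 2 for rdf:type could come from. It only counts rdf:type triples per subject, using the two types shown for the has_input individual earlier in this thread. It is not the logic of either validator, and whether the shape is meant to ignore owl:NamedIndividual is an assumption; `count_types` is a hypothetical helper.

```python
from collections import Counter

RDF_TYPE = "rdf:type"
INDIVIDUAL = ("http://model.geneontology.org/"
              "0b6b7849-258c-4445-892e-e480610c63fd/"
              "0b95360f-8394-4a17-9687-004a234b64fa")

# The two type assertions quoted for this individual earlier in the thread.
triples = [
    (INDIVIDUAL, RDF_TYPE, "UniProtKB:P04202"),
    (INDIVIDUAL, RDF_TYPE, "owl:NamedIndividual"),
]


def count_types(triples, ignore=()):
    """Count rdf:type triples per subject, skipping any ignored classes."""
    counts = Counter()
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE and obj not in ignore:
            counts[subject] += 1
    return counts


# Counting every rdf:type triple gives 2 for the individual.
print(count_types(triples)[INDIVIDUAL])
# Skipping owl:NamedIndividual (an assumption about intent) gives 1.
print(count_types(triples, ignore={"owl:NamedIndividual"})[INDIVIDUAL])
```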