geneontology / gocamgen

Base repo for constructing GO-CAM model RDF

Chaining occurs_in annotation extensions based on ontology structure #25

Open ukemi opened 5 years ago

ukemi commented 5 years ago

When annotation extensions are comma delimited and are from GO_CC, CL and an anatomy ontology then the model should indicate GO_CC<>part_of<>CL<>part_of<>anatomy.

MGI | MGI:1336882 | acts_upstream_of_or_within | GO:0070625 | MGI:MGI:3801544|PMID:18535671 | ECO:0000315 | MGI:MGI:3056083 | 20160406 | MGI | occurs_in(EMAPA:35651),occurs_in(CL:0002064),occurs_in(GO:1990794)

See the bottom of model: http://noctua.geneontology.org/editor/graph/gomodel:5c4605cc00000315

goodb commented 5 years ago

Note that the order of the part_ofs is based on the ontologies used: GO part_of cell part_of anatomy, e.g. occurs_in(EMAPA:35651),occurs_in(CL:0002064),occurs_in(GO:1990794)
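
A minimal sketch of that ordering rule, assuming the nesting order can be read off the CURIE prefix. This is not the gocamgen implementation: `ONTOLOGY_RANK` and `chain_occurs_in` are hypothetical names, and the prefix table only covers the ontologies mentioned in this thread.

```python
import re

# Assumed nesting order, most specific first: GO cellular component,
# then cell type (CL), then anatomy (EMAPA here). Illustrative only.
ONTOLOGY_RANK = {"GO": 0, "CL": 1, "EMAPA": 2}

# Matches one extension such as occurs_in(CL:0002064).
EXTENSION_RE = re.compile(r"(\w+)\(([^()]+)\)")


def chain_occurs_in(primary_term, extension):
    """Turn one comma-delimited extension into occurs_in/part_of triples."""
    terms = [term for rel, term in EXTENSION_RE.findall(extension)
             if rel == "occurs_in"]
    # Sort by ontology rank so the chain does not depend on the order
    # in which the terms were written in the GPAD line.
    terms.sort(key=lambda term: ONTOLOGY_RANK[term.split(":")[0]])
    triples = []
    subject, relation = primary_term, "occurs_in"
    for term in terms:
        triples.append((subject, relation, term))
        # Every link after the first is a part_of between locations.
        subject, relation = term, "part_of"
    return triples


for triple in chain_occurs_in(
        "GO:0070625",
        "occurs_in(EMAPA:35651),occurs_in(CL:0002064),occurs_in(GO:1990794)"):
    print(triple)
```

A prefix missing from `ONTOLOGY_RANK` raises a `KeyError` in this sketch, which makes unhandled ontologies visible instead of silently misplacing them.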

dustine32 commented 4 years ago

On the 2019-08-29 call we talked about treating commas like pipes. For this example:

occurs_in(EMAPA:17597),occurs_in(CL:0000589),occurs_in(CL:0000601)

We would create two nested assertions:

primary_term-occurs_in->CL:0000589->part_of->EMAPA:17597
primary_term-occurs_in->CL:0000601->part_of->EMAPA:17597

So the rule would be to split on terms from the same ontology (like CL here). @ukemi does this sound right? Do I have the right set of relations in the translation?

ukemi commented 4 years ago

@dustine32, this is correct. It should clean up a lot of cases where we used the incorrect delimiter in the annotation extensions.
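
The split rule agreed above can be sketched roughly as follows. This is an illustration under assumptions, not the code in the repo: `split_same_ontology` and `to_assertion` are hypothetical helpers, and if more than one ontology repeats within an extension this sketch takes the cross product of the repeated terms, which is only one possible reading.

```python
import re
from collections import defaultdict
from itertools import product

# Assumed nesting order, most specific first (illustrative table).
ONTOLOGY_RANK = {"GO": 0, "CL": 1, "EMAPA": 2}

OCCURS_IN_RE = re.compile(r"occurs_in\(([^()]+)\)")


def split_same_ontology(extension):
    """Return one nesting path per combination of same-ontology terms."""
    by_rank = defaultdict(list)
    for term in OCCURS_IN_RE.findall(extension):
        by_rank[ONTOLOGY_RANK[term.split(":")[0]]].append(term)
    # One list of alternatives per ontology level, most specific first.
    levels = [by_rank[rank] for rank in sorted(by_rank)]
    return [list(path) for path in product(*levels)]


def to_assertion(path):
    """Render a path in the arrow notation used in this thread."""
    return "primary_term-occurs_in->" + "->part_of->".join(path)


extension = "occurs_in(EMAPA:17597),occurs_in(CL:0000589),occurs_in(CL:0000601)"
for path in split_same_ontology(extension):
    print(to_assertion(path))
```

For the example extension this prints the two nested assertions listed above, one per CL term.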

dustine32 commented 4 years ago

Using the first annotation example above, I created this beautiful model: image

The actual original GPAD line:

MGI     MGI:1336882     acts_upstream_of_or_within      GO:0070625      MGI:MGI:3801544|PMID:18535671   ECO:0000315     MGI:MGI:3056083         20160406        MGI     occurs_in(EMAPA:35651),occurs_in(CL:0002064),occurs_in(GO:1990794)|occurs_in(EMAPA:35651),occurs_in(CL:0002064),occurs_in(GO:0045178)

@ukemi Does this look right? I still need to tackle the "same-ontology-comma-split" issue from my comment above.
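
For reference, a small sketch of how the extension column of a line like this might be read before any nesting is applied: pipes separate independent extension groups, and commas separate the extensions within one group. `parse_extensions` is a hypothetical helper, not a gocamgen function.

```python
import re

# Matches exactly one extension such as occurs_in(GO:1990794).
EXTENSION_RE = re.compile(r"^(\w+)\(([^()]+)\)$")


def parse_extensions(column):
    """Parse an extension column into groups of (relation, term) pairs."""
    if not column:
        return []
    groups = []
    for group in column.split("|"):      # pipe: independent groups
        pairs = []
        for item in group.split(","):    # comma: extensions in one group
            match = EXTENSION_RE.match(item.strip())
            if match is None:
                raise ValueError(f"unrecognized extension: {item!r}")
            pairs.append((match.group(1), match.group(2)))
        groups.append(pairs)
    return groups


column = ("occurs_in(EMAPA:35651),occurs_in(CL:0002064),occurs_in(GO:1990794)"
          "|occurs_in(EMAPA:35651),occurs_in(CL:0002064),occurs_in(GO:0045178)")
for group in parse_extensions(column):
    print(group)
```

Each resulting group can then be chained on its own, so the two groups here would differ only in their GO term.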

dustine32 commented 4 years ago

And on the "same-ontology-comma-split" issue, I have the code now doing this: image

From this annotation:

MGI     MGI:1915585     acts_upstream_of_or_within      GO:0090102      MGI:MGI:5615303|PMID:25605782   ECO:0000315     MGI:MGI:3817268         20151229        MGI     occurs_in(EMAPA:17597),occurs_in(CL:0000589),occurs_in(CL:0000601)

@ukemi Does this model look right as well?

It'll be interesting if more than one ontology appears multiple times in the same extension, e.g. occurs_in(EMAPA:17597),occurs_in(EMAPA:35247),occurs_in(CL:0000589),occurs_in(CL:0000601), though I have yet to find examples of this.

dustine32 commented 4 years ago

I found some example annotations of what looks like intended location nesting but using part_of instead of occurs_in:

MGI     MGI:1336882     part_of GO:0031201      MGI:MGI:3801544|PMID:18535671   ECO:0000314                     20160406        MGI     part_of(EMAPA:35651),part_of(CL:0002064),part_of(GO:0042589)

The resulting translated assertion is currently looking like a starfish (or a 3-tentacled octopus): image

@ukemi @vanaukenk Should these annotations be fixed in the upstream GPAD or should I translate these part_ofs just like the occurs_ins?

vanaukenk commented 4 years ago

For the component example above, we want to say that the GP -> part of CC1 -> part of CC2 -> part of CL -> part of EMAPA.

dustine32 commented 4 years ago

Thanks @vanaukenk ! This makes a ton of sense to me now given that its primary term is a CC. After looking at this with @tmushayahama , should that first part_of (GP -> part of CC1) be a located_in?

dustine32 commented 4 years ago

@vanaukenk In fact, even for the simple, no-extensions GP-part_of->CC GPAD lines, should we be translating that part_of qualifier to located_in?

vanaukenk commented 4 years ago

Good catch @dustine32 !

Looking at the example above, I think we don't actually have a complete representation for this in the ShEx.

Currently, we have:

@\ AND EXTRA a {
  a @\ ;
  located_in: @\ {0,1};
} // rdfs:comment "an information biomacromolecule - e.g. a protein or RNA product"

but I think we want the relation to be different for membership in a protein-containing complex:

@\ AND EXTRA a {
  a @\ ;
  located_in: @\ {0,1};
  part_of: @\ {0,1};
} // rdfs:comment "an information biomacromolecule - e.g. a protein or RNA product"

That would then result in a GP being 'part of' a protein-containing complex, but 'located in' a GO CC for the straight up GP-part_of ->CC GPAD lines. If that looks okay to you, then I'll update the ShEx and we'll see if it passes :-)

dustine32 commented 4 years ago

@vanaukenk Yeah that ShEx makes sense to me. I'll look at some of the protein complex annotations to see how I'm currently translating these and then update the code accordingly. Thanks!

vanaukenk commented 4 years ago

@dustine32

I was just looking at the protein-containing complex part of the ShEx again which says that the relation between a protein-containing complex and a GO CC is 'located in'.

EXTRA a {
  a @\;
  located_in: @\ {0,1};
  has_part: @\ *;
} // rdfs:comment "a protein complex"

This would then change the 'part of' relation between the protein-containing complex and the 'zymogen granule membrane' in the example above to 'located in'.

dustine32 commented 4 years ago

@vanaukenk Sorry, I'm slowly catching up to you as I just now realized SNARE complex is a descendant of GO:0032991. You're right, so I'll get to have more fun plugging this logic in.
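
The relation choices discussed above can be sketched as a small lookup. This is only an illustration of the logic in this thread, not the ShEx itself or the gocamgen code: `classify`, `RELATIONS`, and `relation_for` are hypothetical names, the prefix-based classification is a coarse assumption, and `complex_terms` stands in for a real ontology query for descendants of GO:0032991.

```python
def classify(term, complex_terms):
    """Coarsely label a CURIE as 'gp', 'complex', 'cc', or 'anatomy'."""
    prefix = term.split(":")[0]
    if prefix == "GO":
        # Assumption: any GO term reaching this point is a cellular
        # component, and complexes are those listed in complex_terms.
        return "complex" if term in complex_terms else "cc"
    if prefix in ("CL", "EMAPA"):
        return "anatomy"
    return "gp"


# Relation to assert for each (subject kind, object kind) pair, following
# the discussion above. CC-to-CC is deliberately left out.
RELATIONS = {
    ("gp", "complex"): "part_of",
    ("gp", "cc"): "located_in",
    ("complex", "cc"): "located_in",
    ("cc", "anatomy"): "part_of",
    ("anatomy", "anatomy"): "part_of",
}


def relation_for(subject, obj, complex_terms):
    """Pick the relation between two nodes; raises KeyError if unsupported."""
    key = (classify(subject, complex_terms), classify(obj, complex_terms))
    return RELATIONS[key]


# GO:0031201 is the primary term of the protein complex annotation in this
# thread; treating it as a complex here is supplied by hand, not looked up.
complexes = {"GO:0031201"}
print(relation_for("MGI:MGI:1336882", "GO:0031201", complexes))
print(relation_for("GO:0031201", "GO:0042589", complexes))
print(relation_for("GO:0042589", "CL:0002064", complexes))
```

Leaving unsupported pairs out of the table means they fail loudly with a `KeyError`, which seems safer than guessing a relation the shapes may not allow.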

vanaukenk commented 4 years ago

No worries @dustine32 The ShEx is keeping us on our toes!

dustine32 commented 4 years ago

@vanaukenk @ukemi From the annotation that I mentioned in https://github.com/geneontology/go-shapes/issues/23: image

MGI     MGI:1336882     part_of GO:0042588      MGI:MGI:3801544|PMID:18535671   ECO:0000314                     20160406        MGI part_of(EMAPA:35651),part_of(CL:0002064),part_of(GO:1990794)|part_of(EMAPA:35651),part_of(CL:0002064),part_of(GO:0045178)

This involves the CC-part_of->CC relation, which I don't currently see in the ShEx spec:

<CellularComponent> @<GoCamEntity> AND EXTRA a {
  a ( @<CellularComponentClass> OR @<NegatedCellularComponentClass> ) {1};
  part_of: @<AnatomicalEntity> {0,1};
  adjacent_to: @<AnatomicalEntity> *;
  overlaps: @<AnatomicalEntity> *;
} // rdfs:comment  "a cellular component"

dustine32 commented 4 years ago

And here's what I'm now doing for protein complex annotations: image

MGI     MGI:1336882     part_of GO:0031201      MGI:MGI:3801544|PMID:18535671   ECO:0000314                     20160406        MGI     part_of(EMAPA:35651),part_of(CL:0002064),part_of(GO:0042589)

@vanaukenk @ukemi I would be fine if you guys wanted to break the protein complex nesting into a new ticket.

ukemi commented 4 years ago

Hmm. My only concern with this is that in the ontology we use part of for complexes and other components. 1745 protein-containing complexes are part_of some cellular_component.

goodb commented 4 years ago

Maybe off the central topic, but it seems odd to recapitulate ontology relations here in the model. e.g. "pancreatic acinar cell" is part of "pancreatic acinus" in the cell ontology.

ukemi commented 4 years ago

It is odd. In my dream world, these would be shown either as a toggle or as persistent objects.

ukemi commented 4 years ago

But that said, I did this all the time (when I actually had time to make models) just so I see it in the model.

ukemi commented 4 years ago

In more complicated models, we will want to be able to attach functions to different cells that all might be part of the same anatomical structure.

goodb commented 4 years ago

> It is odd. In my dream world, these would be shown either as a toggle or as persistent objects.

This looks like the start of an argument for better incorporation of ontology visualization into Noctua. I would imagine that many curators end up having one window with the ontologies they are working with loaded into Protege or OLS etc. while they are working on their models. Having the ability to reveal the class structures they are using in the context of the instance models would be very powerful, especially in showing the impacts of inferences.

dustine32 commented 4 years ago

> Hmm. My only concern with this is that in the ontology we use part of for complexes and other components. 1745 protein-containing complexes are part_of some cellular_component.

@ukemi Would a solution be to translate like SNARE complex -has_part-> MGI:MGI:1336882? I see that the ShEx allows both this and ProteinContainingComplex -has_part-> InformationBiomacromolecule.

I think @tmushayahama's Noctua form recognizes complex-to-GP links by the has_part relation. Even though the two statements are equivalent(?), should we follow only one convention for complexes, for consistency?

ukemi commented 4 years ago

Just noticing this now, but wanted to double-check. When we incorporate the anatomical structures in the imports, they are still EMAPA terms if that is what they were in the original annotation, right?

dustine32 commented 4 years ago

@ukemi Yep, the EMAPAs stay the same in the import. I don't xref to UBERON or anything. I only attempt to follow xrefs from the GOREL extension relations to RO/BFO.

Here is the sub-model for MGI:MGI:1336882 for your perusal.