geneontology / noctua

Graph-based modeling environment for biology, including prototype editor and services
http://noctua.geneontology.org/
BSD 3-Clause "New" or "Revised" License

Create simple annoton form #461

Closed cmungall closed 6 years ago

cmungall commented 6 years ago

Rough sketch of concept:

image

TBD:

TODO: Assign this to @DoctorBud, invite sent

cmungall commented 6 years ago

Each evidence slot will need:

TBD: make a column for each or make a complex cell in the row

Underlying datamodel:

https://github.com/geneontology/minerva/blob/master/specs/owl-model.md

cmungall commented 6 years ago

R/W vs W-only:

In order to implement read (graph->tuple), we would need some server side functionality to avoid overmatching in the general case (constraining by var). However, for a subset of cases including the annoton one we can live with some overmatching.

Here the procedure is to match on edge predicates only.

E.g., for

  - edge: [mf, 'enabled by', gp]
  - edge: [mf, 'occurs in', cc]
  - edge: [mf, 'part of', bp]

treat as the SPARQL query

SELECT ?gp ?mf ?bp ?cc WHERE {
  ?mf :enabled_by ?gp .
  ?mf :occurs_in ?cc .
  ?mf :part_of ?bp .
}

Assume that all vars are obligatory (will be filled in with root class if none known)
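
A minimal sketch of the predicate-only matching described above, written in Python rather than SPARQL so it can be run without a triple store. The function name `match_rows`, the tuple-based graph encoding, and the node names in the usage note are illustrative assumptions, not part of Noctua or Minerva.

```python
# Hypothetical template: (subject var, predicate, object var),
# mirroring the three edges listed above.
TEMPLATE = [
    ("mf", "enabled_by", "gp"),
    ("mf", "occurs_in", "cc"),
    ("mf", "part_of", "bp"),
]


def match_rows(triples, template=TEMPLATE):
    """Return one dict of variable bindings per match of the template.

    Matching looks at edge predicates only, so any node carrying all
    three edges is reported, whatever its type (the overmatching
    mentioned above).
    """
    rows = [{}]
    for s_var, pred, o_var in template:
        next_rows = []
        for row in rows:
            for s, p, o in triples:
                if p != pred:
                    continue
                # Join on variables that are already bound in this row.
                if row.get(s_var, s) != s or row.get(o_var, o) != o:
                    continue
                next_rows.append({**row, s_var: s, o_var: o})
        rows = next_rows
    return rows
```

Called on a graph such as `[("mf1", "enabled_by", "gpA"), ("mf1", "occurs_in", "cc1"), ("mf1", "part_of", "bp1")]`, this yields a single row binding all four variables. Because every pattern is obligatory here, a node missing any one edge produces no row at all.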

dosumis commented 6 years ago

Assume that all vars are obligatory (will be filled in with root class if none known)

I've been wondering whether this should be the case. Making every var obligatory may lead to too much proliferation of templates. E.g., for TF templates it could be useful to have an optional box for a protein binding partner involved in regulating transcription.

cmungall commented 6 years ago

Optionality should be easier for noctua abox in generation mode: no lexical metadata generation, each triple is its own list element (simply drop list members if var not filled in generation; make an optional match for graph->row).
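
To make the two directions concrete, here is a rough Python sketch, assuming a template stored as a list of (subject var, predicate, object var) tuples. The names `generate_triples` and `graph_to_row` are hypothetical, and the sketch assumes every template edge starts at the `mf` var, as in the annoton case.

```python
# Hypothetical template, one list element per triple.
TEMPLATE = [
    ("mf", "enabled_by", "gp"),
    ("mf", "occurs_in", "cc"),
    ("mf", "part_of", "bp"),
]


def generate_triples(row, template=TEMPLATE):
    """Row -> graph: drop any triple whose subject or object var is unfilled."""
    return [
        (row[s_var], pred, row[o_var])
        for s_var, pred, o_var in template
        if row.get(s_var) and row.get(o_var)
    ]


def graph_to_row(mf, triples, template=TEMPLATE):
    """Graph -> row: every edge is an optional match; missing vars become None.

    Simplification: if several edges share a predicate, only the first
    is kept, so this does not handle repeated slots.
    """
    row = {"mf": mf}
    for _s_var, pred, o_var in template:
        hits = [o for s, p, o in triples if s == mf and p == pred]
        row[o_var] = hits[0] if hits else None
    return row
```

With this shape, a row that fills only `mf` and `gp` generates just the `enabled_by` triple, and reading that graph back gives a row with `cc` and `bp` left empty.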

How should we indicate optionality in the yaml?

dosumis commented 6 years ago

JSON schema spec is here:

https://github.com/dosumis/dead_simple_owl_design_patterns/blob/master/spec/DOSDP_schema_full.yaml#L617

and here

https://github.com/dosumis/dead_simple_owl_design_patterns/blob/master/spec/DOSDP_schema_full.yaml#L335

I could easily add a GO-CAM evidence object type and allow that in the annotations slot under opa.

cmungall commented 6 years ago

I suggest for the form in noctua we treat all as optional for now

On Jul 11, 2017 11:55, "David Osumi-Sutherland" notifications@github.com wrote:

I could add any extra boolean into the opa object to indicate optionality

Sorry - this would need to be by var. Need to think more carefully about this.


DoctorBud commented 6 years ago

I created a sketch of one way we might render the evidence and an Annoton while retaining a tabular format: Plunker v1 here

I'm using the concept of 'parent' to assign the evidence rows to a particular parent column, and the current demo always shows evidence for all parents. However, it should be doable to make it so that showing all evidence is either prohibited or optional, and that the normal mode is that the column-specific evidence subtable appears when the parent column (e.g., MF) is being edited or the user clicks a little disclosure triangle within the parent column.

DoctorBud commented 6 years ago

Of course, the 'Parent' column in the evidence subgrid would be unnecessary to display if it was otherwise clear to the user that they were looking at a particular column's evidence. Only when the user showed ALL the evidence would they need to see the rows grouped or otherwise distinguished and bound to their parent columns.
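
The parent-column idea can be sketched in a few lines of Python. This is only an illustration of the grouping logic, not the Plunker code; the `parent` key and the other field names are assumptions.

```python
from collections import defaultdict


def group_evidence(evidence_rows):
    """Group evidence rows by the annoton column they belong to."""
    grouped = defaultdict(list)
    for row in evidence_rows:
        grouped[row["parent"]].append(row)
    return dict(grouped)


def visible_evidence(evidence_rows, active_column=None):
    """Return the evidence subtable(s) to display.

    With no active column, all evidence is shown, grouped by parent.
    Otherwise only the subtable of the column being edited is shown.
    """
    grouped = group_evidence(evidence_rows)
    if active_column is None:
        return grouped
    return {active_column: grouped.get(active_column, [])}
```

When a single column is active, the parent key becomes redundant for display, which matches the point above about hiding the 'Parent' column.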

dosumis commented 6 years ago

Optionality should be easier for noctua abox in generation mode: no lexical metadata generation, each triple is its own list element (simply drop list members if var not filled in generation; make an optional match for graph->row).

Agree.

Some thoughts on specifying vars:

(i) All MF models will need a GP slot. This could be specified by default for LEGO instance graphs.

(ii) The set of compulsory vars should be sufficient for MF classification.

(iii) Optional vars will need their own var field. I'm a little bothered by the proliferation of var field types, but I don't see any other clean way to do this, given the simple key (varname) : value (range) structure.

vanaukenk commented 6 years ago

Hi,

I have a few questions about the specs:

Thx.

dosumis commented 6 years ago

Hi Kimberley,

This ticket is trying to kill two birds with one stone:

  1. Make a protein2GO like table interface for simple annotations.
  2. Implement a general system that allows us to automatically configure table-based templates for noctua. These will be driven by a templating system that is built on top of the ontology design pattern system.

Your (very reasonable) questions/requests fit under aim 1. For this ticket, I think we need to think about them in terms of how easily they are compatible with aim 2.

Will the form interface allow curators to select other relations, e.g. acts upstream of or within, between GPs and GO terms?

The template specifies: GP <-enabled_by- MF -part_of-> BP

The new relations link GO -> BP (Not GP->MF). This probably requires a separate template for adding BP-only annotations. (Note to self: would also require relation vars).

Do we want to also have slots for annotation extensions or would all contextual information be added solely via the graphical interface?

Hard to implement as a template. I'd favour pushing users to do this via the graphical interface as a way of easing them into GO-CAM modeling.

Will curators be able to pre-populate any of the annotation information in the table with existing data? For example, if I am curating a paper that describes a new role for a GP in some BP, can I begin the curation session by pre-populating the MF data for that GP from another paper?

There's no reason that template variable slots could not be filled by data from some specified source. But I suspect that functionality like this would have to be a later addition, after we get a basic implementation up and running and depending on dev resources.

srengel commented 6 years ago

Curators favor being able to do the annotation extensions via the tabular interface. Getting the extensions to work the way we want is proving problematic via the graphical interface.

vanaukenk commented 6 years ago

@dosumis Thanks for your comments. Yes, given the time/resource constraints I agree that we need to start with a basic implementation and take it from there. If one of the goals, though, is to develop a P2GO-like tabular curation interface, I just want to make sure that what we're doing now is ultimately compatible with providing the functionality that curators want.

dosumis commented 6 years ago

@dosumis Thanks for your comments. Yes, given the time/resource constraints I agree that we need to start with a basic implementation and take it from there. If one of the goals, though, is to develop a P2GO-like tabular curation interface, I just want to make sure that what we're doing now is ultimately compatible with providing the functionality that curators want.

Yep. Just pointing out that the approach outlined by Chris on this ticket (and which I'm doing part of the work for) has its limitations. It also has the great advantage that we can easily support a much larger set of form-based interfaces to guide curation. If the limitations are too great, then it might not be the best way to get a P2GO type interface, but Chris will have to comment on whether other approaches fit with general dev/resource plans.

Getting the extensions to work the way we want is proving problematic via the graphical interface.

Can you give some examples? The graphical system should be clearer and more flexible than the old extensions system. If it's not, then maybe there are some modeling, interface or documentation issues we need to address. Part of the problem here is that the concept of an extension doesn't fit easily into LEGO. With the annoton table (see Chris' outline in the first comment above) there are slots for MF, BP, CC. If you fill out MF + BP, this is equivalent to extending an MF with part_of(BP). Whereas if you want to extend with a cell type, then in GO-CAM you would need a separate box, and if you have multiple slots (MF, BP, CC) filled out, you need to specify which of the three you are trying to extend.

srengel commented 6 years ago

this is a set of annotations from a single paper PMID:25602519 that we are trying to input using Noctua:

Biological Process

  - CDC28 GO:1905634 regulation of protein localization to chromatin IMP has_input(RLF2)
  - CDC28 GO:0018105 peptidyl-serine phosphorylation IDA has_direct_input(RLF2)
  - CDC7 GO:0018105 peptidyl-serine phosphorylation IDA has_direct_input(RLF2)
  - CDC7 GO:0006468 protein phosphorylation IDA has_direct_input(RTT106)

Molecular Function

  - CDC28 GO:0004674 protein serine/threonine kinase activity IDA has_direct_input(RLF2)
  - CDC7 GO:0004674 protein serine/threonine kinase activity IDA has_direct_input(RLF2)

Cellular Component

  - RLF2 colocalizes with GO:000785 chromatin IDA

SGD curators have expressed frustration with this because in P2GO this would have taken only moments to do and be done with it. In Noctua, 7 SGD curators have come up with 8 different models, all of which have different sets of extensions in the annotation_preview, none of which anyone likes! our consistency has dropped through the floor :(

kltm commented 6 years ago

I think the main tension here is that we want two different interfaces. One is essentially a P2GO work-alike that would be very efficient for current GO curators for making annotations as they do now. The other is essentially a method of patterned model building over the DOSDPs. While both are useful and will be necessary, I'm not sure that these are closely enough aligned to be the same thing (much less the same ticket).

krchristie commented 6 years ago

The graphical interface is, in my opinion, NOT clearer than the old extensions system. It is definitely not intuitive which individuals to put the extensions into in order to make the desired statement within the model and thus get the extensions in the right place in the Annotation Preview/GPAD. Here's an example where the two "lines" are duplicates with the exception of the location of the extensions (and a change in the gene name used in the second line in order to identify which line is generating the correct extensions): ID: gomodel:5900dc7400001088 Name: PMID-27693694-KRC location of extensions

cmungall commented 6 years ago

Thanks for the clear example

We can see the model and the labelified GPAD side by side:

http://noctua.berkeleybop.org/editor/graph/gomodel:5900dc7400001088

http://noctua.berkeleybop.org/workbench/annpreview/?model_id=gomodel:5900dc7400001088

Before thinking about the GPAD rendering I'd like to encourage first thinking about the actual biological inferences. If we have a set of processes nested inside one another, then making a statement about the location of the outermost process necessarily percolates down to the nested processes. E.g. if a Russian doll is in a room, then the innermost doll is also in the room.

We can see this in your example: in the Nphp1 case, we know the MF has to occur in the cochlea. For the Tprn case, we can't infer anything about where the outer process occurs, other than it must at least overlap with the cochlea.

So in the former case (Nphp1), the assertion is formally stronger: you get more inference from it.
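
The asymmetry can be shown with a toy Python sketch. It is not how the reasoner is implemented; the function `inferred_locations` and the dict-based encoding of `part_of` and `occurs_in` are assumptions made for illustration.

```python
def inferred_locations(node, part_of, occurs_in):
    """Locations asserted on `node` or on any process that contains it.

    part_of maps a node to the processes it is directly part of;
    occurs_in maps a node to its asserted locations. Locations flow
    from an outer process down to what is nested inside it, never
    from the inner node up to its container.
    """
    locations = set()
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        locations.update(occurs_in.get(current, ()))
        stack.extend(part_of.get(current, ()))
    return locations
```

With `part_of = {"mf": ["bp"]}`, putting the cochlea on the outer process (`occurs_in = {"bp": ["cochlea"]}`, the Nphp1-style case) gives the MF the cochlea location as well. Putting it on the MF only (`occurs_in = {"mf": ["cochlea"]}`, the Tprn-style case) leaves the outer process with no inferred location.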

In general you should be fine making the most accurate biological assertion you can, but I understand the desire to make a lego model that will map back to a desired GAF target. Hope this helps with how to think about it.

I would also argue that this illustrates the benefits of doing this in the graphical view. The inference reflects the graphical structure of the model. In a form view with extensions there is no such cue. But I appreciate we may not have done a good job in explaining all this.

krchristie commented 6 years ago

I know you guys keep saying think about the model, not the GPAD, but without the annotation preview, I have very little idea what the model is actually doing. In addition, since it is going to take a while to transition tools to use annotations in models rather than single lines in GPAD format, I think that getting it right in GPAD format is still pretty important, and will be for some time to come.

While I can see how one cannot infer that the downstream process occurs in the cochlea just because the Tprn MF occurs there, it is unclear to me why it is safe to make the inference in the other direction. Thus, in practice, getting what I think is true (regardless of whether you consider the model or the GPAD output) is pretty much trial and error, with a lot of error :(

dosumis commented 6 years ago

@srengel wrote:

In Noctua, 7 SGD curators have come up with 8 different models, all of which have different sets of extensions in the annotation_preview, none of which anyone likes! our consistency has dropped through the floor :(

The aim of the templated form system is to provide templates to guide consistency. This ticket was just a first attempt at a templated form.

For the examples you gave, I'm wondering why curators didn't just choose 'has input' relationships to gene products in the graph ('has direct input' didn't make the cut to LEGO - it's never been in RO and has never been used in the ontology). With that you should see exactly the inferences you need.

dosumis commented 6 years ago

@ukemi - Is there some doc about what gets filtered out in generation of AEs in GPAD? I'm wondering if some of the confusion curators are experiencing comes from overzealous pruning.

balhoff commented 6 years ago

@dosumis not a replacement for proper documentation, but here are all the currently allowed extension relations: https://github.com/geneontology/minerva/blob/master/minerva-converter/src/main/resources/org/geneontology/minerva/legacy/sparql/gpad-extensions.rq#L18-L102

That is only a starting point, and has not been formally reviewed or blessed.

dosumis commented 6 years ago

Thanks Jim!

Wasn't there a rule that ignored edges lacking evidence? If so, can you confirm that this is no longer in place (it doesn't look like it from the SPARQL)?

Cheers, David

balhoff commented 6 years ago

It should no longer be in place. Seth deployed it a couple of days ago and actually I haven't yet confirmed the situation in the production Noctua.

dosumis commented 6 years ago

It should no longer be in place. Seth deployed it a couple of days ago and actually I haven't yet confirmed the situation in the production Noctua.

Ahh - so some of the confusion around how to structure the graph to get desired extensions may have been based on problems with the previous setup.

thomaspd commented 6 years ago

I'm commenting on Stacia's concerns about consistency, for PMID:25602519. I looked at all the models. It's clearly just a training issue. Stacia's and Emily's models overall are good (I have a few suggestions for each of them, see below) because they understood how to express "standard" GO annotations in Noctua, while the untrained curators, not surprisingly, did not. MF annotations are clear (MF enabled_by gp), but BP annotations are more complicated ([MF enabled by gp] part of BP), as are CC annotations ([MF enabled by gp] occurs in CC).
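
As a rough illustration of those three shapes, the Python sketch below expands one standard annotation into edges. The function `annotation_to_edges` is hypothetical; it uses the GAF aspect codes F, P and C, and falls back to the root MF class (GO:0003674, molecular_function) when no specific MF is known, following the root-class convention suggested earlier in this thread. In a real model the nodes would be individuals typed by these classes, not the class IDs themselves.

```python
ROOT_MF = "GO:0003674"  # molecular_function, used when no specific MF is known

# Relation linking the MF node to the annotated term, per GAF aspect code.
ASPECT_RELATION = {"P": "part_of", "C": "occurs_in"}


def annotation_to_edges(gp, term, aspect):
    """Expand one standard annotation into (subject, predicate, object) edges."""
    if aspect == "F":
        # MF annotation: MF enabled_by gp
        return [(term, "enabled_by", gp)]
    # BP or CC annotation: [MF enabled_by gp] part_of BP / occurs_in CC
    return [
        (ROOT_MF, "enabled_by", gp),
        (ROOT_MF, ASPECT_RELATION[aspect], term),
    ]
```

For example, a BP annotation of CDC28 to GO:1905634 becomes two edges hanging off a root MF node, whereas an MF annotation to GO:0004674 is a single `enabled_by` edge.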

Here's my suggested model for this one: http://noctua.berkeleybop.org/editor/graph/gomodel:5966411600000744

It only took me a few minutes, so I don't think it was much less efficient than P2GO, and it allows me to link up the main causal chain from the paper, which I can't do in P2GO.

It differs from Emily's and Stacia's models in a couple of minor ways:

  1. I didn't try to add a protein phosphorylation process, as it's a single-step process and already covered by protein kinase activity. But if you want to add it, Emily did it properly (the MF is part of a larger process)
  2. I used a "positive regulation" edge to term X rather than part of "positive regulation of term X". This is preferable for the graph but I see it gives the wrong GPAD. The GPAD should be to the positive regulation term rather than loading the relation column. So I can see why both Emily and Stacia chose to model it the way they did.
  3. I arranged the causal chain from left to right, so it's easier to read.
  4. The two activities of CDC7 are on different substrates, so they should be separate from each other

srengel commented 6 years ago

@dosumis thanks DavidOS for commenting about the correct RO term. some of us have has_direct_input burned into our brains, and didn't realize it wasn't on the 'allow' list for Noctua. we will work to adjust our thinking and retrain our brains.

@thomaspd thanks so much Paul T for looking at all our models. lots of dirty laundry there, tho ;) your comments and your model will be really helpful. we'll be taking another look at these models in comparison to yours at our curator meeting next week. we've been doing one paper/model each week in curator meeting. people do their models ahead of time then we come together to discuss.

last week we got stuck on the topics that you guys have hit on here, so i feel good about that at least:

  1. has_direct_input v has_input
  2. how to do the protein phosphorylation, then ended up at 'should we even be adding this annotation?', sounds like no we should not
  3. keeping the activities on different substrates separate and how to do so

this is good. i'm pivoting back toward feeling encouraged.

cmungall commented 6 years ago

Now done: https://github.com/geneontology/simple-annoton-editor