geneontology / gocamgen

Base repo for constructing GO-CAM model RDF

Pipe-separated annotation extensions should result in separate GO-CAM annotations #40

Open vanaukenk opened 5 years ago

vanaukenk commented 5 years ago

We have this in the Google doc but just to note an example here for testing purposes:

The inputs on the mec-3 contributes_to "RNA polymerase II regulatory region, sequence-specific DNA binding" annotation are pipe-separated and should be split out into separate annotations.

dustine32 commented 5 years ago

For @ukemi cuz I found this example in the MGI file:

MGI     MGI:2159711     part_of GO:0044297      MGI:MGI:4361056|PMID:19684588   ECO:0000314                     20111103        MGI     part_of(EMAPA:16525),part_of(CL:0000678)|part_of(EMAPA:16525),part_of(CL:0000678)

It looks like the pipe-separated values are duplicates. Should I take the liberty of consolidating these dupes into one extension or should I leave them alone and emit them separately in the model?

Also, minor: when counting the occurrences of a pattern for reporting (like in our pattern spreadsheet) would I count this example as one or two occurrences of part_of(EMAPA),part_of(CL)?

ukemi commented 5 years ago

Hi @dustine32, I have found a few of these too. It looks like the curator cut and pasted the same info twice. You should consolidate exact duplicates. If this were not duplicated, it would count as two occurrences, because it would be split into two annotations: one with part_of(EMAPA1),part_of(CL1) and the other with part_of(EMAPA2),part_of(CL2).
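The consolidation and counting rule above could be sketched roughly as follows. This is a minimal illustration, not the gocamgen implementation: all function names are hypothetical, and a "pattern" is assumed to be the extension group with each ID reduced to its prefix.

```python
from collections import Counter
import re


def split_extension_groups(extension_field):
    """Split an annotation-extension field on '|' into separate groups."""
    return [g.strip() for g in extension_field.split("|") if g.strip()]


def consolidate_duplicates(groups):
    """Drop exact duplicate groups, keeping first-seen order."""
    return list(dict.fromkeys(groups))


def pattern_of(group):
    """Reduce each ID to its prefix, e.g. part_of(CL:0000678) -> part_of(CL)."""
    return re.sub(r"\(([^:()]+):[^()]*\)", r"(\1)", group)


def count_patterns(extension_fields):
    """Count pattern occurrences, one per group left after consolidation."""
    counts = Counter()
    for field in extension_fields:
        for group in consolidate_duplicates(split_extension_groups(field)):
            counts[pattern_of(group)] += 1
    return counts


# Extension field from the MGI line quoted in this thread (exact duplicates).
mgi_field = (
    "part_of(EMAPA:16525),part_of(CL:0000678)"
    "|part_of(EMAPA:16525),part_of(CL:0000678)"
)
print(count_patterns([mgi_field]))
# Counter({'part_of(EMAPA),part_of(CL)': 1})
```

Only character-for-character duplicates are merged here. Two groups that list the same extensions in a different order would still be treated as distinct.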

ukemi commented 5 years ago

Note that this would be nested in a GO-CAM where the cell (CL) would be a part of the anatomical structure (EMAPA).
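As a rough illustration of that nesting, the sketch below turns one extension group into a chain of part_of triples, with the cell type placed inside the anatomical structure. The function name, the plain-tuple triples, and the prefix-based ordering table are assumptions made for this example; they are not how gocamgen represents the model.

```python
import re

EXT_RE = re.compile(r"^(\w+)\(([^()]+)\)$")

# Assumed nesting order, innermost first: cell type (CL), then anatomy (EMAPA).
NESTING_ORDER = {"CL": 0, "EMAPA": 1}


def nest_part_of(go_term, group):
    """Chain a GO term through its part_of extensions as (s, p, o) triples."""
    targets = []
    for item in group.split(","):
        match = EXT_RE.match(item.strip())
        if match is None or match.group(1) != "part_of":
            raise ValueError(f"unsupported extension: {item!r}")
        curie = match.group(2)
        if curie.split(":")[0] not in NESTING_ORDER:
            raise ValueError(f"no nesting rule for: {curie!r}")
        targets.append(curie)
    # Sort so the cell type comes before the anatomical structure.
    targets.sort(key=lambda curie: NESTING_ORDER[curie.split(":")[0]])
    chain = [go_term] + targets
    return [(chain[i], "part_of", chain[i + 1]) for i in range(len(chain) - 1)]


for triple in nest_part_of(
    "GO:0044297", "part_of(EMAPA:16525),part_of(CL:0000678)"
):
    print(triple)
# ('GO:0044297', 'part_of', 'CL:0000678')
# ('CL:0000678', 'part_of', 'EMAPA:16525')
```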

dustine32 commented 5 years ago

@ukemi Exactly what I needed to know. Thank you!

dustine32 commented 5 years ago

This is mostly ready to test on noctua-dev. The two aspects of this ticket:

  1. Splitting pipe-separated extensions into multiple annotations. This model for WB:WBGene00003167 shows that the "has input" extensions are now separated.
  2. Condensing duplicated extension values. Our example of this MGI:MGI:2159711 still has two annotation individuals for Usp33-part of->cell body in noctua-dev though I've fixed it in my local instance at USC.

I'll try getting one more push into noctua-dev before the meeting (would like to get a start of has_regulation_target in too), hopefully today or tomorrow.

ukemi commented 5 years ago

It looks like the consolidation in the model above is for two annotations that are exact duplicates. Both of the evidence statements are exactly the same. Didn't we decide we wanted to only count these once? Are there exact duplicate annotations in the GPAD file?

ukemi commented 5 years ago

When I look in our editorial interface, I see two annotations to cell body that are identical except for an additional note that will eventually be loaded into a text field. It represents two different developmental stages. The cell type and anatomy extensions are still missing from the GO-CAM model. It should indicate that the cell body is part of a commissural neuron that is part of the future spinal cord.

ukemi commented 5 years ago

It might be best to look at this together along with the GPAD file. This is an interesting twist.

dustine32 commented 5 years ago

@ukemi Ahh, that explains a lot! Checking the GPAD file used,

source_path: http://www.informatics.jax.org/downloads/reports/mgi.gpa.gz
header_date: 04/03/2019

I only see the one line:

$ grep MGI:2159711 mgi.gpa | grep GO:0044297
MGI MGI:2159711 part_of GO:0044297  MGI:MGI:4361056|PMID:19684588   ECO:0000314         20111103    MGI part_of(EMAPA:16525),part_of(CL:0000678)|part_of(EMAPA:16525),part_of(CL:0000678)

And here I see the "duplicated" extensions and no notes. This is the MGI GPAD upstream of the GO pipeline so my guess is the GPAD export process from MGI is doing this. Maybe this situation will be handled in the one-off import file?
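To check how widespread this is, the grep above could be generalized into a small scan for lines whose extension column repeats the same group. This is a sketch only: the function name is hypothetical, and it assumes a tab-separated file with the annotation extensions in the 11th column, as in the line shown above.

```python
from collections import Counter

# Assumption: annotation extensions sit in the 11th tab-separated column.
EXTENSION_COL = 10


def find_duplicated_extensions(lines):
    """Return (object_id, go_id, repeated_groups) for lines with repeats."""
    hits = []
    for line in lines:
        if line.startswith("!"):  # skip header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= EXTENSION_COL:
            continue
        groups = [g for g in cols[EXTENSION_COL].split("|") if g]
        repeated = [g for g, n in Counter(groups).items() if n > 1]
        if repeated:
            hits.append((cols[1], cols[3], repeated))
    return hits


# The MGI line from this thread, rebuilt with explicit tabs.
mgi_line = "\t".join([
    "MGI", "MGI:2159711", "part_of", "GO:0044297",
    "MGI:MGI:4361056|PMID:19684588", "ECO:0000314", "", "",
    "20111103", "MGI",
    "part_of(EMAPA:16525),part_of(CL:0000678)"
    "|part_of(EMAPA:16525),part_of(CL:0000678)",
])
print(find_duplicated_extensions([mgi_line]))
```

Because the function takes any iterable of lines, an open file handle for the unzipped GPAD file can be passed in directly.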

dustine32 commented 4 years ago

For closing this ticket, here's an updated model WB:WBGene00003167 showing the with/from annotation split from a binding descendant GPAD line:

[image]

From this GPAD line:

WB      WBGene00003167  contributes_to  GO:0000977      PMID:9735371|WB_REF:WBPaper00003265     ECO:0000314                     20140910        WB      has_direct_input(WB:WBGene00003168)|has_direct_input(WB:WBGene00003171)|has_direct_input(WB:WBGene00036254)
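For reference, the split applied to a line like this one can be sketched as below: one output annotation per pipe-separated extension group, with every other column copied unchanged. The column names and the `split_annotation` function are illustrative assumptions, not the gocamgen code, and the line is rebuilt here with explicit tabs.

```python
# Assumed column names for a 12-column GPAD-style line (labels are illustrative).
GPAD_COLUMNS = [
    "db", "db_object_id", "qualifier", "go_id", "references", "evidence",
    "with_from", "interacting_taxon", "date", "assigned_by",
    "extensions", "properties",
]


def split_annotation(line):
    """Return one annotation dict per distinct pipe-separated extension group."""
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(GPAD_COLUMNS) - len(values))  # pad missing columns
    record = dict(zip(GPAD_COLUMNS, values))
    # Split on '|' and drop exact duplicates while keeping order.
    groups = list(dict.fromkeys(g for g in record["extensions"].split("|") if g))
    if not groups:
        return [record]
    return [{**record, "extensions": group} for group in groups]


wb_line = "\t".join([
    "WB", "WBGene00003167", "contributes_to", "GO:0000977",
    "PMID:9735371|WB_REF:WBPaper00003265", "ECO:0000314", "", "",
    "20140910", "WB",
    "has_direct_input(WB:WBGene00003168)"
    "|has_direct_input(WB:WBGene00003171)"
    "|has_direct_input(WB:WBGene00036254)",
])

for annotation in split_annotation(wb_line):
    print(annotation["db_object_id"], annotation["go_id"], annotation["extensions"])
# WBGene00003167 GO:0000977 has_direct_input(WB:WBGene00003168)
# WBGene00003167 GO:0000977 has_direct_input(WB:WBGene00003171)
# WBGene00003167 GO:0000977 has_direct_input(WB:WBGene00036254)
```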

@vanaukenk @ukemi Feel free to close if this looks good.

dustine32 commented 4 years ago

@vanaukenk @ukemi Actually, looking at those "contributes to" relations while transforming into ShExCop, I don't see any mention of "contributes to" in the ShEx spec at all. Are we still using this relation in the imports?