geneontology / noctua-models

This is the data repository for the models created and edited with the Noctua tool stack for GO.
http://noctua.geneontology.org/
Creative Commons Attribution 4.0 International
10 stars 3 forks source link

Possible issues with the latest SynGO model batch #55

Closed kltm closed 6 years ago

kltm commented 6 years ago

From https://github.com/geneontology/noctua-models/pull/54 Looking at http://noctua-dev.berkeleybop.org/editor/graph/gomodel:SYNGO_1371 as an example, apparent issues with SynGO models:

Tagging @cmungall @dosumis

kltm commented 6 years ago

Apparently some of these were in progress:

https://github.com/geneontology/noctua-models/pull/43#issuecomment-326085314 https://github.com/geneontology/syngo2lego/issues/1

dosumis commented 6 years ago

HI Seth,

Will try to get these fixed today - at least for cases where this is under my control:

dosumis commented 6 years ago

@kltm - Looking at a randomly chosen model (gomodel-59a9023b00000151) it looks like edges without evidence have the additional date and contributor annotations, but those with evidence do not. Is this correct?

dosumis commented 6 years ago

Re-running with evidence individuals fixed. Consistent with previous comment, I haven't added annotations directly to edges that are redundant with those on evidence individuals attached to those edges. I'll upload these models to a new branch/pull request. I have a branch edit that adds these additional annotation axioms so can switch quickly if necessary.

kltm commented 6 years ago

Sorry to run back through your notes this way; I wish there was threading, maybe we should split these out into separate issues?

For this model, there is no occurs_in evidence.

Fixed, but note that it's potentially misleading as there's one set of evidence per model and not all evidence for primary annotation will apply for occurs_in edge.

Yes, this is interesting--I hadn't fully realized that is what was going on here. We currently have no way of stating evidence for a model or individuals in it. In fact, it was a conscious choice that we tried and backed away from early on. I would hazard that it would be better to only have evidence where is was certain and leave "blanks" rather than have over-zealous application of evidence (which could cause oddities later on). @cmungall , any thoughts here?

While not necessarily an issue, I think it may be worthwhile considering the use of special group (providedBy) identifiers for bulk import model data sets so that end users can filter for their inclusion >> and exclusion. These SynGO models would only be reliable filterable on their current title. I would >>recommend a specific or additional providedBy.

Makes sense. Just need to agree on one.

I believe that the only thing we'd need for this is something in groups.yaml (I believe you may already have it here: https://github.com/geneontology/go-site/blob/master/metadata/groups.yaml#L34) and either replace or add in addition to the current geneontology.org.

The model titles are not human readable, making (re)discovery and use possibly difficult.

They are designed so that SynGO curators can move easily between their curation tool and Noctua. I guess we could potentially overload the title field with SynGO model number + short title, but I don't think SynGO has the text for this. @ftwkoopmans ?

I noticed that there was text in the model-level comment annotation. Could that possibly be re-used or shortened?

Edges between individuals contain no date (required), contributor (required), or providedBy (optional, soon required) information.

@kltm Just to check: In addition to the date and contributor on the evidence on each edge (object property assertion), you want a dc:date {some date}, dc:contributor { some source} as annotation axioms on the each edge (object property assertion) ? Let me know ASAP as very easy to add and re-run conversion.

I believe that is correct: every object property assertion (edge) and individual within a model (and the model as well) gets date, contributor, and providedBy. There should be nothing without these annotations.

Generally tagging/invoking @cmungall to check the modeling above.

cmungall commented 6 years ago

As I understand each syngo model contains exactly one GPAD assertion, so we take "one set of evidence per model" as being equivalent to a set of evidence on the "primary" edge in the model.

For title how about generating based on species + gene + primary term, plus the id?

dosumis commented 6 years ago

As I understand each syngo model contains exactly one GPAD assertion, so we take "one set of evidence per model" as being equivalent to a set of evidence on the "primary" edge in the model.

One primary term & one evidence set per syngo model, but there can be multiple extensions. It is possible for there to be multiple syngo models in one noctua model (generated so that can be linked manually later), but in that case there will be one evidence set for each syngo model. Given that, should I remove evidence from the extension?

For title how about generating based on species + gene + primary term, plus the id?

Seems reasonable & easy to implement. @ftwkoopmans - any opinion? Model IDs will still follow SynGO (although full URI will have additional UUID).

dustine32 commented 6 years ago

I was able to clone this project and run it on command-line sbt using the provided SynGO_export_test.json file as input. Here's my attempt to see whether the most recent code addresses these issues:

Date and contributor missing on edges These are on individuals but not axioms. Should be easy to fix.

Evidence missing on occurs_in from extensions Looks like David fixed this.

“contributor” - pulling username (“SynGO-*”) from json instead of ORCID. Waiting on SynGO to provide ORCID’s in export file. Or provide username-to-ORCID mapping?

“providedBy” - Still using "SynGO-VU", which is hard-coded in syngo2lego. Need to pick something that’s existing in groups.yaml. This could get added to groups.yaml but still need to format into URI (like “www.geneontology.org”).

Model titles - Currently SYNGO###. From @cmungall: “For title how about generating based on species + gene + primary term, plus the id?” So format would be: {Species}{UniProtID}_{goTerm}SYNGO{###}? Ex. “Mouse_Q9EQZ7_GO:0016082_SYNGO_113”

Evidence references on part_of edge won’t display right: As mentioned in syngo2lego issue #1, I think this is due to evidence individuals using “source” to specify PMID instead of SEPIO:0000124 term. Although, the same set of evidence (and thus individuals) display correctly on the enabled_by edge. Also, the regenerated model displays part_of references correctly on my server even though the “source” property is still used for PMID. Not sure yet what change fixed this. In 1/18/18 commit to SimpleModel.scala, it looks like David switched from reusing the same axiom for different edges to creating distinct ones (by calling gen_annotations). I can change it to use SEPIO anyways, making sure it doesn’t break again after.

For the example model, I wasn't able to find input json containing SYNGO_1371. Is this in a repository somewhere?

Thanks!

-Dustin

kltm commented 6 years ago

@dustine32 I'm happy to look at any model PR--feel free to put a PR on dev and we can try and get those up. Unfortunately, I'm not sure of the location of any upstream input JSON, you'd probably have to ask @dosumis ?

dosumis commented 6 years ago

Hi @dustine32,

Thanks for working on this. I may not have updated the source JSON file on the repo. Will fix that. Don't see any pull request yet, but more than happy to review.

thomaspd commented 6 years ago

I was talking to the SynGO group earlier this week, and they plan to submit another bolus of annotations in the next couple of weeks.

dustine32 commented 6 years ago

@dosumis Thanks! I haven't put in a PR to syngo2lego yet because I've yet to change any code. I was more just making sure that I could run it and wrap my head around the current state of things. I did put a PR on noctua-models/dev so @kltm could take a look at some of syngo2lego's output when I ran it. It looked like some of the issues were fixed when comparing models currently in noctua-models/dev and the regenerated models.

@thomaspd Great! I wonder if the file source for these annotations is in the same JSON format (could've even fixed the ORCID issue themselves!).

dosumis commented 6 years ago

@dosumis Thanks! I haven't put in a PR to syngo2lego yet because I've yet to change any code. I was more just making sure that I could run it and wrap my head around the current state of things. I did put a PR on noctua-models/dev so @kltm could take a look at some of syngo2lego's output when I ran it. It looked like some of the issues were fixed when comparing models currently in noctua-models/dev and the regenerated models.

Yep. IIRC , I fixed everything everything except model titles and the SEPIO thing (which I wasn't aware of, did this change?)

@thomaspd Great! I wonder if the file source for these annotations is in the same JSON format (could've even fixed the ORCID issue themselves!).

Should be same JSON. There's also a quick check of the json they provide via json schema.
@thomaspd - would be good to emphasise that we want orcids this time.

thomaspd commented 6 years ago

OK, now I see it's the contributor that needs to be the orcid. I will talk to them.

dosumis commented 6 years ago

Simple validation of their json using json_schema is here: https://github.com/geneontology/syngo2lego_data_conversion

travis checks any JSON dropped into the json folder against the schema in spec.

orcid goes in model.username, which is where they put initials now.

Last dataset provided by VU is here: https://github.com/geneontology/syngo2lego_data_conversion/blob/master/json/SynGO_export_2017-12-18_production.json

dustine32 commented 6 years ago

Rad! Thanks @dosumis and @thomaspd !

kltm commented 6 years ago

@dustine32 Looking at latest model batch, basing primarily on http://noctua-dev.berkeleybop.org/editor/graph/gomodel:SYNGO_1371 :

Great things:

Things that can still be worked on?

I think this is great and hope we can get these into production soon--these are a huge improvement in every way and I think easily meet all the required criteria for a "production" model. Thank you!

kltm commented 6 years ago

Assuming no action on the title, I guess the next step would be a PR against the master branch. Once opened, the merge would have to be timed so that we could take the production site offline for a few minutes to rebuild.

dustine32 commented 6 years ago

Thanks much @kltm ! Just put in the PR to master. But now that I look at the title issue again, I suppose I may not have officially asked @ftwkoopmans what his opinion on title generation was. @thomaspd @huaiyumi I know this came up in our meeting this week as well. @cmungall 's suggestion from above was:

For title how about generating based on species + gene + primary term, plus the id?

kltm commented 6 years ago

@dustine32 A title would be an awesome bow on the PR package. I'd hope that anything would be an improvement on the current.

dustine32 commented 6 years ago

@lpalbou Hey Laurent, here's the ticket that tracked some things needing fixing in the SynGO models. Let me know what field ("oboInOwl" something) and what value should be added to the model files and I can plug away. An example would help for sure.

kltm commented 6 years ago

@dustine32 thinks this is cleared, as do I.