PROconsortium / PRoteinOntology


Use a standard vocabulary and annotation property for indicating metaclass #150

Open cmungall opened 5 years ago

cmungall commented 5 years ago

PRO has an implicit ontology of metaclasses. These are currently represented by overloading rdfs:comment with a value like Category=sequence.

This is suboptimal, as the metaclass is not represented as a URI, so the user has no way of following this string to see what it means.

This is presumably due to an early obo format limitation, but it is now possible to have arbitrary annotation assertions in obo format.

I propose to use the biolink vocabulary here. For each class, add a triple:

PR_nnnn bl:category bl:ProteinIsoform or similar

See also https://github.com/biolink/biolink-model/issues/230

But whichever system is used, we need to ontologize the metaclasses. I think the PRO group are ideally placed to do this as they have thought about this a lot, and it deserves to be represented as a computable artefact rather than hidden in comments.
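To make the proposal concrete, here is a minimal Python sketch of turning the existing comment convention into a triple of the proposed shape. Everything specific in it is an assumption made for illustration: the biolink namespace IRI, the mapping from the comment value "sequence" to bl:ProteinIsoform, the helper name category_triple, and the placeholder class ID. It is not PRO's actual pipeline.

```python
import re

# Assumed namespace IRIs, used here for illustration only.
OBO = "http://purl.obolibrary.org/obo/"
BIOLINK = "https://w3id.org/biolink/vocab/"

# Hypothetical mapping from comment categories to metaclass terms.
# Only one entry is shown; a real mapping would need the PRO group's input.
CATEGORY_TO_METACLASS = {
    "sequence": "ProteinIsoform",
}

# Matches the value in a comment such as "Category=sequence."
CATEGORY_PATTERN = re.compile(r"Category=([A-Za-z-]+)")


def category_triple(class_id, comment):
    """Return an N-Triples line for the category in the comment, or None."""
    match = CATEGORY_PATTERN.search(comment)
    if match is None:
        return None
    metaclass = CATEGORY_TO_METACLASS.get(match.group(1))
    if metaclass is None:
        return None  # category has no metaclass term assigned yet
    subject = OBO + class_id.replace(":", "_")
    return f"<{subject}> <{BIOLINK}category> <{BIOLINK}{metaclass}> ."


# "PR:000012345" is a placeholder ID chosen for illustration only.
print(category_triple("PR:000012345", "Category=sequence."))
```

The point of the sketch is that the category becomes a URI a user can follow, rather than a string buried in a comment.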

nataled commented 5 years ago

The original intent of the categories was just to have a way of explaining the overall structure of PRO. Other than that, we pretty much use them for internal purposes, and they've evolved accordingly. The simplest way of thinking about them is as subsets. In fact, from time to time we consider turning them into proper subsets, but really there is zero benefit to doing so (they are already internally computable, and no one has ever asked for it). We never really intended to ontologize them; they'd need a lot of work. We do, however, plan to expose them in some way (that is, take them out of the comments). By the way, there is documentation on what they mean:

https://proconsortium.org/PRO_QA.pdf (specifically, Q4)

Side note: looks like this doc needs updating.

cmungall commented 5 years ago

I don't know how other people use PRO, but it would seem to be really useful to many people. I am not sure I would jump to stating zero benefit.

For example, most of the ontologies I work on that use PRO don't use any species-specific level information. Making the import chain is a pain due to the size of PRO. Ideally PRO would provide downloadable subsets for cuts like this, but in the absence of this it's easier to do a SPARQL query based on a predictable property. While we could do this by encoding the comment text in the query, this is obviously suboptimal.
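As a rough illustration of the difference, the Python sketch below builds the two kinds of SPARQL query as strings. The biolink namespace IRI, the bl:category property, the metaclass name, and both helper function names are assumptions for the sake of the example, not something PRO currently publishes.

```python
# Assumed biolink namespace IRI, used here for illustration only.
BIOLINK = "https://w3id.org/biolink/vocab/"


def query_by_property(metaclass):
    """Select classes via a dedicated annotation property (hypothetical)."""
    return (
        f"PREFIX bl: <{BIOLINK}>\n"
        "SELECT ?cls WHERE {\n"
        f"  ?cls bl:category bl:{metaclass} .\n"
        "}"
    )


def query_by_comment(category):
    """Select classes by matching text inside rdfs:comment (fragile)."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?cls WHERE {\n"
        "  ?cls rdfs:comment ?comment .\n"
        # Substring match: breaks if the comment wording or spacing changes.
        f'  FILTER(CONTAINS(STR(?comment), "Category={category}"))\n'
        "}"
    )


print(query_by_property("ProteinIsoform"))
print(query_by_comment("sequence"))
```

The first query depends only on a stable property and URI; the second depends on the exact wording of free text, which is why it is the less attractive option.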

I think subsets would be better than the current situation but I think having a dedicated property would be better.

"We never really intended to ontologize them; they'd need a lot of work"

What would the work be? I mean, you already have an implicit ontology that has great documentation in a PDF file. I don't think this needs to be overthought. Just a URI for each concept, included in PRO or an ancillary ontology.

We will likely end up doing this in the biolink model anyway, as we need a computable way to distinguish entries that denote generic forms from variant forms etc.