HUPO-PSI / psi-ms-CV

HUPO-PSI mass spectrometry CV
Other
28 stars 38 forks source link

Fix problems with obo file #26

Closed cmungall closed 2 years ago

cmungall commented 4 years ago

There are some additional issues reported here: http://obo-dashboard-test.ontodev.com/ms/dashboard.html

I recommend also

mwalzer commented 4 years ago

@cmungall Hi and thanks for the advice, timing is (almost) fortunate, our psidev-ms-vocab coordinator recently left (due to a job-change). We appreciate the advice and maybe you can guide us setting the right automated mechanisms in place to avoid many (future) problems when we add new content? As a follow-up question, is obofoundry the place to find the PURLs our obo desires and if not exclusively, where to look? And Regarding xrefs and owl, I think this is a legacy issue, since it was easiest to use a simple text format over a markup format. Although the xref is inspired by the way it is used in xsd files. At least that is what I figured - wasn't there when that began. Is there documentation on how obo and/or owl are supposed to be used? Again, any form of help is appreciated!

cmungall commented 4 years ago

Thanks @mwalzer!

I think the best thing to start on is documenting as much as possible (preferably as markdown files managed in this repo) as stated above.

It would also be good to have some technical person at the PSI end able to help with any directions I can give. This need not be too much. No ontology expertise necessary, just some basics of git, running things on the command line, and some github knowledge (e.g. travis).

is obofoundry the place to find the PURLs our obo desires and if not exclusively, where to look

Yes, but you have the right to have the OBO PURLs for MS direct wherever you like. Currently OBO central slurps your file and builds release products, and the PURLs redirect to our S3. See http://obofoundry.org/faq/what-is-the-build-field.html. I am confident that we can work towards a situation where you have a workflow you can run and be in control of this yourself.

Regarding xrefs and owl, I think this is a legacy issue

Would there be negative implications of removing these? What software depends on them?

Is there documentation on how obo and/or owl are supposed to be used?

There is a larger answer here but I think let's focus on having your ontology be browsable in major ontology browsers, be consumable by tool chains that provide services such as metadata annotation etc.

Some of these (older) depend on obo format, newer ones typically depend on owl.

More later, I'm sure we can get this up and running

chambm commented 4 years ago

@cmungall I think we still don't have an official maintainer, but I'd volunteer to get this repo in a state where it can be easily maintained by automatically validated PRs run with Travis (or whatever other CI is easiest). I read about ROBOT in the ODK. Is it actually necessary to use the full ODK repo template, or can we add validation CI with the existing layout, while forgoing the other ODK features like release/merging/editing?

In your repo you will see a README-editors.md file that has been customized for your project. Follow these instructions.

Generally the cycle is to:

branch
edit the edit.owl file
make test
git commit
git push
To make a release:

make prepare_release

Note that any make step can be preceded by run.sh if you have Docker installed:

sh run.sh make prepare_release

Until somebody volunteers to be a real maintainer, I think we'd rather stick with a simpler editing paradigm. Specifically, most edits will be simple addition of a few terms. Anybody should be able to get the hang of it, just editing with the GitHub text editor (another reason to continue using .obo format) and make a PR on this repo. GitHub handles the forking/editing/branching/merging. Travis will validate it, and if it passes, an admin can merge it. The raw obo file itself will effectively be the release (an owl conversion could be provided as well). Probably quite a few issues from http://obo-dashboard-test.ontodev.com/ms/dashboard.html should be addressed (fixed or muted) in the process so that true validation errors are more obvious.

chambm commented 2 years ago

Wasn't this addressed by automatic creation of OWL files by GitHub Actions (#88 )?

edeutsch commented 2 years ago

I think the answer is yes.

cmungall commented 2 years ago

there are still the issues I highlighted above re imports, you should use the correct PURLs, the existing OWL could fail at any time

chambm commented 2 years ago

I think the xrefs for values are indeed a legacy we left in to avoid breaking tools that already depended on it: https://github.com/HUPO-PSI/psi-ms-CV/pull/50

Does anybody know what tools actually do depend on it though, and can we get them to update to has_value_type?

edeutsch commented 2 years ago

I know that I have some in-house PeptideAtlas related code that parses and uses xref:value-type. I suspect there are other various other tools that use it. But if it useful, I do not mind forcing the issue and migrating xref:value-type to relationship: has_value_type in our code to use the new syntax. I have not done so yet, but could do that easily given some motivation.

I do not know any other specific software that uses it. I might guess that the various validators like the mzML validator and mzIdentML validators probably use it. they should use it, but I vague recall that maybe they did not validate value-type properly, so maybe they don't. It would be helpful if others could chime in with that they know.

cmungall commented 2 years ago

Thanks Eric, I know how much work it is to evaluate the impact of changes on widely used standards on a diverse range of software tools, much appreciated!

I think if there is willingness to change, I think it's worth doing it right. Unfortunately the "relationship: has_value_type" fix is not really much better, you are just going from overloading one field to a different field.

For example: https://www.ebi.ac.uk/ols/ontologies/ms/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fxsd_float

Once some information about how the existing MS standard is being used I can make some proposals about how to do something that captures what it is you need to capture without any overloading in a way that can be used in validation. It may be the case that it's most expedient to freeze the existing obo as a legacy artefact and build something using a more fit for purpose framework

On Mon, Apr 25, 2022 at 10:24 PM Eric Deutsch @.***> wrote:

I know that I have some in-house PeptideAtlas related code that parses and uses xref:value-type. I suspect there are other various other tools that use it. But if it useful, I do not mind forcing the issue and migrating xref:value-type to relationship: has_value_type in our code to use the new syntax. I have not done so yet, but could do that easily given some motivation.

I do not know any other specific software that uses it. I might guess that the various validators like the mzML validator and mzIdentML validators probably use it. they should use it, but I vague recall that maybe they did not validate value-type properly, so maybe they don't. It would be helpful if others could chime in with that they know.

— Reply to this email directly, view it on GitHub https://github.com/HUPO-PSI/psi-ms-CV/issues/26#issuecomment-1109357814, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAMMON5DYDDMEMYSNR3EPDVG54ZNANCNFSM4LFZTUNQ . You are receiving this because you were mentioned.Message ID: @.***>

chambm commented 2 years ago

I think our use of xref using a field with a well-defined purpose for a different purpose (i.e. misusing). But relationship: has_value_type is extending a field that was designed to be extended. Like any extension, consumers must be explicitly designed to handle it since it's not part of the format specification, but also it should not break any consumers that have not been designed to handle it.

So while I'm sure there are other formats with built-in support for type and range properties, I very much doubt that freezing the OBO and switching future releases to other formats would be more expedient. On the other hand, we could continue with the OBO like it is, but add an automatic transformation to one of those formats with built-in support for types, where those built-in properties are populated by the has_value_type extension. How does that sound?

(I still support removing the misuse of xref if it's practical to have consumers switch over though)

cmungall commented 2 years ago

Hi Matt

I am the one to blame for obo format. The relationship tag wasn't designed to be extended any more than xref. Its purpose is to link terms together by a defined relationship type. So the current usage in MS is on the surface not too far off.

The problem here is that

  1. you are treating XSD datatypes as terms (classes)
  2. obo format unfortunately doesn't handle URIs outside the OBO namespace well

This causes all kinds of issues for normal uses of obo format, and in the resulting OWL, see the OLS link

However, this may very well be moot, as PSI software seems to be using the obo files according to specific conventions

On the other hand, we could continue with the OBO like it is, but add an automatic transformation to one of those formats with built-in support for types, where those built-in properties are populated by the has_value_type extension. How does that sound?

I think this sounds quite practical in the short term. I do think that longer term the source of truth should be a more appropriate format with the translation happing in the other direction, to generate the legacy obo format product

On Tue, Apr 26, 2022 at 8:25 AM Matt Chambers @.***> wrote:

I think our use of xref using a field with a well-defined purpose for a different purpose (i.e. misusing). But relationship: has_value_type is extending a field that was designed to be extended. Like any extension, consumers must be explicitly designed to handle it since it's not part of the format specification, but also it should not break any consumers that have not been designed to handle it.

So while I'm sure there are other formats with built-in support for type and range properties, I very much doubt that freezing the OBO and switching future releases to other formats would be more expedient. On the other hand, we could continue with the OBO like it is, but add an automatic transformation to one of those formats with built-in support for types, where those built-in properties are populated by the has_value_type extension. How does that sound?

(I still support removing the misuse of xref if it's practical to have consumers switch over though)

— Reply to this email directly, view it on GitHub https://github.com/HUPO-PSI/psi-ms-CV/issues/26#issuecomment-1109935623, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAMMOPYQFY37XX4TMMG5L3VHADE7ANCNFSM4LFZTUNQ . You are receiving this because you were mentioned.Message ID: @.***>

chambm commented 2 years ago

Ah, I didn't actually know that the target of the relationship tag is supposed to be a term. So to avoid the problems you mentioned in the short term, we'd need to make the XSD types into proper terms? Or is there another (small) ontology for computer data types we can import from? I think we used XSD out of convenience, but the main thing we were trying to capture was "string", "real", or "integer". I don't think we are too concerned with ranges, judging by how easy it is to find terms that are using "integer" when they almost certainly should be "nonNegativeInteger".

cmungall commented 2 years ago

On Tue, Apr 26, 2022 at 9:58 AM Matt Chambers @.***> wrote:

Ah, I didn't actually know that the target of the relationship tag is supposed to be a term. So to avoid the problems you mentioned in the short term, we'd need to make the XSD types into proper terms? Or is there another (small) ontology for computer data types we can import from? I think we used XSD out of convenience, but the main thing we were trying to capture was "string", "real", or "integer". I don't think we are too concerned with ranges, judging by how easy it is to find terms that are using "integer" when they almost certainly should be "nonNegativeInteger".

This is on the right lines. However, we can't technically make the xsd types into terms (classes), in OWL-DL (the semantics of OBO are in OWL-DL) datatypes and classes are disjoint. This may seem arcane but there are some good reasons

(note if we were using OWL then we would just model things like 'sample mass' as datatype properties, explicitly assign xsd types as ranges - unfortunately obo is less expressive than owl, and was never intended for the kinds of things you're doing)

So a compromise would be terms that "shadow" xsd types. There are in fact various ontologies that do this: https://bioportal.bioontology.org/search?query=integer&button= -- not to say it's a good solution, just probably the least worst given various constraints of legacy tools, formats, etc. You could have xrefs that map the shadow datatypes to the xsd datatypes (xrefs are semantically silent so whilst unconventional it's not wrong)

I am not sure if there is apetite for an MS2 standard. I would be interested in helping. I would not base this on obo format (but we could have lossy transforms). OWL would be better. But better still would be a language that is explicitly designed for this kind of thing such as LinkML ( https://linkml.io/linkml)... which I am a part of. This would allow you to model things like:

— Reply to this email directly, view it on GitHub https://github.com/HUPO-PSI/psi-ms-CV/issues/26#issuecomment-1110035070, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAMMOLCBU4USILLUYCMPLLVHAOCPANCNFSM4LFZTUNQ . You are receiving this because you were mentioned.Message ID: @.***>

edeutsch commented 2 years ago

While I love the idea of the beautifully neat and organized MS2 ontology, I don't have the time to devote to it, and as far as I can tell, neither does anyone else in the collaboration that works on the current one. We barely scrape together enough interest to extend the current one in service of the formats that we develop and maintain. And I am under the impression that the MS ontology is not really useful to the greater ontology community at large. There's no point in Translator bothering to incorporate it, it's too niche, really. So, with no clamoring customers beyond what we already have and no one having time and expertise to devote time to rebuilding it, I don't see it happening. I totally get what you're saying, I don't disagree. I just see us breaking out of our current mode as a practical matter. I'd be thrilled to help someone who wanted to do it. But it's a person-year of effort to rebuild a new MS2 with all new tooling and make it workable for all current uses. Not to mention all existing software that uses the OBO.

chambm commented 2 years ago

Thanks for the insight and recommendations Chris. Note that this OBO started as (and arguably still is) a controlled vocabulary rather than an ontology. Eric is right we don't have the time on both the producer and consumer sides to totally revamp this to a different format with a more formal structure. I think we can go with just adding the XSD types as terms so we're not violating the spec. Is there a way we can do that without changing the term ids we're currently referencing (xsd:float)? Something like:

[Term]
id: MS:xsd\:float
def: "32-bit floating point value." []

Because our default-namespace is MS, we should be ok to leave off the MS prefix with: relationship: has_value_type xsd\:float ! The allowed value-type for this CV term

It might break some parsers that expect the ids to be numeric, but as far as I can tell that's not actually a requirement of OBO (whereas a relationship must refer to an id, not the name of a term; that's less clear for an xref though)

cmungall commented 2 years ago

id: MS:xsd\:float

the double quote will cause all kinds of issues and doesn't really have any advantages as it's only implicitly related to the actual xsd IRI

I would say if you do go this route just make this like any other term and just xref to xsd

Because our default-namespace is MS, we should be ok to leave off the MS prefix with

sorry! doesn't work like that

anyway it sounds like it doesn't really matter either way, MS uses it's own internal conventions with obo files and as Eric says t's not really used outside of some proteomics tools

we should probably mark MS as inactive in OBO if that's OK?

chambm commented 2 years ago

id: MS:xsd:float

the double quote will cause all kinds of issues and doesn't really have any advantages as it's only implicitly related to the actual xsd IRI

I would say if you do go this route just make this like any other term and just xref to xsd

Because our default-namespace is MS, we should be ok to leave off the MS prefix with

sorry! doesn't work like that

I'm confused, isn't https://www.ebi.ac.uk/ols/ontologies/ms/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fxsd_float trying to resolve the un-IDspace-prefixed term inside the MS IDspace? It's a dangling reference, right? If that's not the problem, what is the problem you were trying to point out with that link?

anyway it sounds like it doesn't really matter either way, MS uses it's own internal conventions with obo files and as Eric says t's not really used outside of some proteomics tools

we should probably mark MS as inactive in OBO if that's OK?

What would be the consequences of that? It is sometimes convenient to use the OBO browsers to view or share the structure of the MS OBO, e.g. to share a hierarchy of terms with a collaborator. It's also useful for people who are looking for a CV or ontology for MS to be able to find it in the common ontology repositories.

cmungall commented 2 years ago

On Wed, Apr 27, 2022 at 7:48 AM Matt Chambers @.***> wrote:

Thanks for the insight and recommendations Chris. Note that this OBO started as (and arguably still is) a controlled vocabulary rather than an ontology.

Just a quick aside on this. All my comments are driven by pragmatic concerns such as (1) can proteomics tools get the information they need in a reliable and predictable way now and in the future (2) will existing more generic tools that consume ms.obo and ms.owl (e.g. OLS, OBO Library, various software libraries) do the right thing or at least fail gracefully in ways that don't confuse users.

I'm not interested in being the ontology police or gatekeeping definitions of ontology -- ontology, terminology, controlled vocabulary, metadata standard, it's all a fuzzy continuum, what's useful is what matters.

Message ID: @.***>

cmungall commented 2 years ago

I'm confused, isn't https://www.ebi.ac.uk/ols/ontologies/ms/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fxsd_float trying to resolve the un-IDspace-prefixed term inside the MS IDspace? It's a dangling reference, right? If that's not the problem, what is the problem you were trying to point out with that link?

there are multiple technical issues. It's something that looks like xsd:float, but is not, it has an OBO library PURL that is not resolvable, it's a dangling reference with no information about it... but most of all I think it's just plain confusing to users

we should probably mark MS as inactive in OBO if that's OK?

What would be the consequences of that? It is sometimes convenient to use the OBO browsers to view or share the structure of the MS OBO, e.g. to share a hierarchy of terms with a collaborator. It's also useful for people who are looking for a CV or ontology for MS to be able to find it in the common ontology repositories.

It would still be browsable and show up in OLS, ontobee, etc, this would serve as a marker that it's not intended to be updated to follow standard interoperability rules and shouldn't be imported into other ontologies.