OBOFoundry / OBOFoundry.github.io

Metadata and website for the Open Bio Ontologies Foundry Ontology Registry
http://obofoundry.org

Are there OBO Foundry recommendations for minting DOIs for ontology releases? #146

Closed cmungall closed 4 years ago

cmungall commented 9 years ago

continued from: https://github.com/oborel/obo-relations/issues/83

Having both DOIs and PURLs is potentially confusing. However, there are reasons why people want DOIs as well:

The first is purely sociological/policy: the prefix 'doi' has a particular aura around it that connotes permanence, citability. Not sure much more can be said here. Perceptions can and should be changed, but the fact is those perceptions exist, and people providing information artefacts like ontologies will react accordingly.

The other reasons are technical. I'm not an expert on DOIs and the surrounding ecosystem, APIs, etc. But we can contrast with what we currently recommend and provide with PURLs.

In our policy http://obofoundry.github.io/id-policy.html

We give an example:

http://purl.obolibrary.org/obo/obi/2009-11-06/obi.owl

This is a perfectly good URI for machines conversant with the semantic web stack (let's ignore that it's served off of sourceforge for now). It is not very human friendly. Is the idea to have some kind of ontobee html/xsl trick here? Or is there intended to be a different URL for humans? Or is this not a concern?

Additionally, the id-policy guide doesn't provide any guidance for whether an ontology should provide a URL for the whole package. http://purl.obolibrary.org/obo/obi/2009-11-06/ does not resolve. Should it? If so, what to? Machine readable or human readable? Directory listing? Also, what is the policy for ontologies that use imports? Should these be merged in, as OBI does? Should the imports be to versioned IRIs?

To compare what is on offer with DOIs and current github/zenodo integration, I permitted Zenodo to make DOIs for PORO releases. See the top of the README: https://github.com/obophenotype/porifera-ontology

The DOI for the current release is: http://dx.doi.org/10.5281/zenodo.27230

This corresponds to this: http://purl.obolibrary.org/obo/poro/releases/2015-08-08/poro.owl

(although the DOI includes the snapshot of the full poro repo, whereas the purl is just for the main ontology, which imports modules using the unversioned PURLs).
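To make the comparison concrete, here is a minimal sketch that pulls the ontology name and release date out of a versioned PURL. The pattern is an assumption inferred only from the two example PURLs quoted in this thread (one with a `releases/` segment, one without); it is not taken from the id-policy document, and the names `VERSIONED_PURL` and `parse_versioned_purl` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred from the two examples in this thread only:
#   http://purl.obolibrary.org/obo/<ontology>/[releases/]<YYYY-MM-DD>/<file>
VERSIONED_PURL = re.compile(
    r"^http://purl\.obolibrary\.org/obo/"
    r"(?P<ontology>[^/]+)/"
    r"(?:releases/)?"
    r"(?P<version>\d{4}-\d{2}-\d{2})/"
    r"(?P<filename>[^/]+)$"
)


class VersionedPurl(NamedTuple):
    ontology: str
    version: str
    filename: str


def parse_versioned_purl(url: str) -> Optional[VersionedPurl]:
    """Return the parts of a versioned PURL, or None if it does not match."""
    match = VERSIONED_PURL.match(url)
    if match is None:
        return None
    return VersionedPurl(match["ontology"], match["version"], match["filename"])


if __name__ == "__main__":
    print(parse_versioned_purl(
        "http://purl.obolibrary.org/obo/poro/releases/2015-08-08/poro.owl"))
    print(parse_versioned_purl(
        "http://purl.obolibrary.org/obo/obi/2009-11-06/obi.owl"))
```

Note that a bare version directory such as http://purl.obolibrary.org/obo/obi/2009-11-06/ deliberately does not match, since it names no single artefact.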

The two serve different but overlapping purposes. Of course, you would never use a DOI or the http equivalent in an import. And I'm guessing the information systems that use DOIs probably aren't set up to slurp a large OWL file to inspect it for dc elements. The http doi resolves to something a human can just about navigate and use, in contrast to the purl. The zenodo DOI also has a link to a zip with the whole repo, whereas the purl is a single artefact.

But I can see how some would want to cite the DOI in addition to the PURL - resolving the URL in a browser gives something more meaningful to a human.

Additionally, Zenodo carries some level of guarantee that you can get at the contents of that version of the ontology at a future date. Despite the promise implicit in the purl namespace, there is no such guarantee for PURLs (many ontology version purls resolve to sourceforge, which could vanish overnight; others resolve to google code, which is guaranteed to vanish).

It seems that if we don't want people to mint separate DOIs for versions of ontologies, we have to up our game and offer something comparable. We have to provide simple-to-follow instructions combined with infrastructure (note that if you're already using github and using normal github release mechanisms for your ontology, you get the Zenodo DOIs with a simple flick of a switch on a web page). I assume we also have to work with the publishing and research data communities to both change perceptions and also to ensure that the technical solutions interoperate (something I don't know much about, @mellybelly can comment).

alanruttenberg commented 8 years ago

Neither DOIs nor PURLs are supposed to be person friendly. They are supposed to be future friendly. Upping our game does not mean switching to DOIs. It means working on building institutions and plans so we can ensure that our PURLs continue to resolve. The strategies are different. DOIs are not by default resolvable. http://dx.doi.org/10.5281/zenodo.27230 is not a DOI. The DOI is "10.5281/zenodo.27230" (according to wikipedia the recommended print form is doi:10.5281/zenodo.27230, though it notes that crossref disagrees and says the print form should be http://dx.doi.org/10.5281/zenodo.27230). The mapping to a URL is a courtesy and puts that domain in the same position as purl.oclc.org (a single-domain point of failure). It is not part of the standard. The advantage of us having our own domain (purl.obolibrary.org) is that we aren't dependent on any particular infrastructure (as the recent episode is showing). For the most part longevity and respectability come from what the identifiers refer to and, over time, from their remaining useful, not from their technology. Yes, DOIs are used in citations. So are URLs.
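The distinction between the DOI itself and its URL or print forms can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the prefix list covers just the forms mentioned above plus the doi.org variants of the resolver, and the check that a DOI starts with "10." is a simplification rather than a full syntax validation.

```python
# Forms that wrap a DOI; the DOI itself is what remains after stripping them.
DOI_PREFIXES = (
    "http://dx.doi.org/",
    "https://dx.doi.org/",
    "http://doi.org/",
    "https://doi.org/",
    "doi:",
)


def bare_doi(identifier: str) -> str:
    """Strip a resolver URL or 'doi:' print prefix and return the DOI."""
    text = identifier.strip()
    lowered = text.lower()
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            text = text[len(prefix):]
            break
    # Simplified sanity check (assumption): DOIs begin with the "10." directory code.
    if not text.startswith("10."):
        raise ValueError(f"not a DOI: {identifier!r}")
    return text


def as_resolver_url(identifier: str, resolver: str = "http://dx.doi.org/") -> str:
    """Map a DOI onto one particular resolver domain (a courtesy, not the DOI)."""
    return resolver + bare_doi(identifier)


if __name__ == "__main__":
    for form in ("http://dx.doi.org/10.5281/zenodo.27230",
                 "doi:10.5281/zenodo.27230",
                 "10.5281/zenodo.27230"):
        print(form, "->", bare_doi(form))
```

All three spellings reduce to the same DOI; the resolver domain is a separate, swappable piece, which is the single point of failure described above.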

http://purl.obolibrary.org/obo/obi/2009-11-06/ does not resolve

There is no defined behavior for that URL. A particular project could do something with it, and if they published it, it would work just as well. But we have no guidelines. No point specifying what you don't need. I don't know what you mean by "the whole package". The "packages" we produce are ontology documents. If you mean the contents of source code repositories (for those projects that use them - we don't require it) then see below.

Zenodo carries some level of guarantee that you can get at the contents of that version of the ontology at a future date

Zenodo says that, but that doesn't mean that they will accomplish it. I would check what actual mechanisms are in place to ensure it. In any case, organizations are organizations. My own judgement tends to be based on actual performance. Let's check on Zenodo in 10 and 20 years from now. I'm pretty confident that our PURLs can last that long, because I can see 20 years looking back that purls have lasted that long. The domain system, upon which it is based, has been present since 1987 and it seems unlikely that it will go away in the foreseeable future.

Note that the technical argument made for schemes like DOI is that they are not tied to location. A domain like purl.obolibrary.org is also not tied to location, but it can also serve useful information on the web by default, whereas DOIs need some auxiliary mechanism (like the mapping to dx.doi.org) in order to serve anything on the web. Also note that what you get back when you resolve a DOI is not standardized either - yes, there is the idea of some blob and properties, but there doesn't seem to be a uniform policy about what actually comes back. So, with a DOI as with a web URL, that bit comes from the policy of the minter.

On the issue of having our documents be somewhere: we have discussed, in the past, that we should centralize distribution of them, or at least have a policy of archiving them. One way would be to, as part of the ontology release process, send the bits to Amazon Glacier. Any of our ontologies can be saved there for a fraction of a penny. Should a disaster happen we will need to pay to retrieve and redeploy them. We can even use Zenodo as a storage platform, if they are offering bit storage for free, but there is no reason to adopt DOIs as our identifiers, even if we do this.

If we feel that it is in the Foundry's interest to preserve the full contents of repositories then there is work to do to get it into policy and then have a mechanism to make it happen automatically - wherever the bits are stored. If Glacier, then maybe we're talking a few cents per release instead of a fraction of a cent. Another mechanism would be to set up p2p with some subset of projects agreeing to host/seed, and/or to talk to the internet archive about saving them.

jamesaoverton commented 8 years ago

At the risk of wandering far off topic, I've done some thinking about the problem in general over the years.

In Clojure, for example, the language makes a strict distinction between values and references. Values are immutable things, like the number 3 or the string "This is a string". When a value is a long string, for example, it can be convenient to name the value using a hash. Immutability comes with many nice properties, such as foolproof caching.

References point to values, and they can point to different values at different times. You can track the history of a reference as it points to different values.

That's the Clojure nomenclature, but there are parallels in Git and many other systems that emphasize immutability.

When it comes to our ontologies and terms, we could use string representations of released versions as immutable values and most of our current IDs and PURLs as references to them. We could have a system for tracking the history: given a PURL (i.e. reference), what values has it referred to; given a value, what PURLs have referred to it. These bits of history could be provided in different formats, human or machine readable.
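A minimal sketch of that idea, under stated assumptions: released files are treated as immutable byte strings named by their SHA-256 hash, PURLs are references whose history is an ordered list of hashes, and everything lives in memory. The class `ReleaseHistory` and its methods are hypothetical, not an existing OBO tool.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


class ReleaseHistory:
    """Track which immutable values each PURL (reference) has pointed to."""

    def __init__(self) -> None:
        self.values: Dict[str, bytes] = {}                 # hash -> content
        self.history: Dict[str, List[str]] = defaultdict(list)  # purl -> hashes

    def publish(self, purl: str, content: bytes) -> str:
        """Record that `purl` now refers to `content`; return the value's hash."""
        digest = hashlib.sha256(content).hexdigest()
        self.values[digest] = content
        entries = self.history[purl]
        # Only append when the reference actually moves to a new value.
        if not entries or entries[-1] != digest:
            entries.append(digest)
        return digest

    def values_for(self, purl: str) -> List[str]:
        """Given a PURL, list the hashes it has referred to, oldest first."""
        return list(self.history.get(purl, []))

    def purls_for(self, digest: str) -> List[str]:
        """Given a value hash, list the PURLs that have referred to it."""
        return sorted(p for p, hashes in self.history.items() if digest in hashes)


if __name__ == "__main__":
    h = ReleaseHistory()
    # Illustrative PURLs and placeholder content, not real release data.
    v1 = h.publish("http://purl.obolibrary.org/obo/poro.owl", b"release 1")
    h.publish("http://purl.obolibrary.org/obo/poro/releases/2015-08-08/poro.owl",
              b"release 1")
    h.publish("http://purl.obolibrary.org/obo/poro.owl", b"release 2")
    print(h.values_for("http://purl.obolibrary.org/obo/poro.owl"))
    print(h.purls_for(v1))
```

In this sketch a versioned PURL would only ever have one entry in its history, while an unversioned PURL accumulates one entry per release.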

There are many practical problems. The deterministic serialization problem is one, since the same RDF graph can be (and often is) serialized into different strings with different hashes. Working this way also requires a change in mindset.
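The serialization problem is easy to reproduce. The sketch below writes the same two triples in two different orders, shows that the hashes differ, and then hashes a sorted line-per-triple form instead. The helper names and example IRIs are hypothetical, and sorting lines is only adequate under the assumption that the graph contains IRIs and no blank nodes; blank nodes need a proper canonicalization algorithm.

```python
import hashlib
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), all IRIs here


def to_ntriples(triples: Iterable[Triple]) -> str:
    """Serialize IRI-only triples as simple N-Triples-style lines, in given order."""
    return "".join(f"<{s}> <{p}> <{o}> .\n" for s, p, o in triples)


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def canonical_hash(triples: Iterable[Triple]) -> str:
    """Hash a sorted, de-duplicated serialization so triple order does not matter."""
    lines = sorted({to_ntriples([t]) for t in triples})
    return sha256_hex("".join(lines))


if __name__ == "__main__":
    graph = [
        ("http://example.org/a", "http://example.org/partOf", "http://example.org/b"),
        ("http://example.org/b", "http://example.org/partOf", "http://example.org/c"),
    ]
    reordered = list(reversed(graph))
    # Same graph, different strings, different hashes:
    print(sha256_hex(to_ntriples(graph)) == sha256_hex(to_ntriples(reordered)))
    # Same graph, same canonical hash:
    print(canonical_hash(graph) == canonical_hash(reordered))
```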

I don't have a practical proposal, but it's something I think about.

Here are two talks I've enjoyed about the topic:

nlharris commented 4 years ago

Any update on this?

cmungall commented 4 years ago

I think the answer is that some ontologies choose to do this, different OBO people may have different positions on this, but there is as yet no OBO position. If anyone feels strongly, they can reopen.