COMCIFS / cif_core

The IUCr CIF core dictionary

Do we need a `_dictionary.DOI` attribute? #416

Closed jamesrhester closed 1 year ago

jamesrhester commented 1 year ago

We are planning to assign a DOI to each dictionary as well as each version of each dictionary. Is there any point including this DOI within the dictionary itself, e.g. through a _dictionary.doi tag?

One reason it would be pointless is that, if you already have the actual dictionary, the DOI doesn't give you anything new. One reason it might be worthwhile is that the DOI could point to a landing page that provides more background information (in which case, however, the DOI is not really the DOI of the dictionary file itself).

Similar arguments apply to _dictionary.url, so I wonder if that is necessary as well.

rowlesmr commented 1 year ago

It does, however, point to the canonical dictionary, so any edits that may have been introduced can be found out (assuming that the doi/url hasn't been changed).

It also embeds where you found it from, so a period of time later, you can find it again.

jamesrhester commented 1 year ago

So you are talking about an overall DOI rather than a version specific DOI?

publcif commented 1 year ago

As https://doi.org/10.1107/cifdic_000002 is a URI, that could be used as the _dictionary.uri ?

Certainly it is handy to have a version-specific uri in the dictionary for the reasons @rowlesmr gave ...

vaitkus commented 1 year ago

I would say that a version-specific DOI/URL is definitely useful and should be included in the dictionary for several reasons, some of which have already been mentioned by @rowlesmr. Furthermore, the dictionary URI may be used in dictionary import statements (the _import_details.file_id attribute).

Finally, having a DOI inside the dictionary is useful for the same reason it is useful to have a DOI inside a paper -- it makes it easier to cite and reference the work.

jamesrhester commented 1 year ago

Right, so do we want DOI in addition to URI? Or is URI enough?

vaitkus commented 1 year ago

I am happy with a URI as long as it remains stable and supported. Historically, IUCr has been quite good at keeping such things stable.

nautolycus commented 1 year ago

My thoughts on this are influenced by the draft I am currently writing for a chapter of International Tables G (below). The draft assumes the adoption of _dictionary.DOI embedded within the dictionary, that would match the value of _cifdic_dictionary.DOI in the register. Note how I distinguish between 'landing_page' and 'dictionary' in _cifdic_dictionary.DOI_type. In this approach, the value of _dictionary.DOI within the dictionary would then be version-specific.

In case it's not clear, the current proposal for minting DOIs that Chester Editorial Office favours is to assign a DOI to each dictionary landing page (e.g. http://www.iucr.org/resources/cif/dictionaries/cif_core). These will provide links to each archived version of the corresponding dictionary on the IUCr web/ftp sites. However, each individual version (that we publish) will have its own DOI. So, clicking on the CrossRef resolver link to https://doi.org/10.1107/cifdic_000001 takes you to the (human-readable) landing page; but clicking on https://doi.org/10.1107/cifdic_000002 will take you to the dictionary file, in this case https://www.iucr.org/__data/iucr/cif/dictionaries/cif_core_3.2.0.dic . [I believe IUCr is currently in the process of registering these real DOIs with CrossRef.]

Formally approved dictionaries are published from the network services of the IUCr. The CIF section of the IUCr web site provides links to current and archived versions of approved dictionaries, with commentaries and change logs.

Since the dictionaries are machine-readable resources, it is of course useful for software to be able to download them directly from a known uniform resource locator (URL). The dictionaries are currently distributed over two network protocols, ftp and https. The use of the File Transfer Protocol (ftp) goes back to the release of the original dictionary in 1991, when that was the standard means of transferring files across the then still relatively young academic Internet. When the World Wide Web was launched around 1994, web browsers rapidly became the application of choice for retrieving distributed information. For many years, most popular web browsers supported ftp natively, so that the user could visit a URL with an ftp scheme (e.g. ftp://ftp.iucr.org/pub/cifdics/cif_core.dic) and view the contents immediately in the browser. However, browser support for this protocol was dropped by the early 2020s, and so the IUCr now also allows for transport over the secure hypertext transfer protocol (https).

COMCIFS maintains a register of dictionaries known to it, including the identifying name and version strings within those dictionaries. In addition to COMCIFS-approved dictionaries, there are a number of dictionaries used internally by other organizations or users that are known to be properly constructed, so that this register has the potential to be a central resolver for any public dictionary. The register includes the location of each dictionary, expressed (where appropriate) as ftp: and https: based URLs. The location of the register is https://www.iucr.org/__data/iucr/cif/dictionaries/cifdic.register and (to maintain compatibility with ftp-based applications) ftp://ftp.iucr.org/pub/cifdics/cifdic.register.

The IUCr makes every effort to retain published URLs indefinitely, but changes are sometimes forced by external circumstances (e.g. the dropping of native support for ftp transfer by browsers, or the expiry of a registered domain name). Consequently, since 2023, digital object identifiers (DOIs) have also been introduced for dictionaries. A DOI is a persistent identifier that can be resolved to an Internet location using a resolver service, thus allowing for changes in the end-point URL to be handled transparently within the resolver service. Registered DOIs are also included in the dictionary register.
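To illustrate the two ways a DOI is written in this discussion (the bare identifier and the https://doi.org/ resolver URL), here is a minimal Python sketch. The function names `bare_doi` and `doi_url` are hypothetical, and the pattern only checks the general `10.<registrant>/<suffix>` shape; it does not contact any resolver or confirm that a DOI is actually registered.

```python
import re

# Resolver prefix as used for the example DOIs in this thread.
DOI_RESOLVER = "https://doi.org/"

# General shape only: "10.", a numeric registrant code, "/", then a suffix.
_DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def bare_doi(value: str) -> str:
    """Return the bare DOI, given either a bare DOI or a resolver-style form."""
    text = value.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if text.lower().startswith(prefix):
            text = text[len(prefix):]
            break
    if not _DOI_SHAPE.match(text):
        raise ValueError(f"not a DOI: {value!r}")
    return text


def doi_url(value: str) -> str:
    """Return the resolver URL form of a DOI."""
    return DOI_RESOLVER + bare_doi(value)


print(doi_url("10.1107/cifdic_000002"))  # https://doi.org/10.1107/cifdic_000002
```

Because the end-point URL is looked up by the resolver, software that stores only the bare DOI or the resolver URL would not need updating if the file location on the IUCr site changed.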

Table 4.1.2.1 shows some extracts from the current register. Note the convention that the DOI for dictionary entries that do not have an explicit version number takes the user to a landing page where specific versions may be selected. However, the dictionary URLs (for both https and ftp schemes) download the current version of the dictionary file.

Table 4.1.2.1. CIF dictionary register (maintained as a CIF-format file). The https URLs have been abbreviated to fit into the column width. The elided part of the address is __data/iucr/cif/dictionaries.

    data_validation_dictionaries
    loop_
    _cifdic_dictionary.name
    _cifdic_dictionary.version
    _cifdic_dictionary.DDL_compliance
    _cifdic_dictionary.reserved_prefix
    _cifdic_dictionary.date
    _cifdic_dictionary.URL
    _cifdic_dictionary.URL_ftp
    _cifdic_dictionary.DOI
    _cifdic_dictionary.DOI_type
    _cifdic_dictionary.description
    #####################################################
    # COMCIFS approved dictionaries
    #####################################################
    cif_core.dic . 1.4.1 . .
        https://www.iucr.org/.../cif_core_2.4.5.dic ftp://ftp.iucr.org/pub/cifdics/cif_core.dic
        https://doi.org/10.1107/cifdic_000001 landing_page 'Core CIF Dictionary'
    . . . . . . . . . .
    cif_core.dic 2.3.1 1.4.1 . 2005-06-27
        https://www.iucr.org/.../cif_core_2.3.1.dic ftp://ftp.iucr.org/pub/cifdics/cif_core_2.3.1.dic
        . . 'Core CIF Dictionary as published in ITG edition 1'
    . . . . . . . . . .
    cif_core.dic 3.2.0 4.1.0 . 2023-05-30
        https://www.iucr.org/.../cif_core_3.2.0.dic ftp://ftp.iucr.org/pub/cifdics/cif_core_3.2.0.dic
        https://doi.org/10.1107/cifdic_000002 dictionary 'Core CIF Dictionary'
    . . . . . . . . . .
    mmcif_std.dic 2.0.09 2.1.6 . 2005-06-27
        https://www.iucr.org/.../cif_mm_2.0.09.dic ftp://ftp.iucr.org/pub/cifdics/cif_mm_2.0.09.dic
        . . 'Macromolecular CIF Dictionary (ITG edition 1)'
    ####################################################
    # Private dictionaries (re)distributed by the IUCr
    ####################################################
    . . . . . . . . . .
    cif_iucr.dic 1.2 1.4.1 . 2014-07-09
        https://www.iucr.org/.../cif_iucr_1.2.dic ftp://ftp.iucr.org/pub/cifdics/cif_iucr_1.2.dic
        . . 'IUCr private data items for journal publishing'
    . . . . . . . . . .
    ####################################################
    # DDL dictionaries
    ####################################################
    . . . . . . . . . .
    mmcif_ddl_2.1.6.dic . 2.1.6 . 2004-04-15
        https://www.iucr.org/.../mmcif_ddl_2.1.6.dic ftp://ftp.iucr.org/pub/cifdics/mmcif_ddl_2.1.6.dic
        . . 'Relational (DDL2) dictionary definition language'
    DDLm.dic 3.14.0 3.14.0 . 2019-09-25
        https://www.iucr.org/.../DDLm_3.14.0.dic ftp://ftp.iucr.org/pub/cifdics/DDLm_3.14.0.dic
        . . 'Methods dictionary definition language'
    . . . . . . . . . .
    ####################################################
    # Data items in the CIF dictionary register itself
    ####################################################
    cif_register.dic . 1.4 . .
        https://www.iucr.org/.../cif_register.dic ftp://ftp.iucr.org/pub/cifdics/cif_register.dic
        . . 'Data items used within the register of published CIF dictionaries'
    cif_register.dic 1.0 1.4 . 2005-06-24
        https://www.iucr.org/.../cif_register_1.0.dic ftp://ftp.iucr.org/pub/cifdics/cif_register_1.0.dic
        . . 'Data items used in CIF dictionary register'
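As a rough illustration of how software might use such a register, the following Python sketch reads a single loop into a list of records and looks up the DOI for a given dictionary name and version. It is not a real CIF parser: it assumes simple whitespace-separated values with single-quoted strings, handled by the standard-library `shlex` module. The names `parse_loop` and `find_doi` are hypothetical, and the embedded sample is a cut-down version of the register with only five of the columns.

```python
import shlex

# Cut-down sample of the register: only five of the columns are kept.
REGISTER = """
data_validation_dictionaries
loop_
_cifdic_dictionary.name
_cifdic_dictionary.version
_cifdic_dictionary.DOI
_cifdic_dictionary.DOI_type
_cifdic_dictionary.description
# COMCIFS approved dictionaries
cif_core.dic .     https://doi.org/10.1107/cifdic_000001 landing_page 'Core CIF Dictionary'
cif_core.dic 2.3.1 . . 'Core CIF Dictionary as published in ITG edition 1'
cif_core.dic 3.2.0 https://doi.org/10.1107/cifdic_000002 dictionary 'Core CIF Dictionary'
"""


def parse_loop(text: str) -> list[dict[str, str]]:
    """Parse one loop_ of simple values into a list of row dictionaries."""
    # Drop whole-line comments, then split on whitespace, honouring quotes.
    lines = [ln for ln in text.splitlines() if not ln.lstrip().startswith("#")]
    tokens = shlex.split("\n".join(lines))
    pos = tokens.index("loop_") + 1
    names = []
    while pos < len(tokens) and tokens[pos].startswith("_"):
        names.append(tokens[pos])
        pos += 1
    values = tokens[pos:]
    if not names or len(values) % len(names):
        raise ValueError("loop values do not fill whole rows")
    width = len(names)
    return [dict(zip(names, values[i:i + width])) for i in range(0, len(values), width)]


def find_doi(rows, name, version="."):
    """Return the DOI for a name/version pair, or None if none is recorded."""
    for row in rows:
        if (row["_cifdic_dictionary.name"] == name
                and row["_cifdic_dictionary.version"] == version):
            doi = row["_cifdic_dictionary.DOI"]
            return None if doi in (".", "?") else doi
    return None


rows = parse_loop(REGISTER)
print(find_doi(rows, "cif_core.dic", "3.2.0"))  # https://doi.org/10.1107/cifdic_000002
```

Looking up `cif_core.dic` with the default version `.` returns the landing-page DOI, which mirrors the convention described for entries without an explicit version number.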

nautolycus commented 1 year ago

Further follow-on rumination on identifiers/locators. These are ways of identifying and locating a "resource". But what is the resource? It can be one of three things:

[a] a specific dictionary file (fixed version/content)
[b] the current dictionary file (frequently changing content/version number)
[c] "the dictionary": an explanation of what it is, links to any published version, HTML- and PDF-formatted representations, links to ancillary files etc.

The new core dictionary on the verge of IUCr release is cif_core.dic version 3.2.0. It will have two URLs:

https://www.iucr.org/__data/iucr/cif/dictionaries/cif_core_3.2.0.dic
ftp://ftp.iucr.org/pub/cifdics/cif_core_3.2.0.dic

These are both, strictly, URLs to resources of type [a], in that they specify locations for the specific dictionary file.

We haven't adopted a formal URI convention for CIF dictionaries heretofore, but we could argue that a suitable URI could be

ftp://ftp.iucr.org/pub/cifdics/cif_core.dic

That is, it provides a unique identifier for the (version-unspecified) resource "core CIF dictionary". In practice I have symlinked this to the latest dictionary version on our ftp server, so it also acts as a URL for the current version. This could then be a URI for a resource of type [b]. As a URI, it doesn't have to be permanently accessible through that URL (and, indeed, is no longer reachable by a browser link).

People might prefer to move to https to help users who rely on browsers. The current URLs are cumbersome, perhaps ugly, and not guaranteed to be long-term stable (the "__data" component was a requirement of our current content management system, which we will move from in the not too distant future). If https were preferred, we should liaise with Chester to see if they can establish easily maintainable permanent URLs which would still play well with the current and any future CMS.

The current plan for DOIs provides for both type [a] and type [c] access. I suppose it wouldn't be too difficult to mint an additional DOI (for each dictionary) that acts as a URI for type [b]. One advantage would be that the underlying URL to the file location on the IUCr web site can be changed transparently when the CMS changes. One disadvantage of using a DOI-based address as a URI is that the responsible authority (iucr.org) is replaced by doi.org. This is perhaps of no great consequence, but I think it's important to have the sponsoring authority visible.

Note that CrossRef have in my view muddied the waters by recommending that DOIs be presented as URLs/URIs with the https://doi.org/ scheme+host prefix. I would (personally) prefer that _dictionary.DOI be given as 10.1107/cifdic_000002 etc., but that does run counter to CrossRef guidance and almost-universal practice.

So a possible alternative to the previously posted proposal might be as given below (dropping some attributes for clarity). The first entry (the landing page) doesn't have an ftp equivalent location. Strictly, that doesn't rule out assigning it a URI with an ftp: prefix, but if people treat these as hyperlinked addresses, it could lead to confusion.

Another alternative is to keep the DOI as the pure identifier string, but to make e.g. https://doi.org/10.1107/000001 the value of _cifdic_dictionary.URI. Not all historic versions will have DOIs (but then, we haven't previously associated the concept of "URI" with them either). Then we either lose the ftp addresses completely (which I'm reluctant to do, since they have appeared in published material and we can still support those addresses), or we reintroduce _cifdic_dictionary.URL_ftp.

Your thoughts and comments welcome. If we reach an agreed conclusion, the practice for any matching _dictionary.* data names should follow the same principles.

    data_validation_dictionaries
    loop_
    _cifdic_dictionary.name
    _cifdic_dictionary.version
    _cifdic_dictionary.date
    _cifdic_dictionary.URL
    _cifdic_dictionary.URI
    _cifdic_dictionary.DOI
    _cifdic_dictionary.resource_type
    _cifdic_dictionary.description
    #####################################################
    # COMCIFS approved dictionaries
    #####################################################
    cif_core.dic . .
        http://www.iucr.org/resources/cif/dictionaries/cif_core .
        10.1107/cifdic_000001 landing_page 'Core CIF Dictionary'

    cif_core.dic . .
        https://www.iucr.org/__data/.../cif_core.dic ftp://ftp.iucr.org/pub/cifdics/cif_core.dic
        10.1107/cifdic_000099 dictionary 'Current version of Core CIF Dictionary'

    cif_core.dic 2.3.1 2005-06-27
        https://www.iucr.org/__data/.../cif_core_2.3.1.dic ftp://ftp.iucr.org/pub/cifdics/cif_core_2.3.1.dic
        . dictionary 'Core CIF Dictionary as published in ITG edition 1'

    cif_core.dic 3.2.0 2023-05-30
        https://www.iucr.org/__data/.../cif_core_3.2.0.dic ftp://ftp.iucr.org/pub/cifdics/cif_core_3.2.0.dic
        10.1107/cifdic_000002 dictionary 'Specific version of Core CIF Dictionary'

publcif commented 1 year ago

I am not entirely sure what the issue is here, but basically giving a dictionary a version-specific DOI seems as robust as it can get, even if the DOI has to be represented as a URI because we don't have a data name for it (yet :-)

If the argument is 'registered DOI versus URI/URL' then I would favour DOI, even if presented as a full URI (after all, that is what happens in practice).

DOIs are persistent and if the IUCr assigns them to a publication, then the IUCr takes responsibility for ensuring their persistence and that they resolve to the location of the object (which in the case of www.iucr.org content is more than likely to change in a year or so).

In the interests of accessibility, this would not be via a protocol that is no longer widely supported (currently ftp is blocked by many browsers; indeed, this week an author complained that they could not access ftp://ftp.iucr.org/templates/latex and I had to use wget from the command line to get a copy of the package in order to make a zip archive available, which is not very user friendly :-)

Obviously a 'landing page' for a domain's dictionary (which I believe will be assigned its own DOI by IUCr's publication systems) could provide links to numerous alternative downloads, but the target will be the same object as versioned by its DOI (if assigned and referenced in its metadata).

I don't think the IUCr will knowingly abandon any 'historic' links to published resources (definitely if registered via DOI).

So when the IUCr registers a DOI for this first official DDLm version of cif_core, I'd be very happy to see its _dictionary.uri take the form of https://doi.org/10.1107/cifdic_000002 (or preferably something that better reflects the name and version).

jamesrhester commented 1 year ago

We seem to have drifted a bit. We are not discussing whether or not to assign a DOI to a dictionary, we are discussing whether or not that DOI should be stated within the dictionary itself. As far as I can tell the winning argument on that front is that it makes it easier to cite the dictionary if the DOI is there in front of you when you open up the dictionary.

As there have been no objections to doing this, the next step is to prepare a PR for the new _dictionary.DOI item, which would be informed by @nautolycus's comments above. Should it be the landing page, the dictionary in general, or the specific dictionary version? I say the particular version.
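To make the citation argument concrete, here is a small Python sketch of how a tool might build a reference string from a dictionary header that carries a version-specific DOI. Everything here is an assumption for illustration: `_dictionary.DOI` is only the item proposed in this issue, the header fragment is a made-up example rather than an extract from a released dictionary, and `dictionary_items` and `cite` are hypothetical helper names.

```python
import re

# Illustrative header fragment; _dictionary.DOI is the proposed item, not an existing one.
HEADER = """
data_CORE_DIC
_dictionary.title    CORE_DIC
_dictionary.version  3.2.0
_dictionary.DOI      10.1107/cifdic_000002
"""


def dictionary_items(text: str) -> dict[str, str]:
    """Collect simple, unquoted `_dictionary.*` name/value pairs."""
    pattern = re.compile(r"^\s*(_dictionary\.\w+)\s+(\S+)\s*$", re.MULTILINE)
    # CIF data names are case-insensitive, so normalise the keys to lower case.
    return {name.lower(): value for name, value in pattern.findall(text)}


def cite(text: str) -> str:
    """Build a short reference string from the dictionary's own metadata."""
    items = dictionary_items(text)
    return "{} version {}, https://doi.org/{}".format(
        items["_dictionary.title"],
        items["_dictionary.version"],
        items["_dictionary.doi"],
    )


print(cite(HEADER))  # CORE_DIC version 3.2.0, https://doi.org/10.1107/cifdic_000002
```

With a version-specific DOI in the header, the reference can be generated from the file alone, without consulting the register or a landing page.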

nautolycus commented 1 year ago

> I say the particular version.

I agree on this specific point.