COMCIFS / Powder_Dictionary

CIF definitions for powder diffraction
4 stars 4 forks source link

Deprecation of `_pd_block.id`? #56

Open rowlesmr opened 1 year ago

rowlesmr commented 1 year ago

Split off from #39

From @vaitkus speaking on the role of _ps_phase.id and _pd_diffractogram.id:

  1. If one of the intents is to be able to distribute the same dataset in various alternative forms (e.g. as a single CIF file, as multiple files of potentially different data formats) maybe it would also make sense to introduce some kind of a _pd_dataset.uuid data item that would allow to easily link these separate files? If I correctly understand the current approach, the _pd_phase.id data item should generally be used for this purpose, however, I am unsure how well it would work outside of a specific context (e.g. if we encounter the same id in different files outside of an archive file, how sure can we be that they indeed describe the same dataset?). Use of a UUID would greatly reduce the probability of a id clash. The same can probably also be said for the _pd_block.id value format since it uses a timestamp as one of its components, but it does not seem that currently this format is enforced in the descriptions of _pd_diffractogram.id and related items (and potentially this format is not even desired for these fields due to its complexity/lack of brevity).

Potentially heretical thing coming.

From what I've gathered, @briantoby introduced _pd_block_id as a way to link between data blocks, as CIF (at that time) had no way of different data blocks to talk to each other, as everything was essentially single-phase, single-pattern data blocks. Block IDs were used to identify blocks containing either phase or diffractogram, or phase and diffractogram, hence the need for _pd_phase_block_id and _pd_block_diffractogram_id to differentiate between blocks containing phase or diffractogram information.

We now have _pd_phase.id* and _pd_diffractogram.id. If the construction of these ID values is brought inline with the current suggestion for _pd_block.id (ie more "uuid-like"), do we even need block IDs? I'm not knowledgable enough to make the call, but could the move to CIF_2.0 be used deprecate block IDs? To make the dictionary easier to maintain, could we split off deprecated data items into their own dictionary (cif_pow_deprecated.dic ?)**, such that they are still able to be used, but new additions to the dictionary would use _pd_phase.id and _pd_diffractogram.id to link between phase and diffractogram***.

.

* Yes, _pd_phase_id existed previously, but it was only there to link between _pd_phase_block_id and _pd_refln.phase_id.

** Maybe it can be auto-included, like cif_core does with definitions in things like Miller indices...?

** we will need to make something like _pd_phase.short_id and _pd_diffractogram.short_id, which have block-scope (rather than being unique) to link between things like _pd_refln.phase_id, _pd_pref_orient_March_Dollase.diffractogram_id, etc. so that we don't have to have huge, UUID-like values everywhere. In this example, _pd_phase.short_id would be an alias for _pd_phase_id. There would need to be a way to loop _pd_phase.short_idwith _pd_phase.id in a datablock to link the two together.

rowlesmr commented 1 year ago

for the comment "**" above, there is an idea here: https://github.com/COMCIFS/Powder_Dictionary/issues/71#issuecomment-1374419423

jamesrhester commented 1 year ago

I think this is really a policy decision for the pdCIF community. Including block pointers is harmless and potentially useful, but may be adding additional complexity that is never used in practice. Probably best to leave it as a low-priority optional addition, and if someone wants them enough I have no issue in someone else doing the work to add them to the dictionary. Indeed such work could be automated if diffractogram_id and phase_id data names are present in the dictionary.

This issue is as good a place as any to discuss I suppose.

rowlesmr commented 1 year ago

The only reason that I mention it is that if it stays, we need to maintain two different ways of doing the same thing; block_id and phase&diffractogram_id, and then things get murky.

New categories that need to refer to different places will need to have _pd_category.diffractogram_block_id, _pd_category.diffractogram_id, _pd_category.phase_block_id, and _pd_category.phase_id. What do we then make the category key? the phase_id or the block_phase_id? do we need to have dREL to autopopulate one or more of these keys? which takes primacy? Do implementers have to then implement two ways of linking dataitems in pdCIF?

I can see the appeal in the block id as to recording the history associated with a block, assuming you follow the suggestion, but I don't know how it is to work in a system that isn't supposed to be block-aware (not that I know much about that in the first place).

jamesrhester commented 1 year ago

New categories that need to refer to different places will need to have _pd_category.diffractogram_block_id, _pd_category.diffractogram_id, _pd_category.phase_block_id, and _pd_category.phase_id. What do we then make the category key? the phase_id or the block_phase_id? do we need to have dREL to autopopulate one or more of these keys? which takes primacy? Do implementers have to then implement two ways of linking dataitems in pdCIF?

Let me answer those questions: The category keys are always the phase_id and diffractogram_id children (and potentially other per-category data names ). dREL can't usually auto-populate the phase_id/diffractogram_id keys as they are the inputs, not the outputs, but could easily auto-populate the block pointers. phase_id and diffractogram_id child data names have primacy. From a CIF point of view the phase_id and diffractogram_id data names must always be provided, I leave it up to the pdCIF community to assess the value of block pointers going forward. As a member of that community I am clearly shirking my duty to have an opinion here.

I can see the appeal in the block id as to recording the history associated with a block, assuming you follow the suggestion, but I don't know how it is to work in a system that isn't supposed to be block-aware (not that I know much about that in the first place).

Yep, you would end up with a long list of block names and no idea which ones were previous versions of which other ones. If we care we can add a data name "previous_version" or something, which is information that is not strictly available currently as the order of rows in a loop is arbitrary. I don't think this is important.

rowlesmr commented 1 year ago

I'll get in touch with Dave Billing and see if he can send out emails et al.

@briantoby do you have an opinion?

briantoby commented 1 year ago

Historically the motivation of the ids and pointers was to provide a structure that logically connects CIF data blocks in a persistent manner. This was needed because initial practice in the IUCr was to reduce multi-block CIFs to files each with a single block. Block names (as the name following data_...) were considered arbitrary and could be changed by the recipient. Powder diffraction requires the logical connection of multiple CIF data blocks. It became clear later that this was also needed for cases where a set of data might reference a different set of data that one would not want to include in a single file. (An example of this might be where a set of calibration measurements are repeated weekly; each dataset collected in that week would reference the calibration results, but one would not want to include the entire calibration CIF in each collected dataset.)

There are many different ways I could imagine this being done. I am not likely to change the code that produces these links in GSAS-II anytime soon and I do not expect any change will ever be made for GSAS2CIF in GSAS/EXPGUI (and that software suite refuses to die), but I see no reason to object to finding a better mechanism for inter-block referencing this to eventually supersede my initial attempt.

My only question here is now does one reference a block that does not define either a phase or a dataset and thus has neither _pd_phase.id nor _pd_diffractogram.id?

rowlesmr commented 1 year ago

There are many different ways I could imagine this being done. I am not likely to change the code that produces these links in GSAS-II anytime soon and I do not expect any change will ever be made for GSAS2CIF in GSAS/EXPGUI (and that software suite refuses to die), but I see no reason to object to finding a better mechanism for inter-block referencing this to eventually supersede my initial attempt.

I don't want to remove block ids, so how it currently works will still work. I just forsee issues with adding new features and maintaining two different ways of linking and communicating to users their compatibilities or otherwise.

My only question here is now does one reference a block that does not define either a phase or a dataset and thus has neither _pd_phase.id nor _pd_diffractogram.id?

Not an answer, but an extension question (probably for @jamesrhester): how do we link, for instance, publication information, to a number of blocks. If there is a "summary" block containing the publication information (from CIF_CORE) or a series of PD_INSTR values , how is this formally linked to the applicable blocks in the collection? Does the mere presence of a _pd_diffractogram.id in a summary block do this job? or is there needed a formal data item?

jamesrhester commented 1 year ago

My only question here is now does one reference a block that does not define either a phase or a dataset and thus has neither _pd_phase.id nor _pd_diffractogram.id?

Not an answer, but an extension question (probably for @jamesrhester): how do we link, for instance, publication information, to a number of blocks. If there is a "summary" block containing the publication information (from CIF_CORE) or a series of PD_INSTR values , how is this formally linked to the applicable blocks in the collection? Does the mere presence of a _pd_diffractogram.id in a summary block do this job? or is there needed a formal data item?

The mere fact of aggregation (in whatever form: multi-block file, tgz file, directory, ...) is sufficient to indicate that blocks belong together. This is a minimal solution. The actual relationship of the contents of the blocks is determined by the relationships of the data names. Where actual relationships exist ("data X relates to experiment/sample/diffraction conditions/... Y") then that should be reflected in key data name relationships regardless of whether or not block pointers exist. So regardless of block pointers to diffractograms, the key data name relationships with _pd_diffractogram.id have to be explicit as that is no more than a statement of the truth.

The PUBLN question is interesting, as it forces us to ask what the publication relates to. In the old 1990s Volume C view, it related to a particular structure. In the modern day, it may relate to a whole family of structures contained in multiple data sets. So one answer might be that the publication refers to a list of data set identifiers. Such data sets could be composed especially for the publication so that all contributing calibration data etc. are included. These compositions might even have a DOI minted externally that is never embedded in each data block. But let's not solve this problem now as I think there are more pressing issues.

rowlesmr commented 1 year ago

PUBLN was just an example to illustrate "a whole family of structures contained in multiple data sets", not posing a specific issue. I don't really want to solve it now :)

rowlesmr commented 1 year ago

Just parking this here as some discussion around deprecation:

If block IDs are deprecated*, pdCIF will provide only a single way of linking data, and new additions to the dictionary need support only this one way. This will simplify reading, writing, and interpreting CIF files utilising pdCIF. Old users of pdCIF will need to update their data output methodologies to account for this change if they want to use new dictionary features; if they confine themselves to using pre-existing data items, then it should be possible to still use block pointers (which then leads to pdCIF programmers). pdCIF software writers could support both methods of linking to maintain maximum compatibility; they could also choose to only support non-deprecated features. Finally, deprecating block IDs will align pdCIF completely with the data linking methodology within the core CIF dictionary.

Keeping block IDs will mean that there are two ways of doing things (not ideal in a standard), but it will maintain continuity with previous methodologies. New additions to the dictionary will need to support different methods of linking. It may also become more confusing to (new and old) users of pdCIF, and lead to inconsistent linking between data.

.

*Deprecation, in this sense, is the process of taking older CIF data items and marking them as no longer being useful within the dictionary. This done when the information which a data item conveys has been superseded by a newer data item or a newer dictionary concept. As CIF is an archive format, the deprecated data item is not removed from the dictionary, as this would invalidate existing pdCIF files.

publcif commented 1 year ago

@rowlesmr mentioned:

Finally, deprecating block IDs will align pdCIF completely with the data linking methodology within the core CIF dictionary.

Please clarify what you mean by core CIF 'data linking methodology' (_audit.block_code, _audit_link.block_code, or ... ?)

rowlesmr commented 1 year ago

To my mind, it's formal links to other data items via _pd_phase.id and _pd_diffractogram.id. It also allows the _audit* data items to be used.

@jamesrhester will be able to give the nitty gritty.

jamesrhester commented 1 year ago

The "data linking methodology" for core CIF is found here. Note that it does not explain how to locate all of the blocks belonging to a dataset, it leaves that to the context (Rule 1). These principles simply ensure that blocks can be synthesised together into a coherent whole via the use of linked category keys.

The discussion here was an attempt to develop a comprehensive way of linking data blocks, but did not succeed.

rowlesmr commented 1 year ago

Raising this one again. I've not heard anything from anyone about it. I've reprompted MH @ IUCr for thier opinion.

rowlesmr commented 1 year ago

My only question here is now does one reference a block that does not define either a phase or a dataset and thus has neither _pd_phase.id nor _pd_diffractogram.id?

just another thought - block ids only work for phases and difffractograms, as there are only dataitems that take block ids for this purpose.

rowlesmr commented 1 year ago

@publcif @briantoby @jamesrhester @vaitkus

Is there any more input on deprecating block IDs?

My main reasoning behind wanting to do this is:

If block IDs are deprecated*, pdCIF will provide only a single way of linking data, and new additions to the dictionary need support only this one way.

One way of doing something is almost always better.

I do need to chase up with MH from IUCr.

briantoby commented 1 year ago

I can't make a recommendation here without spending a lot more time reviewing dictionaries than I have available at the moment, but if _pd_phase.id and _pd_diffractogram.id are constructed in sort-of-unique ways, as was intended for block IDs, then that would serve the purpose for references between data & phases. Then indeed block IDs would not be needed for that purpose. OTOH, (and I don't have an answer for this,) Is that the only place where inter-block pointers are needed? As one example of a block that is neither fish nor fowl (so to speak), I sometimes create CIFs with a summary block -- no data & no phases -- for example in the case of a sequential fit, where I have overall results for a large set of fits.

publcif commented 1 year ago

Establishing relationships between blocks in multi-block CIFs is an important part of IUCr Journals CIF processing, especially for pdCIF and msCIF. Anything that can assist in this task is welcome. So as long as any new alternative mechanisms serve the same purpose, we will add them to our processing software where appropriate.

Bearing in mind that we receive CIFs which may be a concatenation of other CIFs, my main concern is that recognizing relationships between datablocks could be more problematic if 'the mere fact of aggregation (in whatever form: multi-block file, tgz file, directory, ...) is sufficient to indicate that blocks belong together'. But that concern is equally applicable to any block-linking ID (which is why I use UUIDs in my CIF applications when I need to store multiblock relationships, e.g. in 'summary blocks' such as @briantoby refers to...)

jamesrhester commented 1 year ago

I can only repeat what I've said above. In short:

  1. Logical links between data items in a collection of blocks occurs through linked identifier data names and their values. This happens regardless of the existence or otherwise of block pointers.
  2. Which blocks constitute the collection of blocks is not currently solved within CIF. Discussion around this failed to reach a conclusion.

If UUIDs are used for diffractogram/phase/specimen/sample etc. identifiers, point (2) becomes irrelevant, as blocks can be grouped by which ones contain common identifier values for linked data names. However, readability suffers. As @briantoby says, a version of the block ID construction mechanism that is readable and "probably" unique seems like a good way to get around this. This would be enforced not by the CIF mechanics (e.g. a data type), but by recipients, so that if a single-block CIF is the entire data set, there is no need for fancy identifiers.

For example, we might suggest that all IDs are at least 6 non-whitespace characters. This should lower the chances of accidental collision. So, given a bunch of data blocks, the recipient (eg @publcif) might give an error or warning if identifiers that might have common values between data blocks have less than 6 non-whitespace characters in them, on the basis that there could be an accidental collision of identifiers.

publcif commented 1 year ago

Setting aside any general concerns with identifying multiblock relationships, I think it safe to say that IUCr Journals would not have a problem with this deprecation. After all, 'deprecation' is not 'prohibition'.

rowlesmr commented 1 year ago

OTOH, (and I don't have an answer for this,) Is that the only place where inter-block pointers are needed? As one example of a block that is neither fish nor fowl (so to speak), I sometimes create CIFs with a summary block -- no data & no phases -- for example in the case of a sequential fit, where I have overall results for a large set of fits.

is kind of answered by

2. Which blocks constitute the collection of blocks is not currently solved within CIF. Discussion around this failed to reach a conclusion.

Currently, given a group of data blocks, there isn't a way to tell which ones belong together, apart from them being presented together.

.

To my mind, the fundamental* id values are _pd_phase.id, _pd_diffractogram.id, and _pd_meas.detector_id (see #138). Also, to a lesser degree, _pd_char.id, _pd_prep.id, and _pd_spec.id (see #127). All of these are Text (case-sensitive, with whitespace), except _pd_meas.detector_id, which is Code (case-insensitive, no whitespace).

All categories, except for PD_CALIB_INCIDENT_INTENSITY, PD_CALIB_XCOORD, PD_INSTR, and PD_PEAK(_OVERALL), have a category key data item derived from one of those first three fundamental data items.

PD_CALIB_XCOORD doesn't really count, as it has _pd_calib_xcoord.overall_id, and PD_CALIB_XCOORD_OVERALL links to a diffractogram. PD_PEAK(_OVERALL) needs to be modified, most likely to follow the PD_CALIB_XCOORD(_OVERALL) model**. PD_CALIB_INCIDENT_INTENSITY should have a _pd_calib_incident_intensity.instr_id (see https://github.com/COMCIFS/Powder_Dictionary/pull/92#issuecomment-1493593082), and PD_INSTR should probably be promoted to be the fourth fundamental id value.

_pd_char|prep|spec.id (will (see #127)) link to each other, and there is _pd_diffractogram.spec_id to link diffraction pattern to specimen.

A full, 128-bit GUID gives 10^49 unique values. Eight case-sensitive, nonwhitespace ASCII characters is 10^13. Eight unicode characters is a lot***. As @jamesrhester has said, we can strengthen the descriptions of the fundamental ids (and probably also the spec-related ids) to be at least 8 nonwhitespace characters - one way to do this is to append the beginning of an GUID, e.g. "Corundum_749fa649".

* That is, the category keys of all other categories in pdCIF are created in combination with at least one of these keys. ** No. They all need diffractogram_id category keys. *** Imagine iding your phases using only emoji...

rowlesmr commented 1 year ago

'deprecation' is not 'prohibition'.

Yes. It's saying, at least, you shouldn't use these. I don't know what the policy is on removing data items from dictionaries, but one would imagine that would also involve a change in the major version number of said dictionary.

rowlesmr commented 1 year ago

Also, is this current thread a starting point for discussion on the uniqueness of id-type dataitems* in CIF, in general?

see also https://github.com/COMCIFS/cif_core/issues/316 and #75

* _type.purpose Key

rowlesmr commented 1 year ago

Could we leverage the current description of _pd_block.id, and generalise it to fit any fundamental id-type? It is a little more formal/complicated than "suggesting that all IDs are at least 6 non-whitespace characters", but its a start.

What to do with fundamental id-like values?

    A case-sensitive character string - the ID code - is assigned to an 
    identifier-like data item where the ID code must be reasonably expected 
    to be unique in the entire world.

    The ID code is assigned by the originator of the data set and is used 
    for references between different CIF data items and categories. The 
    ID code will normally be created when the data set is first created.

    One way which may be used to essentially guarentee uniqueness is to 
    use a Universally Unique IDentifier (UUID), as described by RFC4122.
    However, the human-readability of the CIF suffers with this. This can
    be mitigated by appending a portion of a UUID to a human-readable
    expression. 

    A URI or DOI may be used as an ID code where the relevant item is 
    available online. This should only be done if it can be reasonably 
    expected that the file will be available online indefinitely.

    The ID code is not intended to be parsed, and, as such, the
    concatenation of several strings can be used to generate ID code that 
    can reasonably be expected to be unique.

    One suggested format for the ID code is:
      <descriptor>|<creator_name>|<date-time>

     <descriptor>   a descriptive string assigned by the originator of 
                    the data set. This could be the name of the instrument
                    which collected the data, the phase name, or the
                    identity of the material.
     <creator_name> is the name of the person best associated with
                    the creation of the values to which the identifier
                    belongs. This could be the person(s) which calibrated 
                    the instrument, prepared the specimen, or collected 
                    the diffractogram. 
     <date-time>    is the date and time best associated with
                    the creation of the values to which the identifier
                    belongs. This could be when the instrument was 
                    calibrated, the specimen prepared, or the data
                    collected or analysed. Date-time entries follow the 
                    standard RFC 3339 format: 
                    'yyyy-mm-ddThh:mm:ss{Z|[+-]zz:zz}'.

    Some example ID codes are:
       c7eade47-2641-4615-9be6-17a33f354030
       NISTSRM-676A_8cb84ae7
       Si-std_D500#1234-987|B.Toby|1991-09-15T16:54:00Z
       CIF_core|https://raw.githubusercontent.com/COMCIFS/cif_core/master/cif_core.dic|2023-06-10T20:46:37+08:00
       10.5281/zenodo.4415768/sxd_empty_run.nxs
rowlesmr commented 1 year ago

I was just reading multi-block-principles.

Could the ID requirements be relaxed with the introduction of a (currently hypothicated) _audit_dataset.id data item? Every data block which belongs in the same dataset (as defined by the person writing the CIF container) will have an identical _audit_dataset.id data value. You can then go full-UUID on this data item, and all other id-like items need only be unique within this namespace.

briantoby commented 1 year ago

I would say that it does not matter what the name of the tag that gives a block a unique name, as long as one exists. OTOH, what to do if there is more than one such block id provided and they are different? So, it is best to select one name and be done.

rowlesmr commented 1 year ago

I've conflated identifying blocks that belong together (eg ones from the same experiment) with uniquely identifying a single block.

The current multi block guidelines are implicit; if blocks appear together, they belong together. If we introduce a uuid-like identifer to tag disparate blocks as belonging together in that same experiment, then the uniqueness requirements for our other identifers are greatly reduced to only being unique within that same experiment.

@jamesrhester was the multiblock discussion on the cif dmg mailing list? I couldn't find it in the archive.

jamesrhester commented 1 year ago

I've drafted up a summary document around these issues of block collections, as this affects all of CIF. Can we take this aspect of the discussion to an issue around that document?

The inconclusive discussion around how to link datablocks is here.

rowlesmr commented 1 year ago

Just to come back on track (sorry all)

This is what we have to work with:

  1. Logical links between data items in a collection of blocks occurs through linked identifier data names and their values. This happens regardless of the existence or otherwise of block pointers.
  2. Which blocks constitute the collection of blocks is not currently solved within CIF. Discussion around this failed to reach a conclusion.

So, we need unique ID values. Currently, in pdCIF, this means that _pd_phase.id, _pd_diffractogram.id, _pd_instr.id, _pd_instr_detector.id, and _pd_char.id must be unique. AFAIK, everything in pdCIF has some combination of these data items as category keys (as linked items) with or without a category-specific id.

So we have these approaches:

(from @briantoby / @jamesrhester) As @briantoby says, a version of the block ID construction mechanism that is readable and "probably" unique seems like a good way to get around (readability suffering from UUIDs).

(from @publcif) I use UUIDs in my CIF applications when I need to store multiblock relationships, e.g. in 'summary blocks' such as @briantoby refers to...

(from @jamesrhester) we might suggest that all IDs are at least 6 non-whitespace characters. This should lower the chances of accidental collision. So, given a bunch of data blocks, the recipient (eg @publcif) might give an error or warning if identifiers that might have common values between data blocks have less than 6 non-whitespace characters in them, on the basis that there could be an accidental collision of identifiers.

My contribution is a little more formal, and takes heavily from the block_id approach of @briantoby and flavoured by @publcif and @jamesrhester 's suggestions

From above:

    A case-sensitive character string - the ID code - is assigned to an 
    identifier-like data item where the ID code must be reasonably expected 
    to be unique in the entire world.

    The ID code is assigned by the originator of the data set and is used 
    for references between different CIF data items and categories. The 
    ID code will normally be created when the data set is first created.

    One way which may be used to essentially guarentee uniqueness is to 
    use a Universally Unique IDentifier (UUID), as described by RFC4122.
    However, the human-readability of the CIF suffers with this. This can
    be mitigated by appending a portion of a UUID to a human-readable
    expression. 

    A second way is to use a URI or DOI as an ID code where the relevant item 
    is available online. This should only be done if it can be reasonably 
    expected that the file will be available online indefinitely.

    A third way is to concatenate several strings to generate an ID code that 
    can reasonably be expected to be unique.

    One suggested format for the ID code following this last approach is:
      <descriptor>|<creator_name>|<date-time>

     <descriptor>   a descriptive string assigned by the originator of 
                    the data set. This could be the name of the instrument
                    which collected the data, the phase name, or the
                    identity of the material.
     <creator_name> is the name of the person best associated with
                    the creation of the values to which the identifier
                    belongs. This could be the person(s) which calibrated 
                    the instrument, prepared the specimen, or collected 
                    the diffractogram. 
     <date-time>    is the date and time best associated with
                    the creation of the values to which the identifier
                    belongs. This could be when the instrument was 
                    calibrated, the specimen prepared, or the data
                    collected or analysed. Date-time entries follow the 
                    standard RFC 3339 format: 
                    'yyyy-mm-ddThh:mm:ss{Z|[+-]zz:zz}'.

    Some example ID codes are:
       c7eade47-2641-4615-9be6-17a33f354030
       NISTSRM-676A_8cb84ae7
       Si-std_D500#1234-987|B.Toby|1991-09-15T16:54:00Z
       CIF_core|https://raw.githubusercontent.com/COMCIFS/cif_core/master/cif_core.dic|2023-06-10T20:46:37+08:00
       10.1234/5678910/xraydata.xye

I've written it in a case-sensitive, no whitespace manner. I could be convinced on allowing whitespace, but I think keys should be case-sensitive.

rowlesmr commented 1 year ago

And also, now that I've properly read the block collections guidance document from James, my previous post is essentially irrelevant.

@publcif, does James' suggested hybrid approach cater to your approach?