Open rowlesmr opened 1 year ago
for the comment "**" above, there is an idea here: https://github.com/COMCIFS/Powder_Dictionary/issues/71#issuecomment-1374419423
I think this is really a policy decision for the pdCIF community. Including block pointers is harmless and potentially useful, but may be adding additional complexity that is never used in practice. Probably best to leave it as a low-priority optional addition, and if someone wants them enough I have no issue in someone else doing the work to add them to the dictionary. Indeed such work could be automated if diffractogram_id and phase_id data names are present in the dictionary.
This issue is as good a place as any to discuss I suppose.
The only reason that I mention it is that if it stays, we need to maintain two different ways of doing the same thing; block_id and phase&diffractogram_id, and then things get murky.
New categories that need to refer to different places will need to have _pd_category.diffractogram_block_id
, _pd_category.diffractogram_id
, _pd_category.phase_block_id
, and _pd_category.phase_id
. What do we then make the category key? the phase_id or the block_phase_id? do we need to have dREL to autopopulate one or more of these keys? which takes primacy? Do implementers have to then implement two ways of linking dataitems in pdCIF?
I can see the appeal in the block id as to recording the history associated with a block, assuming you follow the suggestion, but I don't know how it is to work in a system that isn't supposed to be block-aware (not that I know much about that in the first place).
New categories that need to refer to different places will need to have
_pd_category.diffractogram_block_id
,_pd_category.diffractogram_id
,_pd_category.phase_block_id
, and_pd_category.phase_id
. What do we then make the category key? the phase_id or the block_phase_id? do we need to have dREL to autopopulate one or more of these keys? which takes primacy? Do implementers have to then implement two ways of linking dataitems in pdCIF?
Let me answer those questions:
The category keys are always the phase_id
and diffractogram_id
children (and potentially other per-category data names ). dREL can't usually auto-populate the phase_id/diffractogram_id keys as they are the inputs, not the outputs, but could easily auto-populate the block pointers. phase_id
and diffractogram_id
child data names have primacy. From a CIF point of view the phase_id
and diffractogram_id
data names must always be provided, I leave it up to the pdCIF community to assess the value of block pointers going forward. As a member of that community I am clearly shirking my duty to have an opinion here.
I can see the appeal in the block id as to recording the history associated with a block, assuming you follow the suggestion, but I don't know how it is to work in a system that isn't supposed to be block-aware (not that I know much about that in the first place).
Yep, you would end up with a long list of block names and no idea which ones were previous versions of which other ones. If we care we can add a data name "previous_version" or something, which is information that is not strictly available currently as the order of rows in a loop is arbitrary. I don't think this is important.
I'll get in touch with Dave Billing and see if he can send out emails et al.
@briantoby do you have an opinion?
Historically the motivation of the ids and pointers was to provide a structure that logically connects CIF data blocks in a persistent manner. This was needed because initial practice in the IUCr was to reduce multi-block CIFs to files each with a single block. Block names (as the name following data_...) were considered arbitrary and could be changed by the recipient. Powder diffraction requires the logical connection of multiple CIF data blocks. It became clear later that this was also needed for cases where a set of data might reference a different set of data that one would not want to include in a single file. (An example of this might be where a set of calibration measurements are repeated weekly; each dataset collected in that week would reference the calibration results, but one would not want to include the entire calibration CIF in each collected dataset.)
There are many different ways I could imagine this being done. I am not likely to change the code that produces these links in GSAS-II anytime soon and I do not expect any change will ever be made for GSAS2CIF in GSAS/EXPGUI (and that software suite refuses to die), but I see no reason to object to finding a better mechanism for inter-block referencing this to eventually supersede my initial attempt.
My only question here is now does one reference a block that does not define either a phase or a dataset and thus has neither _pd_phase.id nor _pd_diffractogram.id?
There are many different ways I could imagine this being done. I am not likely to change the code that produces these links in GSAS-II anytime soon and I do not expect any change will ever be made for GSAS2CIF in GSAS/EXPGUI (and that software suite refuses to die), but I see no reason to object to finding a better mechanism for inter-block referencing this to eventually supersede my initial attempt.
I don't want to remove block ids, so how it currently works will still work. I just forsee issues with adding new features and maintaining two different ways of linking and communicating to users their compatibilities or otherwise.
My only question here is now does one reference a block that does not define either a phase or a dataset and thus has neither _pd_phase.id nor _pd_diffractogram.id?
Not an answer, but an extension question (probably for @jamesrhester): how do we link, for instance, publication information, to a number of blocks. If there is a "summary" block containing the publication information (from CIF_CORE) or a series of PD_INSTR
values , how is this formally linked to the applicable blocks in the collection? Does the mere presence of a _pd_diffractogram.id
in a summary block do this job? or is there needed a formal data item?
My only question here is now does one reference a block that does not define either a phase or a dataset and thus has neither _pd_phase.id nor _pd_diffractogram.id?
Not an answer, but an extension question (probably for @jamesrhester): how do we link, for instance, publication information, to a number of blocks. If there is a "summary" block containing the publication information (from CIF_CORE) or a series of
PD_INSTR
values , how is this formally linked to the applicable blocks in the collection? Does the mere presence of a_pd_diffractogram.id
in a summary block do this job? or is there needed a formal data item?
The mere fact of aggregation (in whatever form: multi-block file, tgz file, directory, ...) is sufficient to indicate that blocks belong together. This is a minimal solution. The actual relationship of the contents of the blocks is determined by the relationships of the data names. Where actual relationships exist ("data X relates to experiment/sample/diffraction conditions/... Y") then that should be reflected in key data name relationships regardless of whether or not block pointers exist. So regardless of block pointers to diffractograms, the key data name relationships with _pd_diffractogram.id have to be explicit as that is no more than a statement of the truth.
The PUBLN question is interesting, as it forces us to ask what the publication relates to. In the old 1990s Volume C view, it related to a particular structure. In the modern day, it may relate to a whole family of structures contained in multiple data sets. So one answer might be that the publication refers to a list of data set identifiers. Such data sets could be composed especially for the publication so that all contributing calibration data etc. are included. These compositions might even have a DOI minted externally that is never embedded in each data block. But let's not solve this problem now as I think there are more pressing issues.
PUBLN was just an example to illustrate "a whole family of structures contained in multiple data sets", not posing a specific issue. I don't really want to solve it now :)
Just parking this here as some discussion around deprecation:
If block IDs are deprecated*, pdCIF will provide only a single way of linking data, and new additions to the dictionary need support only this one way. This will simplify reading, writing, and interpreting CIF files utilising pdCIF. Old users of pdCIF will need to update their data output methodologies to account for this change if they want to use new dictionary features; if they confine themselves to using pre-existing data items, then it should be possible to still use block pointers (which then leads to pdCIF programmers). pdCIF software writers could support both methods of linking to maintain maximum compatibility; they could also choose to only support non-deprecated features. Finally, deprecating block IDs will align pdCIF completely with the data linking methodology within the core CIF dictionary.
Keeping block IDs will mean that there are two ways of doing things (not ideal in a standard), but it will maintain continuity with previous methodologies. New additions to the dictionary will need to support different methods of linking. It may also become more confusing to (new and old) users of pdCIF, and lead to inconsistent linking between data.
.
*Deprecation, in this sense, is the process of taking older CIF data items and marking them as no longer being useful within the dictionary. This done when the information which a data item conveys has been superseded by a newer data item or a newer dictionary concept. As CIF is an archive format, the deprecated data item is not removed from the dictionary, as this would invalidate existing pdCIF files.
@rowlesmr mentioned:
Finally, deprecating block IDs will align pdCIF completely with the data linking methodology within the core CIF dictionary.
Please clarify what you mean by core CIF 'data linking methodology' (_audit.block_code, _audit_link.block_code, or ... ?)
To my mind, it's formal links to other data items via _pd_phase.id
and _pd_diffractogram.id
. It also allows the _audit*
data items to be used.
@jamesrhester will be able to give the nitty gritty.
The "data linking methodology" for core CIF is found here. Note that it does not explain how to locate all of the blocks belonging to a dataset, it leaves that to the context (Rule 1). These principles simply ensure that blocks can be synthesised together into a coherent whole via the use of linked category keys.
The discussion here was an attempt to develop a comprehensive way of linking data blocks, but did not succeed.
Raising this one again. I've not heard anything from anyone about it. I've reprompted MH @ IUCr for thier opinion.
My only question here is now does one reference a block that does not define either a phase or a dataset and thus has neither _pd_phase.id nor _pd_diffractogram.id?
just another thought - block ids only work for phases and difffractograms, as there are only dataitems that take block ids for this purpose.
@publcif @briantoby @jamesrhester @vaitkus
Is there any more input on deprecating block IDs?
My main reasoning behind wanting to do this is:
If block IDs are deprecated*, pdCIF will provide only a single way of linking data, and new additions to the dictionary need support only this one way.
One way of doing something is almost always better.
I do need to chase up with MH from IUCr.
I can't make a recommendation here without spending a lot more time reviewing dictionaries than I have available at the moment, but if _pd_phase.id and _pd_diffractogram.id are constructed in sort-of-unique ways, as was intended for block IDs, then that would serve the purpose for references between data & phases. Then indeed block IDs would not be needed for that purpose. OTOH, (and I don't have an answer for this,) Is that the only place where inter-block pointers are needed? As one example of a block that is neither fish nor fowl (so to speak), I sometimes create CIFs with a summary block -- no data & no phases -- for example in the case of a sequential fit, where I have overall results for a large set of fits.
Establishing relationships between blocks in multi-block CIFs is an important part of IUCr Journals CIF processing, especially for pdCIF and msCIF. Anything that can assist in this task is welcome. So as long as any new alternative mechanisms serve the same purpose, we will add them to our processing software where appropriate.
Bearing in mind that we receive CIFs which may be a concatenation of other CIFs, my main concern is that recognizing relationships between datablocks could be more problematic if 'the mere fact of aggregation (in whatever form: multi-block file, tgz file, directory, ...) is sufficient to indicate that blocks belong together'. But that concern is equally applicable to any block-linking ID (which is why I use UUIDs in my CIF applications when I need to store multiblock relationships, e.g. in 'summary blocks' such as @briantoby refers to...)
I can only repeat what I've said above. In short:
If UUIDs are used for diffractogram/phase/specimen/sample etc. identifiers, point (2) becomes irrelevant, as blocks can be grouped by which ones contain common identifier values for linked data names. However, readability suffers. As @briantoby says, a version of the block ID construction mechanism that is readable and "probably" unique seems like a good way to get around this. This would be enforced not by the CIF mechanics (e.g. a data type), but by recipients, so that if a single-block CIF is the entire data set, there is no need for fancy identifiers.
For example, we might suggest that all IDs are at least 6 non-whitespace characters. This should lower the chances of accidental collision. So, given a bunch of data blocks, the recipient (eg @publcif) might give an error or warning if identifiers that might have common values between data blocks have less than 6 non-whitespace characters in them, on the basis that there could be an accidental collision of identifiers.
Setting aside any general concerns with identifying multiblock relationships, I think it safe to say that IUCr Journals would not have a problem with this deprecation. After all, 'deprecation' is not 'prohibition'.
OTOH, (and I don't have an answer for this,) Is that the only place where inter-block pointers are needed? As one example of a block that is neither fish nor fowl (so to speak), I sometimes create CIFs with a summary block -- no data & no phases -- for example in the case of a sequential fit, where I have overall results for a large set of fits.
is kind of answered by
2. Which blocks constitute the collection of blocks is not currently solved within CIF. Discussion around this failed to reach a conclusion.
Currently, given a group of data blocks, there isn't a way to tell which ones belong together, apart from them being presented together.
.
To my mind, the fundamental* id
values are _pd_phase.id
, _pd_diffractogram.id
, and _pd_meas.detector_id
(see #138). Also, to a lesser degree, _pd_char.id
, _pd_prep.id
, and _pd_spec.id
(see #127). All of these are Text
(case-sensitive, with whitespace), except _pd_meas.detector_id
, which is Code
(case-insensitive, no whitespace).
All categories, except for PD_CALIB_INCIDENT_INTENSITY
, PD_CALIB_XCOORD
, PD_INSTR
, and PD_PEAK(_OVERALL)
, have a category key data item derived from one of those first three fundamental data items.
PD_CALIB_XCOORD
doesn't really count, as it has _pd_calib_xcoord.overall_id
, and PD_CALIB_XCOORD_OVERALL
links to a diffractogram. PD_PEAK(_OVERALL)
needs to be modified, most likely to follow the PD_CALIB_XCOORD(_OVERALL)
model**. PD_CALIB_INCIDENT_INTENSITY
should have a _pd_calib_incident_intensity.instr_id
(see https://github.com/COMCIFS/Powder_Dictionary/pull/92#issuecomment-1493593082), and PD_INSTR
should probably be promoted to be the fourth fundamental id
value.
_pd_char|prep|spec.id
(will (see #127)) link to each other, and there is _pd_diffractogram.spec_id
to link diffraction pattern to specimen.
A full, 128-bit GUID gives 10^49 unique values. Eight case-sensitive, nonwhitespace ASCII characters is 10^13. Eight unicode characters is a lot***. As @jamesrhester has said, we can strengthen the descriptions of the fundamental id
s (and probably also the spec
-related id
s) to be at least 8 nonwhitespace characters - one way to do this is to append the beginning of an GUID, e.g. "Corundum_749fa649".
* That is, the category keys of all other categories in pdCIF
are created in combination with at least one of these keys.
** No. They all need diffractogram_id category keys.
*** Imagine id
ing your phases using only emoji...
'deprecation' is not 'prohibition'.
Yes. It's saying, at least, you shouldn't use these. I don't know what the policy is on removing data items from dictionaries, but one would imagine that would also involve a change in the major version number of said dictionary.
Also, is this current thread a starting point for discussion on the uniqueness of id
-type dataitems* in CIF, in general?
see also https://github.com/COMCIFS/cif_core/issues/316 and #75
* _type.purpose Key
Could we leverage the current description of _pd_block.id
, and generalise it to fit any fundamental id-type? It is a little more formal/complicated than "suggesting that all IDs are at least 6 non-whitespace characters", but its a start.
A case-sensitive character string - the ID code - is assigned to an
identifier-like data item where the ID code must be reasonably expected
to be unique in the entire world.
The ID code is assigned by the originator of the data set and is used
for references between different CIF data items and categories. The
ID code will normally be created when the data set is first created.
One way which may be used to essentially guarentee uniqueness is to
use a Universally Unique IDentifier (UUID), as described by RFC4122.
However, the human-readability of the CIF suffers with this. This can
be mitigated by appending a portion of a UUID to a human-readable
expression.
A URI or DOI may be used as an ID code where the relevant item is
available online. This should only be done if it can be reasonably
expected that the file will be available online indefinitely.
The ID code is not intended to be parsed, and, as such, the
concatenation of several strings can be used to generate ID code that
can reasonably be expected to be unique.
One suggested format for the ID code is:
<descriptor>|<creator_name>|<date-time>
<descriptor> a descriptive string assigned by the originator of
the data set. This could be the name of the instrument
which collected the data, the phase name, or the
identity of the material.
<creator_name> is the name of the person best associated with
the creation of the values to which the identifier
belongs. This could be the person(s) which calibrated
the instrument, prepared the specimen, or collected
the diffractogram.
<date-time> is the date and time best associated with
the creation of the values to which the identifier
belongs. This could be when the instrument was
calibrated, the specimen prepared, or the data
collected or analysed. Date-time entries follow the
standard RFC 3339 format:
'yyyy-mm-ddThh:mm:ss{Z|[+-]zz:zz}'.
Some example ID codes are:
c7eade47-2641-4615-9be6-17a33f354030
NISTSRM-676A_8cb84ae7
Si-std_D500#1234-987|B.Toby|1991-09-15T16:54:00Z
CIF_core|https://raw.githubusercontent.com/COMCIFS/cif_core/master/cif_core.dic|2023-06-10T20:46:37+08:00
10.5281/zenodo.4415768/sxd_empty_run.nxs
I was just reading multi-block-principles.
Could the ID requirements be relaxed with the introduction of a (currently hypothicated) _audit_dataset.id
data item? Every data block which belongs in the same dataset (as defined by the person writing the CIF container) will have an identical _audit_dataset.id
data value. You can then go full-UUID on this data item, and all other id
-like items need only be unique within this namespace.
I would say that it does not matter what the name of the tag that gives a block a unique name, as long as one exists. OTOH, what to do if there is more than one such block id provided and they are different? So, it is best to select one name and be done.
I've conflated identifying blocks that belong together (eg ones from the same experiment) with uniquely identifying a single block.
The current multi block guidelines are implicit; if blocks appear together, they belong together. If we introduce a uuid-like identifer to tag disparate blocks as belonging together in that same experiment, then the uniqueness requirements for our other identifers are greatly reduced to only being unique within that same experiment.
@jamesrhester was the multiblock discussion on the cif dmg mailing list? I couldn't find it in the archive.
I've drafted up a summary document around these issues of block collections, as this affects all of CIF. Can we take this aspect of the discussion to an issue around that document?
The inconclusive discussion around how to link datablocks is here.
Just to come back on track (sorry all)
This is what we have to work with:
- Logical links between data items in a collection of blocks occurs through linked identifier data names and their values. This happens regardless of the existence or otherwise of block pointers.
- Which blocks constitute the collection of blocks is not currently solved within CIF. Discussion around this failed to reach a conclusion.
So, we need unique ID values. Currently, in pdCIF, this means that _pd_phase.id
, _pd_diffractogram.id
, _pd_instr.id
, _pd_instr_detector.id
, and _pd_char.id
must be unique. AFAIK, everything in pdCIF has some combination of these data items as category keys (as linked items) with or without a category-specific id.
So we have these approaches:
(from @briantoby / @jamesrhester) As @briantoby says, a version of the block ID construction mechanism that is readable and "probably" unique seems like a good way to get around (readability suffering from UUIDs).
(from @publcif) I use UUIDs in my CIF applications when I need to store multiblock relationships, e.g. in 'summary blocks' such as @briantoby refers to...
(from @jamesrhester) we might suggest that all IDs are at least 6 non-whitespace characters. This should lower the chances of accidental collision. So, given a bunch of data blocks, the recipient (eg @publcif) might give an error or warning if identifiers that might have common values between data blocks have less than 6 non-whitespace characters in them, on the basis that there could be an accidental collision of identifiers.
My contribution is a little more formal, and takes heavily from the block_id approach of @briantoby and flavoured by @publcif and @jamesrhester 's suggestions
From above:
A case-sensitive character string - the ID code - is assigned to an
identifier-like data item where the ID code must be reasonably expected
to be unique in the entire world.
The ID code is assigned by the originator of the data set and is used
for references between different CIF data items and categories. The
ID code will normally be created when the data set is first created.
One way which may be used to essentially guarentee uniqueness is to
use a Universally Unique IDentifier (UUID), as described by RFC4122.
However, the human-readability of the CIF suffers with this. This can
be mitigated by appending a portion of a UUID to a human-readable
expression.
A second way is to use a URI or DOI as an ID code where the relevant item
is available online. This should only be done if it can be reasonably
expected that the file will be available online indefinitely.
A third way is to concatenate several strings to generate an ID code that
can reasonably be expected to be unique.
One suggested format for the ID code following this last approach is:
<descriptor>|<creator_name>|<date-time>
<descriptor> a descriptive string assigned by the originator of
the data set. This could be the name of the instrument
which collected the data, the phase name, or the
identity of the material.
<creator_name> is the name of the person best associated with
the creation of the values to which the identifier
belongs. This could be the person(s) which calibrated
the instrument, prepared the specimen, or collected
the diffractogram.
<date-time> is the date and time best associated with
the creation of the values to which the identifier
belongs. This could be when the instrument was
calibrated, the specimen prepared, or the data
collected or analysed. Date-time entries follow the
standard RFC 3339 format:
'yyyy-mm-ddThh:mm:ss{Z|[+-]zz:zz}'.
Some example ID codes are:
c7eade47-2641-4615-9be6-17a33f354030
NISTSRM-676A_8cb84ae7
Si-std_D500#1234-987|B.Toby|1991-09-15T16:54:00Z
CIF_core|https://raw.githubusercontent.com/COMCIFS/cif_core/master/cif_core.dic|2023-06-10T20:46:37+08:00
10.1234/5678910/xraydata.xye
I've written it in a case-sensitive, no whitespace manner. I could be convinced on allowing whitespace, but I think keys should be case-sensitive.
And also, now that I've properly read the block collections guidance document from James, my previous post is essentially irrelevant.
@publcif, does James' suggested hybrid approach cater to your approach?
Split off from #39
From @vaitkus speaking on the role of
_ps_phase.id
and_pd_diffractogram.id
:Potentially heretical thing coming.
From what I've gathered, @briantoby introduced
_pd_block_id
as a way to link between data blocks, as CIF (at that time) had no way of different data blocks to talk to each other, as everything was essentially single-phase, single-pattern data blocks. Block IDs were used to identify blocks containing either phase or diffractogram, or phase and diffractogram, hence the need for_pd_phase_block_id
and_pd_block_diffractogram_id
to differentiate between blocks containing phase or diffractogram information.We now have
_pd_phase.id
* and_pd_diffractogram.id
. If the construction of these ID values is brought inline with the current suggestion for_pd_block.id
(ie more "uuid-like"), do we even need block IDs? I'm not knowledgable enough to make the call, but could the move to CIF_2.0 be used deprecate block IDs? To make the dictionary easier to maintain, could we split off deprecated data items into their own dictionary (cif_pow_deprecated.dic ?)**, such that they are still able to be used, but new additions to the dictionary would use_pd_phase.id
and_pd_diffractogram.id
to link between phase and diffractogram***..
* Yes,
_pd_phase_id
existed previously, but it was only there to link between_pd_phase_block_id
and_pd_refln.phase_id
.** Maybe it can be auto-included, like cif_core does with definitions in things like Miller indices...?
** we will need to make something like
_pd_phase.short_id
and_pd_diffractogram.short_id
, which have block-scope (rather than being unique) to link between things like_pd_refln.phase_id
,_pd_pref_orient_March_Dollase.diffractogram_id
, etc. so that we don't have to have huge, UUID-like values everywhere. In this example,_pd_phase.short_id
would be an alias for_pd_phase_id
. There would need to be a way to loop_pd_phase.short_id
with_pd_phase.id
in a datablock to link the two together.