How unique do id-like identifiers need to be?

rowlesmr commented 1 year ago

Yes, _pd_diffractogram.id, _pd_phase.id, and _pd_block.id need to/should be unique in the whole world, but what about things like _pd_pref_orient_March_Dollase.id or _pd_data.point_id?

Or do we need to ensure that there are sufficient category keys such that their combination is unique in the container?

For example, in this single container:

#\#CIF_2.0
data_diffpatt_1

_pd_diffractogram.id    DIFFPAT_NUMBER_1_A_LONG_STRING

loop_
_pd_phase_list.id
_pd_phase_list.phase_id
1   FIRST_ALUMINA_A_BIG_LONG_STRING_THAT_IS_GLOBALLY_UNIQUE
2   FIRST_SILICON_ANOTHER_STRING_THAT_IS_SUPPOSED_TO_BE_ALONE

loop_
_pd_phase_mass.phase_list_id
_pd_phase_mass.percent
1   72.11
2   27.89

loop_
_pd_pref_orient_March_Dollase.id
_pd_pref_orient_March_Dollase.phase_list_id
_pd_pref_orient_March_Dollase.hkl
_pd_pref_orient_March_Dollase.fract
_pd_pref_orient_March_Dollase.r
1   1   [1 0 4] 0.69    0.75
2   1   [1 0 1] 0.31    1.42
1   2   [1 1 1] .   0.98

loop_
_pd_data.point_id
_pd_meas.2theta_scan
_pd_proc.intensity_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
1   5.001    43.364    25.994    25.991 
2   5.004    38.007    26.200    26.200 
3   5.007    38.318    26.404    26.403 
#...

################################################################

data_diffpatt_2

_pd_diffractogram.id    DIFFPAT_NUMBER_2_A_LONG_STRING

loop_
_pd_phase_list.id
_pd_phase_list.phase_id
1   SECOND_ALUMINA_A_BIG_LONG_STRING_THAT_IS_GLOBALLY_UNIQUE
2   SECOND_SILICON_ANOTHER_STRING_THAT_IS_SUPPOSED_TO_BE_ALONE

loop_
_pd_phase_mass.phase_list_id
_pd_phase_mass.percent
1   95.95
2   4.05

loop_
_pd_pref_orient_March_Dollase.id
_pd_pref_orient_March_Dollase.phase_list_id
_pd_pref_orient_March_Dollase.hkl
_pd_pref_orient_March_Dollase.fract
_pd_pref_orient_March_Dollase.r
1   1   [1 0 4] 0.51 0.97
2   1   [1 0 1] 0.49    0.89
1   2   [1 1 1] .   1.01

loop_
_pd_data.point_id
_pd_meas.2theta_scan
_pd_proc.intensity_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
1   5.001    45.364    45.994    35.991 
2   5.004    49.007    46.200    36.200 
3   5.007    63.318    56.404    46.403 
#...

Are _pd_phase_list.id, _pd_pref_orient_March_Dollase.id, and _pd_data.point_id sufficiently unique?

Are the values of _pd_phase_mass.phase_list_id sufficient to properly identify the correct phase to which the mass percent applies?

Do all values of every data item have the scope of the entire container?

jamesrhester commented 1 year ago

Yes, the container (the data block) creates a scope. Identifiers have to be sufficiently unique that a unique row in the category is identified by the values of the key data names of the category, bearing in mind that child data names of single-valued items e.g. _pd_phase.id do not need to be explicitly provided if there is a single value of _pd_phase.id already present in the data block.

Where a data set is created that is not going to be expanded in the future or have data blocks mixed and matched to create new data sets, it is likely that very simple ids are sufficient. That, I think, describes the vast bulk of data sets produced. Where data blocks containing calibration information are included, they would need to have more or less unique ids to make sure that they don't clash with IDs chosen by the data set they are being joined to. Again, such clashes can be detected and corrected at data block amalgamation time if desired.
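As a rough illustration of the uniqueness requirement, the sketch below checks whether a chosen set of key data names identifies each row of a category exactly once within one data block. It uses three columns of the `_pd_pref_orient_March_Dollase` loop from `data_diffpatt_1`. The helper `duplicate_keys` is hypothetical, and treating `id` plus `phase_list_id` as the category key is an assumption made for illustration, not something taken from the dictionary.

```python
from collections import Counter


def duplicate_keys(rows, key_names):
    """Return the key tuples that occur more than once among the rows of one category."""
    counts = Counter(tuple(row[name] for name in key_names) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Rows of the _pd_pref_orient_March_Dollase loop in data_diffpatt_1 (subset of columns).
rows = [
    {"id": "1", "phase_list_id": "1", "hkl": "[1 0 4]", "r": "0.75"},
    {"id": "2", "phase_list_id": "1", "hkl": "[1 0 1]", "r": "1.42"},
    {"id": "1", "phase_list_id": "2", "hkl": "[1 1 1]", "r": "0.98"},
]

# Assumption: the category key is either `id` alone or `id` plus `phase_list_id`.
id_only = duplicate_keys(rows, ["id"])
composite = duplicate_keys(rows, ["id", "phase_list_id"])

print(id_only)    # keys repeated when `id` is the only key
print(composite)  # keys repeated when both data names form the key
```

Under these assumptions, the short ids `1` and `2` are enough inside the block, provided the combination of key data names is what has to be unique.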

rowlesmr commented 1 year ago

> Yes, the container (the data block) creates a scope.

I mean "container" as in all of the data blocks that are in the CIF file(s) that make up a single "experiment" (there are two blocks in the one container above). I think you use the term "data collection" in §1.4.2.1.1 in the draft you sent me.

> Identifiers have to be sufficiently unique that a unique row in the category is identified by the values of the key data names of the category, bearing in mind that child data names of single-valued items e.g. _pd_phase.id do not need to be explicitly provided if there is a single value of _pd_phase.id already present in the data block.

As an example, I think what you said means that if I have a _pd_diffractogram.id in a data block, I don't need to define _pd_data.diffractogram_id in that block. If I have a _pd_data.diffractogram_id, I don't need a _pd_calc.diffractogram_id in the same block (assuming I'm talking about the same diffractogram id)?

Does "child data names" mean (i) a linked data name, or (ii) the data name of a category that is a subcategory of another?

> Where a data set is created that is not going to be expanded in the future or have data blocks mixed and matched to create new data sets, it is likely that very simple ids are sufficient. That, I think, describes the vast bulk of data sets produced.

So is it OK for data values that are keys to have the same value in different data blocks in the one container/collection?

> Where data blocks containing calibration information are included, they would need to have more or less unique ids to make sure that they don't clash with IDs chosen by the data set they are being joined to. Again, such clashes can be detected and corrected at data block amalgamation time if desired.

Calibration datasets are an outlier; I believe they should be specified quite uniquely, as they will be referenced in a lot of places.

jamesrhester commented 1 year ago

> > Yes, the container (the data block) creates a scope.
>
> I mean "container" as in all of the data blocks that are in the CIF file(s) that make up a single "experiment" (there are two blocks in the one container above). I think you use the term "data collection" in §1.4.2.1.1 in the draft you sent me.

OK, well, all containers create a scope. In the case of e.g. a directory of multi-block CIF data files, there are no data names in the container scope, because they are all inside data blocks, so the scope of any data name is the data block that it is in. In a hierarchical data file like HDF5, each level in the hierarchy creates a scope. When describing how HDF5 contents are described by a CIF dictionary (which they very much can be), scope is one of the things that should be pinned down, and in any case it has to be pinned down by anyone wanting to interpret an HDF5 file, with or without the aid of a CIF dictionary. Getting sidetracked.

> > Identifiers have to be sufficiently unique that a unique row in the category is identified by the values of the key data names of the category, bearing in mind that child data names of single-valued items e.g. _pd_phase.id do not need to be explicitly provided if there is a single value of _pd_phase.id already present in the data block.
>
> As an example, I think what you said means that if I have a _pd_diffractogram.id in a data block, I don't need to define _pd_data.diffractogram_id in that block. If I have a _pd_data.diffractogram_id, I don't need a _pd_calc.diffractogram_id in the same block (assuming I'm talking about the same diffractogram id)?

Yes, generally if the data name given in _name.linked_item_id is present and there is only one value in scope, then you don't need to repeat it.
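A minimal sketch of that rule, assuming a data block has already been parsed into a plain mapping from data names to lists of values: a missing child data name takes the parent's value when exactly one parent value is in scope, and must be given explicitly otherwise. The function `resolve_child` and the dictionary representation are hypothetical. In practice the parent/child pairing would come from `_name.linked_item_id` in the dictionary rather than being passed in by hand.

```python
def resolve_child(block, child_name, parent_name):
    """Return the values of child_name, inferring them from parent_name when omitted.

    `block` maps data names to lists of values (a simplified stand-in for a parsed block).
    """
    if child_name in block:
        # The child data name was given explicitly, so nothing needs inferring.
        return block[child_name]
    parent_values = set(block.get(parent_name, []))
    if len(parent_values) == 1:
        # Exactly one parent value in scope: the child implicitly takes that value.
        return [next(iter(parent_values))]
    raise ValueError(
        f"{child_name} must be given explicitly: "
        f"{len(parent_values)} values of {parent_name} are in scope"
    )


block = {
    "_pd_diffractogram.id": ["DIFFPAT_NUMBER_1_A_LONG_STRING"],
    "_pd_data.point_id": ["1", "2", "3"],
}

inferred = resolve_child(block, "_pd_data.diffractogram_id", "_pd_diffractogram.id")
print(inferred)
```

With two or more different values of `_pd_diffractogram.id` in scope, the same call raises `ValueError`, which corresponds to the case where the linked data name has to be written out.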

> Does "child data names" mean (i) a linked data name, or (ii) the data name of a category that is a subcategory of another?

It must be an explicitly linked data name.

> > Where a data set is created that is not going to be expanded in the future or have data blocks mixed and matched to create new data sets, it is likely that very simple ids are sufficient. That, I think, describes the vast bulk of data sets produced.
>
> So is it OK for data values that are keys to have the same value in different data blocks in the one container/collection?

If it is a collection that is meant to form a single data set, keys may have the same value in different data blocks only if rows with the same key data values do not end up with contradictory values for the other data names in that row.
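The condition can be sketched as a merge of one category across data blocks that reports a clash whenever two rows share a key but disagree on another data name. The example uses the `_pd_phase_mass` values from the two blocks in the opening comment. The function `merge_category` is hypothetical, the `diffractogram_id` column is written out here although the blocks only imply it through `_pd_diffractogram.id`, and both choices of key are assumptions for illustration rather than the dictionary's definition.

```python
def merge_category(blocks, key_names):
    """Merge the rows of one category from several data blocks into a single table.

    Returns the merged rows keyed by their key tuple, plus a list of
    (key, block_name) pairs where a row contradicted an earlier row with the same key.
    """
    merged = {}
    clashes = []
    for block_name, rows in blocks.items():
        for row in rows:
            key = tuple(row[name] for name in key_names)
            previous = merged.get(key)
            if previous is None:
                merged[key] = dict(row)
            elif any(n in previous and previous[n] != v for n, v in row.items()):
                # Same key, different value for some other data name: a contradiction.
                clashes.append((key, block_name))
            else:
                # Same key and no disagreement: the rows describe the same thing.
                previous.update(row)
    return merged, clashes


d1, d2 = "DIFFPAT_NUMBER_1_A_LONG_STRING", "DIFFPAT_NUMBER_2_A_LONG_STRING"
blocks = {
    "diffpatt_1": [
        {"diffractogram_id": d1, "phase_list_id": "1", "percent": "72.11"},
        {"diffractogram_id": d1, "phase_list_id": "2", "percent": "27.89"},
    ],
    "diffpatt_2": [
        {"diffractogram_id": d2, "phase_list_id": "1", "percent": "95.95"},
        {"diffractogram_id": d2, "phase_list_id": "2", "percent": "4.05"},
    ],
}

# Assumption: phase_list_id alone is the key, so the short ids collide across blocks.
_, clashes_simple = merge_category(blocks, ["phase_list_id"])
# Assumption: the diffractogram id is part of the key, so the rows stay distinct.
merged, clashes_full = merge_category(blocks, ["diffractogram_id", "phase_list_id"])

print(clashes_simple)
print(clashes_full)
```

A check of this kind is one way the clashes mentioned earlier in the thread could be detected at data block amalgamation time.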

> > Where data blocks containing calibration information are included, they would need to have more or less unique ids to make sure that they don't clash with IDs chosen by the data set they are being joined to. Again, such clashes can be detected and corrected at data block amalgamation time if desired.
>
> Calibration datasets are an outlier; I believe they should be specified quite uniquely, as they will be referenced in a lot of places.

Yes.

rowlesmr commented 1 year ago

Ugh. My brain is starting to hurt and I'm getting confused between all the conversations/threads/emails. But I do think you're starting to get through.

rowlesmr commented 1 year ago

Long and short, IDs need to be unique in the whole world. See https://github.com/COMCIFS/comcifs.github.io/blob/master/draft/block_collections.md for a proposed method to provide a namespace for identifiers to live in.

Also https://github.com/COMCIFS/Powder_Dictionary/issues/56#issuecomment-1592258714

COMCIFS / Powder_Dictionary

How unique do id-like identifiers need to be? #75