COMCIFS / Powder_Dictionary

CIF definitions for powder diffraction

A better way to loop `_pd_phase.id` and `_pd_diffractogram.id` #74

Closed rowlesmr closed 1 year ago

rowlesmr commented 1 year ago

In the DDL1 dictionary language, there was no way to formally link the contents of different data blocks. To solve this issue @briantoby (and friends) invented _pd_block_id, whose value is a globally* unique ID, in order to identify other data blocks containing information pertinent to the current data block.

In the original DDL1 pdCIF dictionary (which is still the current official release), PD_PHASE is a _list yes category (ie Loop in DDLm), containing _pd_phase_block_id, _pd_phase_id, _pd_phase_mass_%, and _pd_phase_name. _pd_phase_id served only to link between _pd_phase_block_id and _refln_phase_id. See also description in §3.3.6.1 in Vol. G.

Similarly, PD_BLOCK is a _list yes category (ie Loop), containing _pd_block_diffractogram_id and _pd_block_id.

As both were lists, this enabled each of them to loop within themselves, allowing links between blocks containing phase and diffractogram information to be made.

Note: There was a single identifier (_pd_block_id), and depending on the data item it was referenced from, it referred to either phase or diffractogram information (eg _pd_phase_block_id, _pd_block_diffractogram_id, _pd_calib_std_external_block_id...).


The new DDLm dictionary language allows for a formal definition of linkages between categories following the form of a relational database.

In the current development DDLm pdCIF dictionary, we now have formal definitions of "arbitrary labels" to identify data blocks containing phase and/or diffractogram information: _pd_phase.id, and _pd_diffractogram.id. These ID values are used in categories (eg PD_PHASE_MASS) to refer to the particular phase/diffractogram ID of interest (eg _pd_phase_mass.phase_id and _pd_phase_mass.diffractogram_id).

In adding these two separate identifiers, the existing PD_PHASE and newly created PD_DIFFRACTOGRAM categories are now both Set categories (ie _list no in DDL1), meaning that these values cannot be looped to show, for example, all the phases present in a diffractogram.

As a part of this process, we created the PD_PHASE_BLOCK and PD_BLOCK_DIFFRACTOGRAM categories as Loop in order to allow block IDs to be looped (_pd_phase_block.id and _pd_block_diffractogram.id), and then added _pd_phase_block.phase_id and _pd_block_diffractogram.diffractogram_id to allow phase and diffractogram IDs to be looped.

I now believe this decision was wrong, as it conflates the ideas of block IDs and phase/diffractogram IDs; yes, they are both identifiers, but block IDs identify a specific data block, while phase/diffractogram IDs identify any and all blocks containing information pertinent to a particular phase/diffractogram. The definitions of PD_PHASE_BLOCK and PD_BLOCK_DIFFRACTOGRAM should be reverted such that they can only refer to block IDs, and new categories created to enable listing of phase/diffractogram IDs. Side note: I believe the definitions of _pd_phase.id and _pd_diffractogram.id should be strengthened such that they are globally* unique.
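To make the distinction concrete, here is a minimal Python sketch. It is not taken from any dictionary or CIF library: the `index_blocks` helper and all ID values are made up for illustration. It assumes a block ID maps to exactly one data block, while a phase ID may be shared by several blocks.

```python
from collections import defaultdict

# (block id, phase id) pairs; every value here is hypothetical.
blocks = [
    ("BLOCK_A", "ALUMINA"),
    ("BLOCK_B", "ALUMINA"),
    ("BLOCK_C", "SILICON"),
]


def index_blocks(blocks):
    """Build both lookups: block id -> phase id, and phase id -> block ids."""
    by_block_id = {}
    by_phase_id = defaultdict(list)
    for block_id, phase_id in blocks:
        # A block id must identify one specific data block.
        if block_id in by_block_id:
            raise ValueError(f"duplicate block id: {block_id}")
        by_block_id[block_id] = phase_id
        # A phase id identifies any and all blocks describing that phase.
        by_phase_id[phase_id].append(block_id)
    return by_block_id, dict(by_phase_id)


by_block_id, by_phase_id = index_blocks(blocks)
```

Looking up `by_phase_id["ALUMINA"]` gives two blocks, whereas each key of `by_block_id` points at a single block, which is the asymmetry that a shared category would blur.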

To this end, I believe we should create two Loop categories, PD_PHASE_LIST and PD_DIFFRACTOGRAM_LIST, to deal with cases when we need to loop phase and diffractogram ids.

These two categories would allow for the looping of phase and diffractogram ids in data blocks where necessary, such as in a block containing a diffractogram with many phases.

Further to the creation of these two categories, other categories that have _categoryname.phase_id and/or _categoryname.diffractogram_id should also have _categoryname.phase_list_id and/or _categoryname.diffractogram_list_id to be able to reference a _pd_phase_list.id or _pd_diffractogram_list.id. Such an addition should allow (I think?) something like the following to be written:

This is a single CIF file containing six data blocks: four phases and two diffraction patterns:

#\#CIF_2.0
data_alumina_1
_pd_phase.id   ALUMINA_A_BIG_LONG_STRING_THAT_IS_GLOBALLY_UNIQUE
# other phase info

data_alumina_2
_pd_phase.id   ALUMINA_ANOTHER_SUPER_STRING_THAT_IS_UNIVERSALLY_UNIQUE
# other phase info

data_silicon_1
_pd_phase.id   SILICON_ANOTHER_STRING_THAT_IS_SUPPOSED_TO_BE_ALONE
# other phase info

data_silicon_2
_pd_phase.id   SILICON_MORE_STRINGS_THAT_ARE_ALL_ALONE
# other phase info

data_DIFFPATT_1
_pd_diffractogram.id    DIFFPAT_1_LONG_UNIQUE_STRING

loop_
_pd_phase_list.id
_pd_phase_list.phase_id
1   ALUMINA_A_BIG_LONG_STRING_THAT_IS_GLOBALLY_UNIQUE
2   SILICON_ANOTHER_STRING_THAT_IS_SUPPOSED_TO_BE_ALONE

loop_
_pd_phase_mass.phase_list_id
_pd_phase_mass.percent
1   72.11
2   27.89

loop_
_pd_pref_orient_March_Dollase.id
_pd_pref_orient_March_Dollase.phase_list_id
_pd_pref_orient_March_Dollase.hkl
_pd_pref_orient_March_Dollase.fract
_pd_pref_orient_March_Dollase.r
1   1   [1 0 4] 0.69    0.75
2   1   [1 0 1] 0.31    1.42
1   2   [1 1 1] .   0.98

loop_
_refln.index_h
_refln.index_k
_refln.index_l
_pd_refln.phase_list_id
_refln.d_spacing
0    0   12 1 1.082702
1    0  -14 1 0.905365
2    2    0 2 1.920168
3    1    1 2 1.637525
#...

loop_
_pd_meas.2theta_scan
_pd_proc.intensity_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
5.001    63.364    35.994    35.991 
5.004    78.007    36.200    36.200 
5.007    78.318    36.404    36.403 
#...

data_DIFFPATT_2
_pd_diffractogram.id    DIFFPAT_2_AN_EVEN_LONGER_UNIQUE_STRING

loop_
_pd_phase_list.id
_pd_phase_list.phase_id
1   ALUMINA_ANOTHER_SUPER_STRING_THAT_IS_UNIVERSALLY_UNIQUE
2   SILICON_MORE_STRINGS_THAT_ARE_ALL_ALONE

loop_
_pd_phase_mass.phase_list_id
_pd_phase_mass.percent
1   88.12
2   11.88

loop_
_pd_pref_orient_March_Dollase.id
_pd_pref_orient_March_Dollase.phase_list_id
_pd_pref_orient_March_Dollase.hkl
_pd_pref_orient_March_Dollase.fract
_pd_pref_orient_March_Dollase.r
1   1   [1 0 4] 0.75    0.98
2   1   [1 0 1] 0.25    1.05
1   2   [1 1 1] .   0.999

loop_
_refln.index_h
_refln.index_k
_refln.index_l
_pd_refln.phase_list_id
_refln.d_spacing
0    0   12 1 1.092702
1    0  -14 1 0.906365
2    2    0 2 1.921168
3    1    1 2 1.638525
#...

loop_
_pd_meas.2theta_scan
_pd_proc.intensity_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
5.001    43.364    25.994    25.991 
5.004    38.007    26.200    26.200 
5.007    38.318    26.404    26.403 
#...

Is there enough information in here to properly link the PO corrections in data_DIFFPATT_2 to the "correct" _pd_phase.id values, and not the other phases?

Comments, questions, suggestions, ideas, criticisms?

* literally, as in, the entire world.

rowlesmr commented 1 year ago

Please note that the _pd_phase_list.id will repeat in different data blocks, and therefore, so too will _pd_phase_mass.phase_list_id, _pd_refln.phase_list_id...

I don't know if that is OK.
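To illustrate the concern, here is a small Python sketch. The `ambiguous_keys` helper and the row layout are hypothetical. It pools the _pd_phase_list rows from both diffractogram blocks and checks which choice of key still points to more than one phase.

```python
from collections import defaultdict

# (data block, _pd_phase_list.id, _pd_phase_list.phase_id), pooled from both blocks.
rows = [
    ("DIFFPATT_1", "1", "ALUMINA_A_BIG_LONG_STRING_THAT_IS_GLOBALLY_UNIQUE"),
    ("DIFFPATT_1", "2", "SILICON_ANOTHER_STRING_THAT_IS_SUPPOSED_TO_BE_ALONE"),
    ("DIFFPATT_2", "1", "ALUMINA_ANOTHER_SUPER_STRING_THAT_IS_UNIVERSALLY_UNIQUE"),
    ("DIFFPATT_2", "2", "SILICON_MORE_STRINGS_THAT_ARE_ALL_ALONE"),
]


def ambiguous_keys(rows, key):
    """Return the keys that map to more than one distinct phase id."""
    seen = defaultdict(set)
    for block, list_id, phase_id in rows:
        seen[key(block, list_id)].add(phase_id)
    return {k: v for k, v in seen.items() if len(v) > 1}


# Keyed on the list id alone: both "1" and "2" are ambiguous.
by_list_id = ambiguous_keys(rows, lambda block, list_id: list_id)

# Keyed on (data block, list id): nothing is ambiguous.
by_block_and_list_id = ambiguous_keys(rows, lambda block, list_id: (block, list_id))
```

So the repeated values are harmless only if a reader is allowed to treat the enclosing data block as part of the key. If blocks are ever merged into one relational view, the list ids collide.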

rowlesmr commented 1 year ago

I hope this level of indirection is possible:

loop_
_pd_phase_list.id
_pd_phase_list.phase_id
1   ALUMINA_A_BIG_LONG_STRING_THAT_IS_GLOBALLY_UNIQUE
2   SILICON_ANOTHER_STRING_THAT_IS_SUPPOSED_TO_BE_ALONE

loop_
_pd_phase_mass.phase_list_id
_pd_phase_mass.percent
1   72.11
2   27.89

PD_PHASE_LIST is keyed on _pd_phase_list.id and _pd_phase_list.phase_id. _pd_phase_list.id is an arbitrary id and _pd_phase_list.phase_id is a child of _pd_phase.id.

PD_PHASE_MASS is keyed on _pd_phase_mass.phase_id and _pd_phase_mass.diffractogram_id, which are children of _pd_phase.id and _pd_diffractogram.id, respectively.

If we add _pd_phase_mass.phase_list_id to PD_PHASE_MASS as a child of _pd_phase_list.id, are the values of _pd_phase_mass.percent still able to be assigned to the correct phases?
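As a rough check, the composite key can be rebuilt from the pieces available in the block. This is a minimal Python sketch with a hypothetical `phase_mass_rows` helper. It assumes that _pd_phase_list.id is unique within the block and that the missing _pd_phase_mass.diffractogram_id defaults to the block's own _pd_diffractogram.id.

```python
def phase_mass_rows(diffractogram_id, phase_list, phase_mass):
    """Expand (phase_list_id, percent) rows into (phase_id, diffractogram_id, percent)."""
    expanded = []
    for list_id, percent in phase_mass:
        if list_id not in phase_list:
            # A dangling phase_list_id cannot be tied to any phase.
            raise KeyError(f"no _pd_phase_list.id {list_id!r} in this block")
        expanded.append((phase_list[list_id], diffractogram_id, percent))
    return expanded


# Values transcribed from data_DIFFPATT_1.
phase_list = {
    "1": "ALUMINA_A_BIG_LONG_STRING_THAT_IS_GLOBALLY_UNIQUE",
    "2": "SILICON_ANOTHER_STRING_THAT_IS_SUPPOSED_TO_BE_ALONE",
}
phase_mass = [("1", 72.11), ("2", 27.89)]

rows = phase_mass_rows("DIFFPAT_1_LONG_UNIQUE_STRING", phase_list, phase_mass)
```

Each resulting row carries the full (phase_id, diffractogram_id) key, so the percentages can be assigned to the right phases, but only because the reader performs the extra lookup. The sketch says nothing about whether DDLm can express that lookup in the dictionary itself.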

rowlesmr commented 1 year ago

I think this is a lost cause.

There are too many things with phase_ids and diffractogram_ids to then go back and double up with *_list_ids, too.