COMCIFS / Powder_Dictionary

CIF definitions for powder diffraction

Need data name to act as the original meaning of `_pd_phase_id` #71

Closed by rowlesmr 1 year ago

rowlesmr commented 1 year ago

From the end of https://github.com/COMCIFS/Powder_Dictionary/pull/39#issuecomment-1371739998

In the original pdCIF dictionary (or at least the one converted to DDLm), PD_PHASE was a Loop category, and _pd_phase_id served only as a link between _pd_phase_block_id and _pd_refln_phase_id. See also the description in §3.3.6.1 of Vol. G.

We no longer have the ability to use a short, block-level identifier.

I think we should create something like _pd_phase.short_id and _pd_diffractogram.short_id, which have block scope (i.e. they need only be unique locally, not globally), to link between things like _pd_refln.phase_id, _pd_pref_orient_March_Dollase.diffractogram_id, etc., so that we don't have to have huge, UUID-like values everywhere.

For example, taking the phase id, in the CIF 1.1 version of things you would write:

#\#CIF_1.1
data_xye_files\576460-mac-001_reb_0003.xye_0

_pd_block_id                    xye_files\576460-mac-001_reb_0003.xye_0

loop_
_pd_phase_id  # this data item acts only as a link between _pd_refln_phase_id and _pd_phase_block_id
_pd_phase_block_id
_pd_phase_mass_%
1   ALUMINA_A_BIG_LONG_STRING_THAT_IS_GLOBALLY_UNIQUE   72.11
2   SILICON_ANOTHER_STRING_THAT_IS_SUPPOSED_TO_BE_ALONE 27.89

loop_
_refln_index_h
_refln_index_k
_refln_index_l
_pd_refln_phase_id
_refln_d_spacing
0    0   12 1    1.082702
1    0  -14 1    0.905365
2    2    0 2    1.920168
3    1    1 2    1.637525
#...

 loop_
_pd_meas_2theta_scan
_pd_proc_intensity_total
_pd_calc_intensity_total
_pd_proc_intensity_bkg_calc
5.001    43.364    25.994    25.991 
5.004    38.007    26.200    26.200 
5.007    38.318    26.404    26.403 
#...

With the current dictionary, you would have to do:

#\#CIF_2.0
data_xye_files\576460-mac-001_reb_0003.xye_0

_pd_diffractogram.id                    xye_files\576460-mac-001_reb_0003.xye_0

loop_
_pd_phase_mass.phase_id
_pd_phase_mass.percent
ALUMINA_A_BIG_LONG_STRING_THAT_IS_GLOBALLY_UNIQUE   72.11
SILICON_ANOTHER_STRING_THAT_IS_SUPPOSED_TO_BE_ALONE 27.89

loop_
_refln.index_h
_refln.index_k
_refln.index_l
_pd_refln.phase_id
_refln.d_spacing
0    0   12 ALUMINA_A_BIG_LONG_STRING_THAT_IS_GLOBALLY_UNIQUE       1.082702
1    0  -14 ALUMINA_A_BIG_LONG_STRING_THAT_IS_GLOBALLY_UNIQUE       0.905365
2    2    0 SILICON_ANOTHER_STRING_THAT_IS_SUPPOSED_TO_BE_ALONE     1.920168
3    1    1 SILICON_ANOTHER_STRING_THAT_IS_SUPPOSED_TO_BE_ALONE     1.637525
#...

 loop_
_pd_meas.2theta_scan
_pd_proc.intensity_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
5.001    43.364    25.994    25.991 
5.004    38.007    26.200    26.200 
5.007    38.318    26.404    26.403 
#...

which is not very pretty.
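To make the comparison concrete, here is a minimal Python sketch (not part of either dictionary) of what a reader has to do in the CIF 1.1 layout: expand the short `_pd_phase_id` values in the reflection loop into the long block ids via the phase loop. The names `phase_loop`, `refln_loop` and `expand_phase_ids` are hypothetical stand-ins for already-parsed loops; no CIF parsing library is assumed.

```python
# Hypothetical, already-parsed loops from the CIF 1.1 example above.
# Phase loop: short _pd_phase_id -> long _pd_phase_block_id.
phase_loop = {
    "1": "ALUMINA_A_BIG_LONG_STRING_THAT_IS_GLOBALLY_UNIQUE",
    "2": "SILICON_ANOTHER_STRING_THAT_IS_SUPPOSED_TO_BE_ALONE",
}

# Reflection loop rows: (h, k, l, _pd_refln_phase_id, d_spacing).
refln_loop = [
    (0, 0, 12, "1", 1.082702),
    (1, 0, -14, "1", 0.905365),
    (2, 2, 0, "2", 1.920168),
    (3, 1, 1, "2", 1.637525),
]


def expand_phase_ids(reflns, phases):
    """Replace each short phase id with the long block id it links to."""
    expanded = []
    for h, k, l, short_id, d in reflns:
        if short_id not in phases:
            raise KeyError(
                f"reflection ({h} {k} {l}) uses unknown phase id {short_id!r}"
            )
        expanded.append((h, k, l, phases[short_id], d))
    return expanded


expanded_reflns = expand_phase_ids(refln_loop, phase_loop)
```

The expanded rows carry the same information as the CIF 2.0 reflection loop above, which is the point: the short id is purely a block-local shorthand for the long one.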

rowlesmr commented 1 year ago

Ignore this comment: I forgot that you can't loop items from different categories.

Could something like this work?

#\#CIF_2.0
data_xye_files\576460-mac-001_reb_0003.xye_0

_pd_diffractogram.id                    xye_files\576460-mac-001_reb_0003.xye_0

loop_
_pd_phase_block.short_id
_pd_phase_block.phase_id
_pd_phase_mass.percent
1   ALUMINA_A_BIG_LONG_STRING_THAT_IS_GLOBALLY_UNIQUE   72.11
2   SILICON_ANOTHER_STRING_THAT_IS_SUPPOSED_TO_BE_ALONE 27.89

loop_
_refln.index_h
_refln.index_k
_refln.index_l
_pd_refln.phase_id
_refln.d_spacing
0    0   12 1 1.082702
1    0  -14 1 0.905365
2    2    0 2 1.920168
3    1    1 2 1.637525
#...

 loop_
_pd_meas.2theta_scan
_pd_proc.intensity_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
5.001    43.364    25.994    25.991 
5.004    38.007    26.200    26.200 
5.007    38.318    26.404    26.403 
#...

rowlesmr commented 1 year ago

A slightly more radical idea: revert PD_PHASE_BLOCK and PD_BLOCK_DIFFRACTOGRAM so that they deal only with block id values and have nothing at all to do with the new phase_id and diffractogram_id. Create PD_PHASE_LIST and PD_DIFFRACTOGRAM_LIST to deal with cases where we need to loop phase and diffractogram ids.

Both would be Loop categories. Both would have two data items.

In this world, the new data item _pd_refln.phase_list_id would be linked to _pd_phase_list.id and aliased with _pd_refln_phase_id. It would become a category key for REFLN, replacing _pd_refln.phase_id (which gets deprecated).

Furthermore, categories that have _categoryname.phase_id and/or _categoryname.diffractogram_id will also get _categoryname.phase_list_id and/or _categoryname.diffractogram_list_id to be able to take advantage of this.

#\#CIF_2.0
data_xye_files\576460-mac-001_reb_0003.xye_0

_pd_diffractogram.id                    xye_files\576460-mac-001_reb_0003.xye_0

loop_
_pd_phase_list.id
_pd_phase_list.phase_id
1   ALUMINA_A_BIG_LONG_STRING_THAT_IS_GLOBALLY_UNIQUE
2   SILICON_ANOTHER_STRING_THAT_IS_SUPPOSED_TO_BE_ALONE

loop_
_pd_phase_mass.phase_list_id
_pd_phase_mass.percent
1   72.11
2   27.89

loop_
_refln.index_h
_refln.index_k
_refln.index_l
_pd_refln.phase_list_id
_refln.d_spacing
0    0   12 1 1.082702
1    0  -14 1 0.905365
2    2    0 2 1.920168
3    1    1 2 1.637525
#...

 loop_
_pd_meas.2theta_scan
_pd_proc.intensity_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
5.001    43.364    25.994    25.991 
5.004    38.007    26.200    26.200 
5.007    38.318    26.404    26.403 
#...

* Do these arbitrary codes need to be unique within the entire container, or just in the block?
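The following Python sketch does not answer that question; it only illustrates what block scope would mean in practice. It assumes each data block carries its own short-code list, so the same code may be reused in another block with a different meaning. The `container` layout, the block names and both helper functions are hypothetical.

```python
# Hypothetical container: data block name -> {short _pd_phase_list.id -> long phase_id}.
# The block and phase names are invented for illustration.
container = {
    "diffpat1": {"1": "PHASE_ONE_ID", "2": "PHASE_TWO_ID"},
    "diffpat2": {"1": "PHASE_TWO_ID"},  # "1" is reused with a different meaning
}


def resolve(container, block_name, short_id):
    """Resolve a short code using only the list in the same data block."""
    return container[block_name][short_id]


def codes_unique_in_container(container):
    """Return True if no short code appears in more than one data block."""
    seen = set()
    for phase_list in container.values():
        for short_id in phase_list:
            if short_id in seen:
                return False
            seen.add(short_id)
    return True
```

Under block scope, `resolve(container, "diffpat2", "1")` and `resolve(container, "diffpat1", "1")` legitimately give different phases. Under container scope, this example would be invalid, as `codes_unique_in_container` reports.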

rowlesmr commented 1 year ago

This has wandered a bit off course. I'll try again.

jamesrhester commented 1 year ago

OK, interesting ideas. Note in your initial example that if each data block contains only a single phase, then the long complicated phase id is only provided once in the file, and all other values are then obtained from that value, so I don't see long complicated names as being an issue. I think we can simply instruct the pdCIF creator to ensure that all phase_id values will be unique within the data set.
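As a rough illustration of what "unique within the data set" could mean for a pdCIF creator, here is a minimal Python sketch of a consistency check. It assumes the phase ids have already been collected from every data block; the function name and the example id values are hypothetical, not part of the dictionary.

```python
from collections import Counter


def check_phase_ids(defined_ids, referenced_ids):
    """Return (duplicated definitions, references with no definition)."""
    counts = Counter(defined_ids)
    duplicates = sorted(pid for pid, n in counts.items() if n > 1)
    dangling = sorted(set(referenced_ids) - set(counts))
    return duplicates, dangling


# Hypothetical values gathered from every data block in one data set.
defined = ["PHASE_ONE_ID", "PHASE_TWO_ID"]     # _pd_phase.id values
referenced = ["PHASE_ONE_ID", "PHASE_TWO_ID"]  # e.g. _pd_phase_mass.phase_id values

duplicates, dangling = check_phase_ids(defined, referenced)
```

Both returned lists being empty means every phase id is defined exactly once and every reference resolves.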

rowlesmr commented 1 year ago

> if each data block contains only a single phase, then the long complicated phase id is only provided once in the file, and all other values are then obtained from that value

I don't understand how this works. How can you refer to other things without their (long) key?

data_phase1
_pd_phase.id   PHASE_ONE_ID
# phase 1 information goes here

data_phase2
_pd_phase.id   PHASE_TWO_ID
# phase 2 information goes here

data_diffpat1
_pd_diffractogram.id   DIFFPATT_WITH_PHASE_1

loop_
_pd_phase_mass.phase_id  # what else goes here but a _pd_phase.id?
_pd_phase_mass.percent
PHASE_ONE_ID   72.11  # I don't know, the rest is amorphous...

# other diffractogram info goes here

data_diffpat2
_pd_diffractogram.id   DIFFPATT_WITH_PHASE_2

loop_
_pd_phase_mass.phase_id  # what else goes here but a _pd_phase.id?
_pd_phase_mass.percent
PHASE_TWO_ID   37.33  # I don't know, the rest is amorphous...

# other diffractogram info goes here

rowlesmr commented 1 year ago

Can we take this to #74? It's a better version of the description of PD_PHASE_LIST and PD_DIFFRACTOGRAM_LIST.

jamesrhester commented 1 year ago

> I don't understand how this works. How can you refer to other things without their (long) key?

I misspoke: I meant provided once per data block, not per file. Let us move to #74.