Closed jamesrhester closed 1 year ago
I will create an example containing both block pointers and these pointers to demonstrate. The idea of this PR is to create the category, and then other PRs can add in child data names of _pd_diffractogram.id. I did pd_calc/meas/proc in this PR to show how it works.
At the most basic level I can write:
loop_
_pd_phase.id
_pd_phase.block_id
1 the_first_phase
two the_second_phase
and then refer to 1 or two in things like _pd_refln.phase_id.
Does this addition mean I can write:
loop_
_pd_diffractogram.id
_pd_block_diffractogram.id
1 the_first_histogram
two the_second_histogram
and then refer to 1 or two in things like pd_pref_orient_sphericalharmonics.diffractogram_id?
Yes, that is correct. Sorry I haven't put together an example yet, I'll do that now.
OK, here is an example of the use of _pd_diffractogram.id. This is the same content as example 3.3.7.1 of IT Vol G. There are now 9 blocks: one that contains no phase or diffractogram dependent information; one each for the two diffractograms and two phases containing only information relevant to those diffractograms/phases; and 2x2=4 data blocks containing information specific to a combination of a particular phase and diffractogram.
Again, I am not saying information must be separated into blocks in this way, but this corresponds to the default in the absence of any other guidance.
You will note that _pd_diffractogram.id simply serves to identify the diffractogram to which the contents of the current block relate. There is no effort to point to a block that contains other information about the diffractogram. So the sum total of information about that diffractogram is just the collection of blocks with the same _pd_diffractogram.id, where the way in which the blocks are collected is not specified: it may be via block pointers, it may be by virtue of being in the same CIF file, or it may be by virtue of being in the same archive.
# Example adapted from Example 3.3.7.1 in IT Vol G
# First edition.
#
# Describing a mixture of Ni and Si powder collected
# on two different banks of a TOF machine.
#
# So there are two phases (Ni and Si) and two
# diffractograms.
#
# In addition to the block pointers linking phases
# to diffractograms, there are phase identifiers
# and diffractogram identifiers also provided to
# allow non-pdcif aware software to properly
# assemble multiple data blocks together.
#
#= First CIF block ==================================
data_NISI_overall
_pd_block_id 2003-02-04T18:02|NISI|B_H_Toby|Overall
# publication and sample preparation information
# appears here (_publ_*, _journal_*, _pd_char_*
# _pd_prep_* items are omitted for brevity)
# Overall powder R-factors
_pd_proc_ls_prof_wR_factor 0.0370
# (other _refine_ls_* items omitted for brevity)
# pointers to the phase blocks
loop_ _pd_phase_block_id
2003-02-04T18:02|NISI_phase1|B_H_Toby||
2003-02-04T18:02|NISI_phase2|B_H_Toby||
# pointers to the diffraction patterns
loop_ _pd_block_diffractogram_id
2003-02-04T18:02|NISI_H_01|B_H_Toby|GPD
2003-02-04T18:02|NISI_H_02|B_H_Toby|GPD
#= Second CIF block =================================
# Information for phase 1
data_NISI_phase_1
_pd_block_id 2003-02-04T18:02|NISI_phase1|B_H_Toby||
# Data sets for phase 1
loop_ _pd_block_diffractogram_id
2003-02-04T18:02|NISI_H_01|B_H_Toby|GPD
2003-02-04T18:02|NISI_H_02|B_H_Toby|GPD
# Any phase-specific information in this block relates
# to this phase.
_pd_phase.id Ni
_cell_length_a 3.523433(29)
_cell_length_b 3.523433
_cell_length_c 3.523433
_cell_angle_alpha 90.0
_cell_angle_beta 90.0
_cell_angle_gamma 90.0
_cell_volume 43.74194
_symmetry_cell_setting cubic
_symmetry_space_group_name_H-M "F m 3 m"
loop_
_symmetry_equiv_pos_site_id
_symmetry_equiv_pos_as_xyz
1 +x,+y,+z 2 -x,-y,-z
# (other symmetry operations omitted for brevity)
loop_
_atom_site_type_symbol
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
_atom_site_occupancy
_atom_site_thermal_displace_type
_atom_site_U_iso_or_equiv
_atom_site_symmetry_multiplicity
NI 0.0 0.0 0.0 1.0 Uiso 0.00435(10) 4
loop_
_atom_type_symbol
_atom_type_number_in_cell
NI 4.0
# (_chemical_* & _geom_* items omitted for brevity)
#= Third CIF block ==================================
# Information for phase 2
data_NISI_phase_2
_pd_block_id 2003-02-04T18:02|NISI_phase2|B_H_Toby||
# Data sets for phase 2
loop_ _pd_block_diffractogram_id
2003-02-04T18:02|NISI_H_01|B_H_Toby|GPD
2003-02-04T18:02|NISI_H_02|B_H_Toby|GPD
# Any phase-specific information in this block relates
# to this phase.
_pd_phase.id Si
_pd_phase_name Silicon
_cell_length_a 5.42957(9)
_cell_length_b 5.42957
_cell_length_c 5.42957
_cell_angle_alpha 90.0
_cell_angle_beta 90.0
_cell_angle_gamma 90.0
_cell_volume 160.06508
_symmetry_cell_setting cubic
_symmetry_space_group_name_H-M "F d 3 m"
loop_
_symmetry_equiv_pos_site_id
_symmetry_equiv_pos_as_xyz
1 +x,+y,+z 2 -x,-y,-z
# (other symmetry operations omitted for brevity)
loop_
_atom_site_type_symbol
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
_atom_site_occupancy
_atom_site_thermal_displace_type
_atom_site_U_iso_or_equiv
_atom_site_symmetry_multiplicity
SI 0.125 0.125 0.125 1.0 Uiso 0.00540(21) 8
loop_
_atom_type_symbol
_atom_type_number_in_cell
SI 8.0
# (_chemical_* & _geom_* items omitted for brevity)
#= Fourth CIF block =================================
# Powder diffraction data for data set 1
data_NISI_p_01
_pd_block_id 2003-02-04T18:02|NISI_H_01|B_H_Toby|GPD
# (numerous _exptl_, _pd_*, _diffrn_ items describing
# the data set are omitted for brevity)
# All diffractogram-specific information relates to
# this diffractogram id
_pd_diffractogram.id TOF1
loop_
_atom_type_symbol
_atom_type_scat_length_neutron
_atom_type_scat_source
NI 1.0300 International_Tables_Vol_C
SI 0.4149 International_Tables_Vol_C
_diffrn_radiation_probe neutron
_pd_proc_ls_prof_wR_factor 0.0384
_pd_proc_ls_prof_wR_expected 0.0294
_refine_ls_R_Fsqd_factor 0.07288
_pd_proc_info_datetime 2003-02-04T18:02:09
_pd_calc_method "Rietveld Refinement"
_pd_meas_2theta_fixed 148.29
#---- raw data loop -----
loop_
_pd_meas_time_of_flight
_pd_meas_intensity_total
_pd_meas_point_id
1000.0 1818(34) 626
# (4494 TOF & intensity values omitted for brevity)
_pd_meas_number_of_points 4495
#---- calculated data loop -----
loop_
_pd_proc_d_spacing
_pd_proc_intensity_total
_pd_proc_ls_weight
_pd_proc_intensity_bkg_calc
_pd_calc_intensity_total
_pd_proc_point_id
0.50035 0.424(7) 19401. 0.3726 0.4155 1
# (1647 processed/calculated points omitted for
# brevity)
_pd_proc_number_of_points 1648
_reflns_number_observed 60
# (_reflns_limit_* and _reflns_d_* items omitted for
# brevity)
#= Fifth CIF block ==================================
# Powder diffraction data for data set 2
data_NISI_p_02
_pd_block_id 2003-02-04T18:02|NISI_H_02|B_H_Toby|GPD
# (numerous _exptl_, _pd_*, _diffrn_ items describing
# the data set are omitted for brevity)
# All diffractogram-specific information relates to
# this diffractogram id
_pd_diffractogram.id TOF2
loop_
_atom_type_symbol
_atom_type_scat_length_neutron
_atom_type_scat_source
NI 1.0300 International_Tables_Vol_C
SI 0.4149 International_Tables_Vol_C
_diffrn_radiation_probe neutron
_pd_proc_ls_prof_wR_factor 0.0363
_pd_proc_ls_prof_wR_expected 0.0222
_refine_ls_R_Fsqd_factor 0.07645
_pd_proc_info_datetime 2003-02-04T18:02:09
_pd_calc_method "Rietveld Refinement"
_pd_meas_2theta_fixed 88.05
#---- raw data loop -----
loop_
_pd_meas_time_of_flight
_pd_meas_intensity_total
_pd_meas_point_id
750.4 2780(42) 470
# (4650 TOF & intensity values omitted for brevity)
_pd_meas_number_of_points 4651
#---- calculated data loop -----
loop_
_pd_proc_d_spacing
_pd_proc_intensity_total
_pd_proc_ls_weight
_pd_proc_intensity_bkg_calc
_pd_calc_intensity_total
_pd_proc_point_id
0.45802 0.778(9) 12931. 0.4211 0.7851 1
# (1932 processed/calculated points omitted for
# brevity)
_pd_proc_number_of_points 1933
# (_reflns_limit_* and _reflns_d_* items omitted for
# brevity)
#======Sixth CIF block (not present in original example)===#
# Per-phase, per diffractogram information
_pd_phase.id Ni
_pd_diffractogram.id TOF1
# phase table
_pd_phase_block_id 2003-02-04T18:02|NISI_phase1|B_H_Toby||
_pd_phase_mass_% 51(49)
loop_
_refln_index_h
_refln_index_k
_refln_index_l
_refln_observed_status
_refln_F_squared_meas
_refln_F_squared_calc
_refln_phase_calc
_refln_d_spacing
2 2 0 o 15.254 15.195 0.00 1.24572
# (54 reflections omitted for brevity)
4 4 4 o 7.498 8.733 0.00 0.50856
#======Seventh CIF block (not present in original example)===#
# Per-phase, per diffractogram information
_pd_phase.id Si
_pd_diffractogram.id TOF1
_pd_phase_block_id 2003-02-04T18:02|NISI_phase2|B_H_Toby||
_pd_phase_mass_% 49(49)
loop_
_refln_index_h
_refln_index_k
_refln_index_l
_refln_observed_status
_refln_F_squared_meas
_refln_F_squared_calc
_refln_phase_calc
_refln_d_spacing
4 0 0 o 9.773 9.812 180.00 1.35739
3 3 1 o 4.799 4.801 0.00 1.24563
9 5 3 o 2.350 2.396 0.00 0.50631
8 6 4 o 0.000 0.000 180.00 0.50412
#=====Eighth CIF block=====#
# Per-phase, per diffractogram information
_pd_phase.id Ni
_pd_diffractogram.id TOF2
# phase table
_pd_phase_block_id 2003-02-04T18:02|NISI_phase1|B_H_Toby||
_pd_phase_mass_% 51.38
# reflection table
loop_
_refln_index_h
_refln_index_k
_refln_index_l
_refln_observed_status
_refln_F_squared_meas
_refln_F_squared_calc
_refln_phase_calc
_refln_d_spacing
2 0 0 o 16.505 16.060 0.00 1.76172
7 3 1 o 7.261 7.499 0.00 0.45871
5 5 3 o 7.261 7.499 0.00 0.45871
#=====Ninth CIF block======#
# Per-phase, per diffractogram information
_pd_phase.id Si
_pd_diffractogram.id TOF2
_pd_phase_block_id 2003-02-04T18:02|NISI_phase2|B_H_Toby||
_pd_phase_mass_% 48.62(28)
# reflection table
loop_
_refln_index_h
_refln_index_k
_refln_index_l
_refln_observed_status
_refln_F_squared_meas
_refln_F_squared_calc
_refln_phase_calc
_refln_d_spacing
3 1 1 o 4.854 5.087 180.00 1.63708
2 2 2 o 0.000 0.000 0.00 1.56738
# (76 reflections omitted for brevity)
11 3 3 o 1.948 2.014 0.00 0.46053
10 6 2 o 0.000 0.000 0.00 0.45888
Just to get things straight in my head, with regard to just one of the extra blocks:
#======Seventh CIF block (not present in original example)===#
# Per-phase, per diffractogram information
data_seventhblock #I'm assuming this is supposed to be here
_pd_phase.id Si
_pd_diffractogram.id TOF1
_pd_phase_block_id 2003-02-04T18:02|NISI_phase2|B_H_Toby||
_pd_phase_mass_% 49(49)
loop_
_refln_index_h
_refln_index_k
_refln_index_l
_refln_observed_status
_refln_F_squared_meas
_refln_F_squared_calc
_refln_phase_calc
_refln_d_spacing
4 0 0 o 9.773 9.812 180.00 1.35739
3 3 1 o 4.799 4.801 0.00 1.24563
9 5 3 o 2.350 2.396 0.00 0.50631
8 6 4 o 0.000 0.000 180.00 0.50412
This datablock contains information about the phase Si and diffractogram TOF1 because of the given values of _pd_phase.id and _pd_diffractogram.id. To get all the information about this phase and diffractogram, we need to go over every datablock in the container (file, folder, server...) and collate all the blocks that contain _pd_phase.id Si or _pd_diffractogram.id TOF1. This is the normal way in which information is shared between blocks in CIF.
The pdCIF way of linking things is through block ids. _pd_phase_block_id says that there is phase-specific information in here belonging to the datablock labelled with the block id 2003-02-04T18:02|NISI_phase2|B_H_Toby||. Should there not also be a _pd_block_diffractogram_id with the value 2003-02-04T18:02|NISI_H_01|B_H_Toby|GPD?
Also, I think that the definition of _pd_phase.id now needs updating: "A code for each crystal phase used to link with _pd_refln.phase_id." doesn't really cut it anymore.
Also:
Currently, PD_PHASE is a Loop category and contains only _pd_phase.block_id and _pd_phase.id. It is a Loop category, as a single diffractogram can contain multiple phases, and that requires multiple _pd_phase.block_id values. You can also list multiple _pd_phase.id values along with those _pd_phase.block_id values.
The equivalent datanames for diffractograms are _pd_block_diffractogram.id and _pd_diffractogram.id, which belong to PD_BLOCK_DIFFRACTOGRAM (Loop) and PD_DIFFRACTOGRAM (Set), respectively.
There is a disparity here.
If a phase appears in multiple diffractograms, should we not be able to loop _pd_diffractogram.id?
> This datablock contains information about the phase Si and diffractogram TOF1 because of the given values of _pd_phase.id and _pd_diffractogram.id. To get all the information about this phase and diffractogram, we need to go over every datablock in the container (file, folder, server...) and collate all the blocks that contain _pd_phase.id Si or _pd_diffractogram.id TOF1. This is the normal way in which information is shared between blocks in CIF.
Yes, this is correct. I might not say "normal way..." but more "minimum requirement for linking information between blocks in CIF".
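To make the collation step concrete, here is a minimal Python sketch. It assumes the data blocks have already been parsed into plain dictionaries (the parsing itself, the block names, and the `collate` helper are all hypothetical and not part of any CIF library); the identifier values are taken from the Ni/Si example.

```python
# Hypothetical in-memory model: block name -> {data name: value}.
blocks = {
    "NISI_phase_1": {"_pd_phase.id": "Ni", "_cell_length_a": "3.523433(29)"},
    "NISI_phase_2": {"_pd_phase.id": "Si", "_cell_length_a": "5.42957(9)"},
    "NISI_p_01": {"_pd_diffractogram.id": "TOF1", "_pd_meas_2theta_fixed": "148.29"},
    "seventhblock": {
        "_pd_phase.id": "Si",
        "_pd_diffractogram.id": "TOF1",
        "_pd_phase_mass_%": "49(49)",
    },
}

# Shorthand keyword -> data name; the keywords are invented for this sketch.
TAGS = {"phase": "_pd_phase.id", "diffractogram": "_pd_diffractogram.id"}


def collate(blocks, **wanted):
    """Return the names of blocks that carry any of the requested identifiers."""
    return sorted(
        name
        for name, items in blocks.items()
        if any(items.get(TAGS[key]) == value for key, value in wanted.items())
    )


related = collate(blocks, phase="Si", diffractogram="TOF1")
```

Nothing here follows a block pointer: membership is decided purely by the identifier values, which is the "minimum requirement" form of linking.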
> The pdCIF way of linking things is through block ids. _pd_phase_block_id says that there is phase-specific information in here belonging to the datablock labelled with the block id 2003-02-04T18:02|NISI_phase2|B_H_Toby||. Should there not also be a _pd_block_diffractogram_id with the value 2003-02-04T18:02|NISI_H_01|B_H_Toby|GPD?
Yes, that is correct.
> Also:
> Currently, PD_PHASE is a Loop category and contains only _pd_phase.block_id and _pd_phase.id. It is a Loop category, as a single diffractogram can contain multiple phases, and that requires multiple _pd_phase.block_id values. You can also list multiple _pd_phase.id values along with those _pd_phase.block_id values.
> The equivalent datanames for diffractograms are _pd_block_diffractogram.id and _pd_diffractogram.id, which belong to PD_BLOCK_DIFFRACTOGRAM (Loop) and PD_DIFFRACTOGRAM (Set), respectively.
> There is a disparity here.
> If a phase appears in multiple diffractograms, should we not be able to loop _pd_diffractogram.id?
OK, so the fundamental principle I'm trying to adhere to here is that the Default presentation is one phase or one diffractogram per data block, so PD_PHASE is a Set category and PD_DIFFRACTOGRAM is also a Set category.
In this presentation, there will be only one value of _pd_phase.block_id in a data block, so PD_PHASE should be a Set category. Any data block (e.g. a summary data block) that wants to tabulate the phases and list the data blocks where they can be found should simply set _audit.schema to something that is not Default, and then any category with a key data name can be looped in that data block regardless of Set or Loop type. The information listed would duplicate information that could be gained by going through the collection of data blocks if the other data blocks adhere to the Default presentation (and I'm not saying they have to).
So PD_PHASE should be Set.
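As a rough illustration of the rule just described, a validator could flag blocks that loop a Set-category key without declaring a non-Default schema. This is only a sketch: the dictionary-of-lists block model, the `violates_default_schema` helper and the list of Set keys are assumptions, and "Custom" merely stands in for whatever non-Default value of _audit.schema would be used.

```python
# Assumed Set-category key data names for this sketch.
SET_KEYS = ["_pd_phase.id", "_pd_diffractogram.id"]


def violates_default_schema(block, set_key_names):
    """Return the Set-category key names that carry more than one value.

    `block` maps each data name to a list of values (one entry per loop row).
    Blocks that declare a non-Default _audit.schema are not checked at all.
    """
    schema = block.get("_audit.schema", ["Default"])[0]
    if schema != "Default":
        return []
    return [n for n in set_key_names if len(set(block.get(n, []))) > 1]


# One phase and one diffractogram: fine under Default.
single = {"_pd_phase.id": ["Si"], "_pd_diffractogram.id": ["TOF1"]}
# A summary block that tabulates phases without declaring a schema.
summary = {"_pd_phase.id": ["Ni", "Si"]}
# The same summary block, with a placeholder non-Default schema value.
summary_declared = {"_audit.schema": ["Custom"], "_pd_phase.id": ["Ni", "Si"]}
```

Under these assumptions only the undeclared summary block is reported; the other two pass.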
> OK, so the fundamental principle I'm trying to adhere to here is that the Default presentation is one phase or one diffractogram per data block, so PD_PHASE is a Set category and PD_DIFFRACTOGRAM is also a Set category.
> In this presentation, there will be only one value of _pd_phase.block_id in a data block, so PD_PHASE should be a Set category. Any data block (e.g. a summary data block) that wants to tabulate the phases and list the data blocks where they can be found should simply set _audit.schema to something that is not Default, and then any category with a key data name can be looped in that data block regardless of Set or Loop type. The information listed would duplicate information that could be gained by going through the collection of data blocks if the other data blocks adhere to the Default presentation (and I'm not saying they have to).
> So PD_PHASE should be Set.
I'm not sure I like that approach. Now I need to either have a bunch of individual data blocks (which is error prone to produce), or use non-standard (non-default) CIF (and trust that software knows what to do) in order to list (for example) the QPA of some diffractograms:
I'm only presenting one diffractogram per block, it's just that they reference multiple phases.
#two diffraction patterns of the same specimen, ala the PD beamline at the Aussietron.
data_histogram1
_pd_diffractogram.id dp1
_pd_block.id the_first_diffpat
loop_
_pd_phase.id
_pd_phase.block_id
_pd_phase.mass_percent
ph1 Ni 73
ph2 Si 20
ph3 Fe 7
loop_
_pd_meas.2theta_scan
_pd_meas.counts_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
5.00 1234 1345 12
5.02 1246 1346 13
#...
data_histogram2
_pd_diffractogram.id dp2
_pd_block.id the_second_diffpat
loop_
_pd_phase.id
_pd_phase.block_id
_pd_phase.mass_percent
ph1 Ni 71
ph2 Si 21
ph3 Fe 8
loop_
_pd_meas.2theta_scan
_pd_meas.counts_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
5.00 1243 1354 10
5.02 1264 1355 11
#...
versus
#two diffraction patterns of the same specimen, ala the PD beamline at the Aussietron.
data_histogram1
_pd_diffractogram.id dp1
_pd_block.id the_first_diffpat
loop_
_pd_meas.2theta_scan
_pd_meas.counts_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
5.00 1234 1345 12
5.02 1246 1346 13
#...
data_histogram2
_pd_diffractogram.id dp2
_pd_block.id the_second_diffpat
loop_
_pd_meas.2theta_scan
_pd_meas.counts_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
5.00 1243 1354 10
5.02 1264 1355 11
#...
data_ph1_dp1
_pd_phase.id ph1
_pd_diffractogram.id dp1
_pd_phase.block_id Ni
_pd_block.diffractogram_id the_first_diffpat
_pd_phase.mass_percent 73
data_ph2_dp1
_pd_phase.id ph2
_pd_diffractogram.id dp1
_pd_phase.block_id Si
_pd_block.diffractogram_id the_first_diffpat
_pd_phase.mass_percent 20
data_ph3_dp1
_pd_phase.id ph3
_pd_diffractogram.id dp1
_pd_phase.block_id Fe
_pd_block.diffractogram_id the_first_diffpat
_pd_phase.mass_percent 7
data_ph1_dp2
_pd_phase.id ph1
_pd_diffractogram.id dp2
_pd_phase.block_id Ni
_pd_block.diffractogram_id the_second_diffpat
_pd_phase.mass_percent 71
data_ph2_dp2
_pd_phase.id ph2
_pd_diffractogram.id dp2
_pd_phase.block_id Si
_pd_block.diffractogram_id the_second_diffpat
_pd_phase.mass_percent 21
data_ph3_dp2
_pd_phase.id ph3
_pd_diffractogram.id dp2
_pd_phase.block_id Fe
_pd_block.diffractogram_id the_second_diffpat
_pd_phase.mass_percent 8
> I'm not sure I like that approach. Now I need to either have a bunch of individual data blocks (which is error prone to produce), or use non-standard (non-default) CIF (and trust that software knows what to do) in order to list (for example) the QPA of some diffractograms:
I understand your distaste. This is why I'm assuming that the powder CIF community will specify their own preferred way of grouping results into data blocks so that pdCIF-conversant software will only have to understand that one particular way. If that way is not the "Default" schema then "Set" and "Loop" category types are not a limitation and you can loop "Set" category items as long as a category key has been assigned in the dictionary.
Note that PD_PHASE being a Set category is essentially pre-defined because cif_core puts only a single "phase" into a single data block, i.e. one structure (cell/atom sites/space group).
But there still is only one diffraction pattern or phase in the data block, it just happens to want to reference many other phases or diffraction patterns, respectively. This even negates the original use of _pd_phase_id for linking a _pd_refln_peak_id with a _pd_block_id. @briantoby any input on this?
Shouldn't we try and bake desired behaviour into the standard, and not require users to bypass it to do what they want to do?
Could there be a _pd_phase.ids and _pd_diffractogram.ids specifically for looping in this type of thing? But that then introduces another level of indirection.
Don’t understand the issue
On Nov 22, 2022, at 8:30 AM, Matthew Rowles wrote:
> But there still is only one diffraction pattern or phase in the data block, it just happens to want to reference many other phases or diffraction patterns, respectively. This even negates the original use of _pd_phase_id for linking a _pd_refln_peak_id with a _pd_block_id. @briantoby any input on this?
> But there still is only one diffraction pattern or phase in the data block, it just happens to want to reference many other phases or diffraction patterns, respectively. This even negates the original use of _pd_phase_id for linking a _pd_refln_peak_id with a _pd_block_id. @briantoby any input on this?
So the issue as I understand it is that a diffraction pattern data block might want to point to data blocks containing all of the phases that are in this diffraction pattern, and a phase data block might want to point to all of the data blocks that contain diffractograms that include this phase. In either case restricting _pd_phase or _pd_diffractogram to be single-valued in those data blocks would make this impossible.
> Could there be a _pd_phase.ids and _pd_diffractogram.ids specifically for looping in this type of thing? But that then introduces another level of indirection.
We are forced to conclude that the uses of _pd_phase.id and _pd_phase.block_id are distinct and they should not be in the same category. How best to untangle? _pd_phase_block_id has historical precedence and is well entrenched in its DDL1 form. So how about we have a new category pd_phase_block to which only _pd_phase_block.id belongs? This is a Loop category. Then PD_PHASE is a separate Set category to hold overall information about the phase.
I don't think that _pd_block_diffractogram.id has the same problems as it is already distinct from pd_diffractogram as I have proposed it above.
> Shouldn't we try and bake desired behaviour into the standard, and not require users to bypass it to do what they want to do?
Absolutely. The Default choice of Set categories is dictated by the history of the CIF core, reflecting the original decision as to which categories are looped and which are not. Most software written for the core dictionary thus expects a single space group, a single set of cell parameters, that the atomic positions are for a single compound, and that a single set of measurements was performed at a single wavelength and single set of environmental conditions.
As long as the other dictionaries (powder, magnetism, incommensurate, twinning, etc.) do not change that choice of Set categories, data names from all dictionaries can be freely mixed and software will correctly interpret the contents. An incorrect interpretation results if, for example, a single data block were to contain structural information from multiple phases, as an incorrect density would be calculated from the atom site list (not to mention bonds etc.).
Thus the Default layout is an overall standard for all of CIF, which allows us a baseline way to combine data using different CIF dictionaries together: think a powder diffraction dataset containing diffractograms from neutron and synchrotron sources from a multi-phase sample for which one phase is a composite structure composed of two components and another phase is a magnetic structure. I am confident that there is a way to record this using the Default choice of Set categories without ever having done it myself, and that generic software (e.g. structure visualisation) will have no problems working with the relevant data blocks.
If we would like to mandate a particular, different, set of Set categories for powder data, the tool at our disposal is _audit.schema. By defining a value for this that is not Default, and requiring any data blocks that do not follow Default to set that value, it would be possible to validate a given set of data blocks for conformance to "the pdCIF requirements". The pdCIF dictionary would still define Set categories to be compatible with Default, because Set only has meaning for the Default schema. Unfortunately this mandating would not be fully machine-readable, as we haven't developed in that direction (but could).
_pd_block_diffractogram_id and _pd_phase_block_id are already used in loops.
> We are forced to conclude that the use of _pd_phase.id and _pd_phase.block_id are distinct and they should not be in the same category. How best to untangle? _pd_phase_block_id has historical precedence and is well entrenched in its DDL1 form. So how about we have a new category pd_phase_block to which only _pd_phase_block.id belongs? This is a Loop category. Then PD_PHASE is a separate Set category to hold overall information about the phase.
Sorry to be short, but it's bed time.
From the DDL1 dictionary, _pd_phase_id and _pd_phase_block_id are already used in loops with respect to _pd_refln.phase_id in order to link reflections to phases, as in:
#\#CIF_1.1
data_diffpatt
_pd_block_id diffpatt_0
loop_
_pd_phase_id
_pd_phase_block_id
_pd_phase_mass_%
1 Al2O3_0 72.11(12)
2 Si_0 27.89(12)
loop_
_refln_index_h
_refln_index_k
_refln_index_l
_pd_refln_phase_id
_refln_d_spacing
0 0 6 1 2.165404
1 1 1 2 3.135621
#...
Here, the short length of the _pd_phase_id string is used to reduce visual clutter in the reflection loop (i.e. by _pd_refln_phase_id pointing to _pd_phase_id instead of _pd_phase_block_id). In refactoring this, there still needs to be a way of mimicking this behaviour.
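The behaviour to be preserved can be sketched in a few lines of Python. The row tuples below are hand-copied from the DDL1 example above rather than parsed, and `resolve_reflection_blocks` is a hypothetical helper name.

```python
# Rows of the phase loop: (_pd_phase_id, _pd_phase_block_id).
phase_loop = [("1", "Al2O3_0"), ("2", "Si_0")]

# Rows of the reflection loop: (h, k, l, _pd_refln_phase_id).
refln_loop = [(0, 0, 6, "1"), (1, 1, 1, "2")]


def resolve_reflection_blocks(phase_loop, refln_loop):
    """Map each reflection's short phase id to the block id of its phase."""
    block_for = dict(phase_loop)
    return [((h, k, l), block_for[pid]) for h, k, l, pid in refln_loop]


resolved = resolve_reflection_blocks(phase_loop, refln_loop)
```

The short id is only a local alias: all the linking information still lives in the block id it stands for.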
How's this as a summary/suggestion:
We make PD_PHASE a Set category, with a single member _pd_phase.id. This data name exists to provide a unique key identifying a block containing information about a phase. This provides the "minimum requirement for linking information between blocks in CIF". This is as proposed by James, above. Note that this means that DDL1 _pd_phase_id and DDLM _pd_phase.id are not interchangeable (unless you fall back to _audit.schema).
We create PD_PHASE_BLOCK as a Loop category, with members _pd_phase_block.id and _pd_phase_block.phase_id. _pd_phase_block.id retains its current definition and use. _pd_phase_block.phase_id is a replacement data name for the DDL1 _pd_phase_id, inasmuch as it allows you to list multiple phases belonging to a histogram and link them with reflections, PO corrections and other things. This is an extension to James' above suggestion. This retains the equivalence of DDL1 _pd_phase_block_id and DDLM _pd_phase_block.id (to the user, at least).
We create PD_DIFFRACTOGRAM as a Set category, with a single member _pd_diffractogram.id. This data name exists to provide a unique key identifying a block containing information about a diffraction pattern. This provides the "minimum requirement for linking information between blocks in CIF". This is as proposed by James, above.
We retain PD_BLOCK_DIFFRACTOGRAM* as a Loop category, with member _pd_block_diffractogram.id, and add _pd_block_diffractogram.diffractogram_id. _pd_block_diffractogram.id retains its current definition and use. _pd_block_diffractogram.diffractogram_id acts in the same way as _pd_phase_block.phase_id, inasmuch as it allows you to list multiple diffractograms in which a phase appears and link them with PO corrections and other things. This is an extension to James' above suggestion.
Next: What to do with _pd_phase.mass_percent and _pd_phase.name (which currently live in PD_PHASE as a Loop category)?
In my limited experience, _pd_phase.name** only seems to make sense as a single-valued data item providing a name in the block which gives the cell params and other "normal" details of a phase; it could probably stay in PD_PHASE as a Set category. Is there a way of easily searching the COD to see what is in there wrt phase.name?
_pd_phase.mass_percent probably really only makes sense to be used in the block containing the histogram in which various phases exist, and therefore, would be looped with _pd_phase_block.id (and/or _pd_phase_block.phase_id) to link the QPA to each phase. I don't think it sounds right to put it in PD_PHASE_BLOCK, as this is talking purely about the block, not the phase therein. Could we cheat a little and also create PD_PHASE_MASS and have _pd_phase_mass.percent? It retains the ability to change the "." to a "_" and be directly translatable...
I think this retains current capabilities and extends modern CIF linking to (at least a part of) pdCIF. @jamesrhester @briantoby ?
*Also, why is it PD_BLOCK_DIFFRACTOGRAM, and not PD_DIFFRACTOGRAM_BLOCK?
** _pd_phase.name should probably have contents of Text, and not Code.
> How's this as a summary/suggestion:
> We make PD_PHASE a Set category, with a single member _pd_phase.id. This data name exists to provide a unique key identifying a block containing information about a phase. This provides the "minimum requirement for linking information between blocks in CIF". This is as proposed by James, above. Note that this means that DDL1 _pd_phase_id and DDLM _pd_phase.id are not interchangeable (unless you fall back to _audit.schema).
Agreed that we have to do this.
> We create PD_PHASE_BLOCK as a Loop category, with members _pd_phase_block.id and _pd_phase_block.phase_id. _pd_phase_block.id retains its current definition and use. _pd_phase_block.phase_id is a replacement data name for the DDL1 _pd_phase_id, inasmuch as it allows you to list multiple phases belonging to a histogram and link them with reflections, PO corrections and other things. This is an extension to James' above suggestion. This retains the equivalence of DDL1 _pd_phase_block_id and DDLM _pd_phase_block.id (to the user, at least).
The way in which Set categories work is that every child data name of the Set category's key data name (_pd_phase.id) that is itself a key data name of its own category takes the value of the parent data name, and so doesn't need to be explicitly stated. This would then be true of _pd_phase_block.phase_id if it is a key data name in the loop. Of course, we want the opposite, that is, multiple values of ...phase_id. However, if _pd_phase_block.id is the only key data name then this suggestion would work, and given that no block ids would ever be repeated in the loop (so it is a key), this is fine.
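A small Python sketch of that inheritance rule, under stated assumptions: the block is a flat dictionary, `expand_child_keys` is a hypothetical helper, and treating _pd_refln.phase_id as a child key of _pd_phase.id is assumed here purely for illustration.

```python
def expand_child_keys(block, parent, child_keys):
    """Fill in omitted child key data names with the block's parent value.

    `block` maps data names to a single value; `child_keys` lists the key
    data names that the dictionary declares as children of `parent`.
    Values that are stated explicitly are left untouched.
    """
    expanded = dict(block)
    if parent in block:
        for child in child_keys:
            expanded.setdefault(child, block[parent])
    return expanded


block = {"_pd_phase.id": "Si", "_refln_index_h": "4"}
expanded = expand_child_keys(block, "_pd_phase.id", ["_pd_refln.phase_id"])
```

A data name such as _pd_phase_block.phase_id would simply be left out of `child_keys` if it is not a key of its loop, which is what leaves it free to take several values in one block.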
> We create PD_DIFFRACTOGRAM as a Set category, with a single member _pd_diffractogram.id. This data name exists to provide a unique key identifying a block containing information about a diffraction pattern. This provides the "minimum requirement for linking information between blocks in CIF". This is as proposed by James, above.
Yes, though note that it doesn't identify a block as such, it assigns a value to _pd_diffractogram.id which is the value for all child key data names of _pd_diffractogram.id in the same block.
> We retain PD_BLOCK_DIFFRACTOGRAM* as a Loop category, with member _pd_block_diffractogram.id, and add _pd_block_diffractogram.diffractogram_id. _pd_block_diffractogram.id retains its current definition and use. _pd_block_diffractogram.diffractogram_id acts in the same way as _pd_phase_block.phase_id, inasmuch as it allows you to list multiple diffractograms in which a phase appears and link them with PO corrections and other things. This is an extension to James' above suggestion.
Ok, so we do the same as for PD_PHASE_BLOCK, with PD_BLOCK_DIFFRACTOGRAM having single key data name _pd_block_diffractogram.id, then we can have multiple values of _pd_block_diffractogram.diffractogram_id.
Note in these solutions for PD_PHASE_BLOCK and PD_BLOCK_DIFFRACTOGRAM the phase/diffractogram identifiers are optional, which is correct, as the vital information to be conveyed is the block name, and the name of the phase/diffractogram can be found from the block that is pointed to if it is missing.
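The fallback for a missing identifier could look like the following Python sketch. It assumes blocks can be looked up by their block id; the lookup table and the `phase_ids` helper are hypothetical, while the block ids and phase ids come from the Ni/Si example.

```python
# Hypothetical index of blocks by _pd_block_id.
blocks_by_id = {
    "2003-02-04T18:02|NISI_phase1|B_H_Toby||": {"_pd_phase.id": "Ni"},
    "2003-02-04T18:02|NISI_phase2|B_H_Toby||": {"_pd_phase.id": "Si"},
}


def phase_ids(phase_block_loop, blocks_by_id):
    """Return one phase id per loop row, following the pointer when omitted."""
    resolved = []
    for block_id, phase_id in phase_block_loop:
        if phase_id is None:
            # Identifier not given in the loop: read it from the target block.
            phase_id = blocks_by_id[block_id]["_pd_phase.id"]
        resolved.append(phase_id)
    return resolved


# Rows of a PD_PHASE_BLOCK loop: (_pd_phase_block.id, _pd_phase_block.phase_id).
loop_rows = [
    ("2003-02-04T18:02|NISI_phase1|B_H_Toby||", "Ni"),  # id stated explicitly
    ("2003-02-04T18:02|NISI_phase2|B_H_Toby||", None),  # id omitted
]
ids = phase_ids(loop_rows, blocks_by_id)
```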
> Next: What to do with _pd_phase.mass_percent and _pd_phase.name (which currently live in PD_PHASE as a Loop category)?
> In my limited experience, _pd_phase.name** only seems to make sense as a single-valued data item providing a name in the block which gives the cell params and other "normal" details of a phase; it could probably stay in PD_PHASE as a Set category. Is there a way of easily searching the COD to see what is in there wrt phase.name?
Agree.
> _pd_phase.mass_percent probably really only makes sense to be used in the block containing the histogram in which various phases exist, and therefore, would be looped with _pd_phase_block.id (and/or _pd_phase_block.phase_id) to link the QPA to each phase. I don't think it sounds right to put it in PD_PHASE_BLOCK, as this is talking purely about the block, not the phase therein. Could we cheat a little and also create PD_PHASE_MASS and have _pd_phase_mass.percent? It retains the ability to change the "." to a "_" and be directly translatable...
_pd_phase.mass_percent depends on both the phase and the diffractogram, so it belongs to some category that has both phase and diffractogram as key data names (noting that many categories are currently missing their implicit _pd_diffractogram.id key data name as it hasn't yet been accepted). If there are no obvious candidates then I agree that a PD_PHASE_MASS category could be created. This is really a separate issue to the current pull request so perhaps best moved to a new issue.
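To illustrate why both identifiers are needed as keys, here is a small Python sketch in which each mass percentage is stored under a (diffractogram, phase) pair. The numbers and identifier strings are made up, and `phases_in` is a hypothetical helper.

```python
# (diffractogram_id, phase_id) -> mass percent; all values are invented.
mass_percent = {
    ("histogram_1", "phase_1"): 60.0,
    ("histogram_1", "phase_2"): 40.0,
    ("histogram_2", "phase_1"): 55.0,
    ("histogram_2", "phase_2"): 45.0,
}


def phases_in(diffractogram_id: str, table: dict) -> dict:
    """Return phase_id -> mass percent for a single diffractogram."""
    return {
        phase: percent
        for (diffractogram, phase), percent in table.items()
        if diffractogram == diffractogram_id
    }


histogram_1_total = sum(phases_in("histogram_1", mass_percent).values())
```

The same phase carries a different value in each diffractogram, so a table keyed on the phase alone could not hold this information.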
I think this retains current capabilities and extends modern CIF linking to (at least a part of) pdCIF. @jamesrhester @briantoby ?
.
* Also, why is it PD_BLOCK_DIFFRACTOGRAM, and not PD_DIFFRACTOGRAM_BLOCK?
** _pd_phase.name should probably have contents of Text, and not Code.
No idea on the historical naming choices, agree on Text not Code.
I think I've added all the required data items. No promises on correct definitions, linking, category_keys....
Feel free to ignore/revert.
.
One thing I just noticed, _pd_phase.id must be unique within the entire CIF. _pd_phase_id only had to be unique within the block it was used.
I think I've added all the required data items. No promises on correct definitions, linking, category_keys....
Feel free to ignore/revert.
I'll do some cosmetic editing but looks like a good start.
One thing I just noticed, _pd_phase.id must be unique within the entire CIF. _pd_phase_id only had to be unique within the block it was used.
If the same value of _pd_phase.id occurs somewhere else in the dataset (whether the data set is a single CIF file or a collection of files of varying formats), the contents of that block must be referring to the same phase. Picking up inconsistencies should be an easy task for validation.
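A rough Python sketch of such a consistency check, assuming blocks have already been parsed into dicts. The block names, phase names and the helper `find_phase_conflicts` are hypothetical; a real validator would compare every single-valued phase item rather than just the name.

```python
from collections import defaultdict

# Hypothetical parsed blocks: block name -> single-valued items.
blocks = {
    "overall": {"_pd_phase.id": "1", "_pd_phase.name": "nickel"},
    "phase_details": {"_pd_phase.id": "1", "_pd_phase.name": "nickel"},
    "stray": {"_pd_phase.id": "1", "_pd_phase.name": "silicon"},
}


def find_phase_conflicts(all_blocks: dict, item: str = "_pd_phase.name") -> dict:
    """Map each phase id to its values of `item` when blocks disagree."""
    seen = defaultdict(set)
    for block in all_blocks.values():
        phase_id = block.get("_pd_phase.id")
        if phase_id is not None and item in block:
            seen[phase_id].add(block[item])
    # Only phase ids with more than one distinct value are inconsistent.
    return {pid: values for pid, values in seen.items() if len(values) > 1}


conflicts = find_phase_conflicts(blocks)
```

In this made-up data the "stray" block reuses phase id "1" for a different phase, so the id is reported as inconsistent.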
I added a missing description for PD_PHASE_MASS. I think it looks OK now, @vaitkus, can you have a look?
Probably in another PR, _pd_phase_mass.diffractogram_id and _pd_phase_mass.phase_id need dREL to get their respective values from _pd_diffractogram.id and _pd_phase.id.
I think I'm happy with this. Do you want @vaitkus to review?
Let's wait a bit for @vaitkus, if he's busy we'll merge and rely on a formal release process later on for any further review.
Happy Winter holidays
😐
Either way, an additional set of eyes with the general knowledge of DDLm does not hurt either.
As you've just shown!
- Some category names were slightly incorrect which resulted in name duplication. I suggested fixes directly in the PR. Not sure why these issues were not detected by the automated check, probably due to a slightly outdated version of cif_ddlm_dic_check being used (easy fix in the future).
- The human-readable definitions of _pd_data.diffractogram_id, _pd_meas.diffractogram_id and _pd_proc.diffractogram_id all had the same text so I tried to tie it more to the categories that these items belong to. Changes suggested directly in the PR. Feel free to rephrase it in a more readable way.
I've committed these. And need to go back and change them for line length!
- The content type of the _pd_diffractogram.id and the types of all items that point to it should probably be changed from Code to Text. This is necessary because according to the human-readable definition of the _pd_diffractogram.id data item and the associated dREL code, the value of this item can be used interchangeably with _pd_block.id which is of the Text content type. Due to this we may run into some incompatibilities down the line.
I'll wait for comment from @jamesrhester on this. One thing of note, the description of _pd_block.id says "Blank spaces may also not be used. Capitalization may be used within the ID code but should not be considered significant - searches for data-set ID names should be case-insensitive": This lines up more with Code, than Text; should the description be updated, or the content type?
- The _definition.update date in the definitions of the _pd_block_diffractogram.id, PD_DATA, PD_CALC, PD_MEAS, PD_PROC should also be updated since these definitions were seemingly changed.
The change was that the appropriate _pd*.diffractogram_id was added to the category key. I've put in 2022-10-11, as that is the modified date for PD_DIFFRACTOGRAM.
Some of the human-readable descriptions now contain references to outdated data names that no longer exist. Specifically, the following changes should be made:
- save_PD_BLOCK: _pd_calib.std_external_block_id -> _pd_calib_std.external_block_id.
- save_PD_BLOCK: _pd_phase.block_id -> _pd_phase_block.id.
- save_pd_calc_component.block_id: _pd_phase.block_id -> _pd_phase_block.id.
I've changed these and found two other _pd_phase.block_id ones. _pd_calib_std.external_block_id will possibly change again.
Alternatively, some alias may have accidentally been left out from the new definitions.
I haven't checked for these.
I also have two questions that may require slightly more discussion:
- I would really appreciate it if an example file illustrating the usage of data relationships introduced by this PR could be added (e.g. under examples/). Since several examples were already presented in the discussion of this PR, maybe one of them could be turned into a proper CIF file? It is not necessary to update the GitHub check to automatically validate these example files at this point, but at least having a working example would be extremely useful for testing and the general understanding of the concept.
Easy one first: Yes, we should work up a proper example. I've also mentioned this in #47, but as more of a general one, as we need to wait for all the (current) additions (especially this one) to be merged properly before we can do that (large) task.
- If one of the intents is to be able to distribute the same dataset in various alternative forms (e.g. as a single CIF file, as multiple files of potentially different data formats) maybe it would also make sense to introduce some kind of a _pd_dataset.uuid data item that would allow to easily link these separate files? If I correctly understand the current approach, the _pd_phase.id data item should generally be used for this purpose, however, I am unsure how well it would work outside of a specific context (e.g. if we encounter the same id in different files outside of an archive file, how sure can we be that they indeed describe the same dataset?). Use of a UUID would greatly reduce the probability of an id clash. The same can probably also be said for the _pd_block.id value format since it uses a timestamp as one of its components, but it does not seem that currently this format is enforced in the descriptions of _pd_diffractogram.id and related items (and potentially this format is not even desired for these fields due to its complexity/lack of brevity).
Potentially heretical thing coming.
From what I've gathered, @briantoby introduced _pd_block_id as a way to link between data blocks, as CIF (at that time) had no way for different data blocks to talk to each other, as everything was essentially single-phase, single-pattern data blocks. Block IDs were used to identify blocks containing either phase or diffractogram, or phase and diffractogram, hence the need for _pd_phase_block_id and _pd_block_diffractogram_id to differentiate between blocks containing phase or diffractogram information.
We now have _pd_phase.id* and _pd_diffractogram.id. If the construction of these ID values is brought in line with the current suggestion for _pd_block.id (ie more "uuid-like"), do we even need block IDs? I'm not knowledgeable enough to make the call, but could the move to CIF_2.0 be used to deprecate block IDs? To make the dictionary easier to maintain, could we split off deprecated data items into their own dictionary (cif_pow_deprecated.dic ?), such that they are still able to be used, but new additions to the dictionary would use _pd_phase.id and _pd_diffractogram.id to link between phase and diffractogram**.
.
* Yes, _pd_phase_id existed previously, but it was only there to link between _pd_phase_block_id and _pd_refln.phase_id.
** We will need to make something like _pd_phase.short_id and _pd_diffractogram.short_id, which have block-scope (rather than being unique) to link between things like _pd_refln.phase_id, _pd_pref_orient_March_Dollase.diffractogram_id, etc. so that we don't have to have huge, UUID-like values everywhere. In this example, _pd_phase.short_id would be an alias for _pd_phase_id. There would need to be a way to loop _pd_phase.short_id with _pd_phase.id in a datablock to link the two together.
The content type of the _pd_diffractogram.id and the types of all items that point to it should probably be changed from Code to Text. This is necessary because according to the human-readable definition of the _pd_diffractogram.id data item and the associated dREL code, the value of this item can be used interchangeably with _pd_block.id which is of the Text content type. Due to this we may run into some incompatibilities down the line.
Yes, let's make it Text.
I agree that we should work up some full examples once we have finished a few of our other current ongoing changes. Meanwhile, hopefully the examples in this PR are sufficient to assess how the mechanism works.
If one of the intents is to be able to distribute the same dataset in various alternative forms (e.g. as a single CIF file, as multiple files of potentially different data formats) maybe it would also make sense to introduce some kind of a _pd_dataset.uuid data item that would allow to easily link these separate files?
As far as I can tell no such mechanism is guaranteed to be robust in all situations. For example, if calibration files are included in a data set, as they should, then they will belong to all datasets that have used the same calibrations. Therefore every data block associated with the calibrations will need to have a new UUID inserted, that is, must be changed. There was a long discussion about some sort of way of linking all data blocks in a dataset together here, from which I drew the conclusion that the only universal way is for the context to dictate aggregation. This does not invalidate the use of block_id or a future dataset.uuid, just recognises the limitations.
In any case, discussion of a dataset.uuid data item could be raised as a separate issue in the cif_core repository but shouldn't be a blocker for the current issue.
We now have _pd_phase.id* and _pd_diffractogram.id. If the construction of these ID values is brought in line with the current suggestion for _pd_block.id (ie more "uuid-like"), do we even need block IDs? I'm not knowledgeable enough to make the call, but could the move to CIF_2.0 be used to deprecate block IDs? To make the dictionary easier to maintain, could we split off deprecated data items into their own dictionary (cif_pow_deprecated.dic ?), such that they are still able to be used, but new additions to the dictionary would use _pd_phase.id and _pd_diffractogram.id to link between phase and diffractogram**.
The block_id data names can only be deprecated, not removed, in order to maintain backwards compatibility. That said, as blocks do not exist in the relational model, I see these data names as non-ideal, but I guess they are useful for validation in certain circumstances. My intuition is that this way of linking blocks will become increasingly unwieldy for dictionary authors and software authors as pdCIF expands. Anyway, the best place for a discussion is a separate issue, not here.
@vaitkus : if you're happy, I can merge.
Thank you for all the changes and for filing my questions as separate issues where relevant. It is indeed better to continue the discussions there.
I'll wait for comment from @jamesrhester on this. One thing of note, the description of _pd_block.id says "Blank spaces may also not be used. Capitalization may be used within the ID code but should not be considered significant - searches for data-set ID names should be case-insensitive": This lines up more with Code, than Text; should the description be updated, or the content type?
@rowlesmr, this is an extremely good point. Given the properties that you highlighted (case-insensitivity, no spaces) the Code content type becomes nearly mandatory to ensure the correct automated processing of _pd_block.id (e.g. the validator program looks at the content type of the data item and decides if case-sensitivity is important when resolving references or checking uniqueness). However, I see that changes have already been made to change everything to Text. Not sure if we should change _pd_block.id, _pd_diffractogram.id and the related data items to Code in this PR or in a separate one. @jamesrhester, what is your opinion on this?
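The effect of the content type on a uniqueness check can be sketched in Python as follows. This is an illustration only: `normalise` and `duplicate_ids` are hypothetical helpers, `str.casefold` is used here as a stand-in for whatever case-insensitive comparison a real validator applies, and the ID strings are examples.

```python
def normalise(value: str, content_type: str) -> str:
    """Fold case for Code values; leave Text values exactly as written."""
    return value.casefold() if content_type == "Code" else value


def duplicate_ids(values: list, content_type: str) -> list:
    """Return the values that repeat an earlier one under the given content type."""
    seen = set()
    duplicates = []
    for value in values:
        key = normalise(value, content_type)
        if key in seen:
            duplicates.append(value)
        seen.add(key)
    return duplicates


ids = ["NISI_overall", "nisi_overall", "NISI_phase_1"]
as_code = duplicate_ids(ids, "Code")
as_text = duplicate_ids(ids, "Text")
```

Treated as Code, the first two IDs clash; treated as Text, all three are distinct, which is the behaviour difference being discussed.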
I also noticed that the PR branch now contains the '#Untitled-4#' file which was accidentally introduced and later removed from the master branch. Should this file also be explicitly deleted in this PR or will syncing with the master branch/merging the PR automatically remove it?
I agree that we should work up some full examples once we have finished a few of our other current ongoing changes. Meanwhile, hopefully the examples in this PR are sufficient to assess how the mechanism works.
I will of course not push on this, but IMHO having a separate examples file that describes the ongoing discussion, even if incomplete, proved quite useful during the development of the TopoCif dictionary since it made immediately clear when changes broke something that they were not supposed to affect.
Anyways, I copied the example from the discussions to my machine as a separate file and with some minor changes (introduction of data_ elements) got it working. I tried validating it against the dictionary in the PR branch and ran into some issues. This might be the intended behaviour related to the new approach, but I just want to make sure it is:
Multiple data blocks (e.g. NISI_overall, NISI_phase_1, etc.) contain both the _pd_block_diffractogram_id data item and the _pd_block_id data item. Since the dictionary now formally links _pd_block_diffractogram_id to _pd_block.id, all the values of _pd_block_diffractogram_id must match the values of _pd_block.id. However, in multiple cases this is not true or even not possible when more than one _pd_block_diffractogram_id value is provided, as is done in the NISI_overall data block. Is there a problem with the example, the data item relation or is this simply a new schema/approach that allows PD files to disregard the formal relationship imposed by the _name.linked_item_id attribute?
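A rough Python sketch of the check that fails here, assuming a validator compares values within one data block and requires every value of the linked (child) item to occur among the values of the item it links to. The helper `unlinked_values` and the ID strings are hypothetical and are not taken from the example file.

```python
def unlinked_values(child_values: list, parent_values: list) -> list:
    """Return child values that do not occur among the parent values."""
    parents = set(parent_values)
    return [value for value in child_values if value not in parents]


# Hypothetical contents of a block like NISI_overall: one block id of its own,
# plus pointers to two other blocks that hold the diffractograms.
nisi_overall = {
    "_pd_block.id": ["overall_block"],
    "_pd_block_diffractogram.id": ["histogram_block_1", "histogram_block_2"],
}

problems = unlinked_values(
    nisi_overall["_pd_block_diffractogram.id"],
    nisi_overall["_pd_block.id"],
)
```

With a single _pd_block.id in the block, two different pointer values can never both satisfy the link, which is the situation described above.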
I approve the PR anyways in case these questions run the risk of going too much off topic again.
Multiple data blocks (e.g. NISI_overall, NISI_phase_1, etc.) contain both the _pd_block_diffractogram_id data item and the _pd_block_id data item. Since the dictionary now formally links _pd_block_diffractogram_id to _pd_block.id, all the values of _pd_block_diffractogram_id must match the values of _pd_block.id. However, in multiple cases this is not true or even not possible when more than one _pd_block_diffractogram_id value is provided, as is done in the NISI_overall data block. Is there a problem with the example, the data item relation or is this simply a new schema/approach that allows PD files to disregard the formal relationship imposed by the _name.linked_item_id attribute?
We don't want PD to be special and ignore the _name.linked_item_id attribute.
The intent here is to enforce the requirement that _pd_block_diffractogram.id be a valid _pd_block_id and that the data block(s) pointed to by _pd_block_diffractogram.id contains a diffractogram pertinent to the block containing the pointer.
My interpretation of what you're saying is that a block containing a _pd_block_diffractogram.id and a _pd_block_id must have, by definition, the same value associated with both data names.
I think the way to do what we want to do is to just have the purpose as Encode:
save_pd_block_diffractogram.id

    _definition.id                '_pd_block_diffractogram.id'
    _alias.definition_id          '_pd_block_diffractogram_id'
    _definition.update            2022-10-11
    _description.text
;
    A block ID code (see _pd_block.id) that identifies #...
;
    _name.category_id             pd_block_diffractogram
    _name.object_id               id
    _type.purpose                 Encode
    _type.source                  Assigned
    _type.container               Single
    _type.contents                Text

save_
Multiple data blocks (e.g. NISI_overall, NISI_phase_1, etc.) contain both the _pd_block_diffractogram_id data item and the _pd_block_id data item. Since the dictionary now formally links _pd_block_diffractogram_id to _pd_block.id, all the values of _pd_block_diffractogram_id must match the values of _pd_block.id. However, in multiple cases this is not true or even not possible when more than one _pd_block_diffractogram_id value is provided, as is done in the NISI_overall data block. Is there a problem with the example, the data item relation or is this simply a new schema/approach that allows PD files to disregard the formal relationship imposed by the _name.linked_item_id attribute?
I initially changed the dREL to properly reflect the idea that if _pd_diffractogram.id is missing then _pd_block.id can be used. But this will also not work, as dREL operates in a space where all data blocks have been merged and so the correct _pd_block.id to use must be determined using only the value of key data names in common. Not only are there none of these, but a single data block can have multiple _pd_block.id values. So I've just deleted the dREL. Thanks @vaitkus for picking this up.
Closes #21. The current PR adds the category and child data names of _pd_diffractogram.id. This has the effect of linking the tabulated diffraction scan data to the concept of a diffractogram. When loops containing per-diffractogram, per-something-else information (e.g. per diffractogram per phase) are created, adding child data names of _pd_diffractogram.id formally states that the information in those loops depends on a particular diffractogram. This is intended to be complementary to the block pointers used in pdCIF, and will work in situations where software is dictionary-aware but not pdCIF-aware.