Add pd_diffractogram category.

COMCIFS / Powder_Dictionary

CIF definitions for powder diffraction


Add pd_diffractogram category. #39

Closed jamesrhester closed 1 year ago

jamesrhester commented 1 year ago

Closes #21. The current PR adds the category and child data names of _pd_diffractogram.id. This has the effect of linking the tabulated diffraction scan data to the concept of a diffractogram. When loops containing per-diffractogram, per-something-else information (e.g. per diffractogram per phase) are created, adding child data names of _pd_diffractogram.id formally states that the information in those loops depends on a particular diffractogram.

This is intended to be complementary to the block pointers used in pdCIF, and will work in situations where software is dictionary-aware but not pdCIF-aware.

jamesrhester commented 1 year ago

I will create an example containing both block pointers and these pointers to demonstrate. The idea of this PR is to create the category, and then other PRs can add in child data names of _pd_diffractogram.id. I did pd_calc/meas/proc in this PR to show how it works.

rowlesmr commented 1 year ago

At the most basic level I can write:

_pd_phase.id
_pd_phase.block_id
1   the_first_phase
two   the_second_phase

and then refer to 1 or two in things like _pd_refln.phase_id.

Does this addition mean I can write:

_pd_diffractogram.id
_pd_block_diffractogram.id
1   the_first_histogram
two   the_second_histogram

and then refer to 1 or two in things like pd_pref_orient_sphericalharmonics.diffractogram_id?

jamesrhester commented 1 year ago

Yes, that is correct. Sorry I haven't put together an example yet, I'll do that now.

jamesrhester commented 1 year ago

OK, here is an example of the use of _pd_diffractogram.id. This is the same content as example 3.3.7.1 of IT Vol G. There are now 9 blocks: one that contains no phase or diffractogram dependent information; one each for the two diffractograms and two phases containing only information relevant to those diffractograms/phases; and 2x2=4 data blocks containing information specific to a combination of a particular phase and diffractogram.

Again, I am not saying information must be separated into blocks in this way, but this corresponds to the default in the absence of any other guidance.

You will note that _pd_diffractogram.id simply serves to identify the diffractogram to which the contents of the current block relate. There is no effort to point to a block that contains other information about the diffractogram. So the sum total of information about that diffractogram is just the collection of blocks with the same _pd_diffractogram.id, where the way in which the blocks are collected is not specified: it may be via block pointers, it may be by virtue of being in the same CIF file, or it may be by virtue of being in the same archive.

# Example adapted from Example 3.3.7.1 in IT Vol G
# First edition.
#
# Describing a mixture of Ni and Si powder collected
# on two different banks of a TOF machine.
#
# So there are two phases (Ni and Si) and two
# diffractograms.
#
# In addition to the block pointers linking phases
# to diffractograms, there are phase identifiers
# and diffractogram identifiers also provided to
# allow non-pdcif aware software to properly
# assemble multiple data blocks together.
#
#= First CIF block ==================================
data_NISI_overall

_pd_block_id  2003-02-04T18:02|NISI|B_H_Toby|Overall

# publication and sample preparation information 
# appears here (_publ_*, _journal_*, _pd_char_* 
#  _pd_prep_* items are omitted for brevity) 

# Overall powder R-factors

_pd_proc_ls_prof_wR_factor            0.0370
# (other _refine_ls_* items omitted for brevity)

# pointers to the phase blocks
loop_   _pd_phase_block_id
      2003-02-04T18:02|NISI_phase1|B_H_Toby||
      2003-02-04T18:02|NISI_phase2|B_H_Toby||
# pointers to the diffraction patterns
loop_   _pd_block_diffractogram_id
      2003-02-04T18:02|NISI_H_01|B_H_Toby|GPD
      2003-02-04T18:02|NISI_H_02|B_H_Toby|GPD

#= Second CIF block =================================
# Information for phase 1
data_NISI_phase_1

_pd_block_id 2003-02-04T18:02|NISI_phase1|B_H_Toby||

# Data sets for phase 1

loop_   _pd_block_diffractogram_id
  2003-02-04T18:02|NISI_H_01|B_H_Toby|GPD
  2003-02-04T18:02|NISI_H_02|B_H_Toby|GPD

# Any phase-specific information in this block relates
# to this phase.

_pd_phase.id                          Ni

_cell_length_a                        3.523433(29)
_cell_length_b                        3.523433
_cell_length_c                        3.523433
_cell_angle_alpha                     90.0
_cell_angle_beta                      90.0
_cell_angle_gamma                     90.0
_cell_volume                          43.74194
_symmetry_cell_setting                cubic
_symmetry_space_group_name_H-M        "F m 3 m"

loop_
_symmetry_equiv_pos_site_id 
_symmetry_equiv_pos_as_xyz
       1 +x,+y,+z           2 -x,-y,-z 
# (other symmetry operations omitted for brevity)

loop_
_atom_site_type_symbol
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
_atom_site_occupancy
_atom_site_thermal_displace_type
_atom_site_U_iso_or_equiv
_atom_site_symmetry_multiplicity
NI   0.0  0.0  0.0   1.0   Uiso 0.00435(10) 4

loop_
_atom_type_symbol
_atom_type_number_in_cell
      NI  4.0
# (_chemical_* & _geom_* items omitted for brevity)

#= Third CIF block ==================================
# Information for phase 2
data_NISI_phase_2

_pd_block_id 2003-02-04T18:02|NISI_phase2|B_H_Toby||

# Data sets for phase 2

loop_   _pd_block_diffractogram_id
  2003-02-04T18:02|NISI_H_01|B_H_Toby|GPD
  2003-02-04T18:02|NISI_H_02|B_H_Toby|GPD

# Any phase-specific information in this block relates
# to this phase.

_pd_phase.id                          Si

_pd_phase_name                        Silicon          
_cell_length_a                        5.42957(9)
_cell_length_b                        5.42957
_cell_length_c                        5.42957
_cell_angle_alpha                     90.0
_cell_angle_beta                      90.0
_cell_angle_gamma                     90.0
_cell_volume                          160.06508
_symmetry_cell_setting                cubic
_symmetry_space_group_name_H-M        "F d 3 m"

loop_
_symmetry_equiv_pos_site_id 
_symmetry_equiv_pos_as_xyz
       1 +x,+y,+z           2 -x,-y,-z 
# (other symmetry operations omitted for brevity)

loop_     
_atom_site_type_symbol
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
_atom_site_occupancy
_atom_site_thermal_displace_type
_atom_site_U_iso_or_equiv
_atom_site_symmetry_multiplicity
SI  0.125 0.125 0.125  1.0  Uiso  0.00540(21)  8

loop_
_atom_type_symbol
_atom_type_number_in_cell
     SI  8.0        
 # (_chemical_* & _geom_* items omitted for brevity) 

#= Fourth CIF block =================================
# Powder diffraction data for data set 1
data_NISI_p_01

_pd_block_id 2003-02-04T18:02|NISI_H_01|B_H_Toby|GPD

# (numerous _exptl_, _pd_*, _diffrn_ items describing 
# the data set are omitted for brevity)

# All diffractogram-specific information relates to
# this diffractogram id

_pd_diffractogram.id      TOF1

loop_
_atom_type_symbol
_atom_type_scat_length_neutron      
_atom_type_scat_source
  NI   1.0300   International_Tables_Vol_C
  SI   0.4149   International_Tables_Vol_C

_diffrn_radiation_probe               neutron
_pd_proc_ls_prof_wR_factor            0.0384
_pd_proc_ls_prof_wR_expected          0.0294
_refine_ls_R_Fsqd_factor              0.07288

_pd_proc_info_datetime         2003-02-04T18:02:09
_pd_calc_method                "Rietveld Refinement"
_pd_meas_2theta_fixed          148.29

#---- raw data loop -----

loop_
_pd_meas_time_of_flight      
_pd_meas_intensity_total
_pd_meas_point_id
      1000.0   1818(34)     626
# (4494 TOF & intensity values omitted for brevity)

_pd_meas_number_of_points             4495

#---- calculated data loop -----
loop_
_pd_proc_d_spacing
_pd_proc_intensity_total
_pd_proc_ls_weight
_pd_proc_intensity_bkg_calc
_pd_calc_intensity_total
_pd_proc_point_id
   0.50035  0.424(7)  19401.  0.3726  0.4155    1
# (1647 processed/calculated points omitted for 
# brevity)

_pd_proc_number_of_points             1648

_reflns_number_observed               60

# (_reflns_limit_* and _reflns_d_* items omitted for 
# brevity)

#= Fifth CIF block ==================================

# Powder diffraction data for data set 2
data_NISI_p_02

_pd_block_id 2003-02-04T18:02|NISI_H_02|B_H_Toby|GPD

# (numerous _exptl_, _pd_*, _diffrn_ items describing 
# the data set are omitted for brevity)
# All diffractogram-specific information relates to
# this diffractogram id

_pd_diffractogram.id      TOF2

loop_
_atom_type_symbol      
_atom_type_scat_length_neutron
_atom_type_scat_source
  NI   1.0300  International_Tables_Vol_C
  SI   0.4149  International_Tables_Vol_C

_diffrn_radiation_probe               neutron
_pd_proc_ls_prof_wR_factor            0.0363
_pd_proc_ls_prof_wR_expected          0.0222
_refine_ls_R_Fsqd_factor              0.07645
_pd_proc_info_datetime         2003-02-04T18:02:09
_pd_calc_method                "Rietveld Refinement"
_pd_meas_2theta_fixed          88.05

#---- raw data loop -----

loop_
_pd_meas_time_of_flight      
_pd_meas_intensity_total
_pd_meas_point_id
      750.4   2780(42)     470
# (4650 TOF & intensity values omitted for brevity)

_pd_meas_number_of_points             4651

#---- calculated data loop -----
loop_
_pd_proc_d_spacing
_pd_proc_intensity_total
_pd_proc_ls_weight
_pd_proc_intensity_bkg_calc
_pd_calc_intensity_total
_pd_proc_point_id
  0.45802  0.778(9)  12931.  0.4211  0.7851    1
# (1932 processed/calculated points omitted for 
# brevity)

_pd_proc_number_of_points             1933

#  (_reflns_limit_* and _reflns_d_* items omitted for 
# brevity)

#======Sixth CIF block (not present in original example)===#

# Per-phase, per diffractogram information
_pd_phase.id Ni
_pd_diffractogram.id TOF1

# phase table

_pd_phase_block_id     2003-02-04T18:02|NISI_phase1|B_H_Toby||  
_pd_phase_mass_%      51(49)

loop_
_refln_index_h
_refln_index_k
_refln_index_l
_refln_observed_status
_refln_F_squared_meas
_refln_F_squared_calc
_refln_phase_calc
_refln_d_spacing
   2   2   0  o  15.254  15.195   0.00    1.24572
# (54 reflections omitted for brevity)
   4   4   4  o   7.498   8.733   0.00    0.50856

#======Seventh CIF block (not present in original example)===#

# Per-phase, per diffractogram information

_pd_phase.id Si
_pd_diffractogram.id TOF1
_pd_phase_block_id 2003-02-04T18:02|NISI_phase2|B_H_Toby||
_pd_phase_mass_%     49(49)

loop_
_refln_index_h
_refln_index_k
_refln_index_l
_refln_observed_status
_refln_F_squared_meas
_refln_F_squared_calc
_refln_phase_calc
_refln_d_spacing
   4   0   0  o   9.773   9.812  180.00   1.35739
   3   3   1  o   4.799   4.801   0.00    1.24563
   9   5   3  o   2.350   2.396   0.00    0.50631
   8   6   4  o   0.000   0.000  180.00   0.50412

#=====Eighth CIF block=====#
# Per-phase, per diffractogram information

_pd_phase.id Ni
_pd_diffractogram.id TOF2

# phase table

_pd_phase_block_id  2003-02-04T18:02|NISI_phase1|B_H_Toby||
_pd_phase_mass_%    51.38

# reflection table
loop_
_refln_index_h
_refln_index_k
_refln_index_l
_refln_observed_status
_refln_F_squared_meas
_refln_F_squared_calc
_refln_phase_calc
_refln_d_spacing
   2   0   0  o  16.505  16.060   0.00    1.76172
   7   3   1  o   7.261   7.499   0.00    0.45871
   5   5   3  o   7.261   7.499   0.00    0.45871

#=====Ninth CIF block======#

# Per-phase, per diffractogram information
_pd_phase.id Si
_pd_diffractogram.id TOF2
_pd_phase_block_id 2003-02-04T18:02|NISI_phase2|B_H_Toby||
_pd_phase_mass_%   48.62(28)

# reflection table
loop_
_refln_index_h
_refln_index_k
_refln_index_l
_refln_observed_status
_refln_F_squared_meas
_refln_F_squared_calc
_refln_phase_calc
_refln_d_spacing
   3   1   1  o   4.854   5.087  180.00   1.63708
   2   2   2  o   0.000   0.000   0.00    1.56738
# (76 reflections omitted for brevity)
  11   3   3  o   1.948   2.014   0.00    0.46053
  10   6   2  o   0.000   0.000   0.00    0.45888

rowlesmr commented 1 year ago

Just to get things straight in my head, with regard to just one of the extra blocks:

#======Seventh CIF block (not present in original example)===#
# Per-phase, per diffractogram information
data_seventhblock #I'm assuming this is supposed to be here

_pd_phase.id Si
_pd_diffractogram.id TOF1
_pd_phase_block_id 2003-02-04T18:02|NISI_phase2|B_H_Toby||
_pd_phase_mass_%     49(49)

loop_
_refln_index_h
_refln_index_k
_refln_index_l
_refln_observed_status
_refln_F_squared_meas
_refln_F_squared_calc
_refln_phase_calc
_refln_d_spacing
   4   0   0  o   9.773   9.812  180.00   1.35739
   3   3   1  o   4.799   4.801   0.00    1.24563
   9   5   3  o   2.350   2.396   0.00    0.50631
   8   6   4  o   0.000   0.000  180.00   0.50412

This datablock contains information about the phase Si and diffractogram TOF1 because of the given values of _pd_phase.id and _pd_diffractogram.id. To get all the information about this phase and diffractogram, we need to go over every datablock in the container (file, folder, server...) and collate all the blocks that contain _pd_phase.id Si or _pd_diffractogram.id TOF1. This is the normal way in which information is shared between blocks in CIF.
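The collation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data blocks are represented as plain dictionaries (a hypothetical in-memory form, not the output of any particular CIF library), and the block names are placeholders apart from those taken from the example.

```python
from collections import defaultdict

# Hypothetical in-memory form: block name -> {data name: value}.
# Only the identifying data names are shown.
blocks = {
    "NISI_p_01": {"_pd_diffractogram.id": "TOF1"},
    "NISI_phase_2": {"_pd_phase.id": "Si"},
    "seventhblock": {"_pd_phase.id": "Si", "_pd_diffractogram.id": "TOF1"},
    "ninthblock": {"_pd_phase.id": "Si", "_pd_diffractogram.id": "TOF2"},
}


def collate(blocks, data_name):
    """Group block names by the value each block gives for data_name."""
    groups = defaultdict(list)
    for name, items in blocks.items():
        if data_name in items:
            groups[items[data_name]].append(name)
    return dict(groups)


by_phase = collate(blocks, "_pd_phase.id")
by_diffractogram = collate(blocks, "_pd_diffractogram.id")
```

Here `by_phase["Si"]` lists every block that says something about the Si phase, and `by_diffractogram["TOF1"]` does the same for the first diffractogram; no block pointers are needed for either lookup.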

The pdCIF way of linking things is through block ids. _pd_phase_block_id says that there is phase-specific information in here belonging to the datablock labelled with the block id 2003-02-04T18:02|NISI_phase2|B_H_Toby||. Should there not also be a _pd_block_diffractogram_id with the value 2003-02-04T18:02|NISI_H_01|B_H_Toby|GPD?

rowlesmr commented 1 year ago

Also, I think that the definition of _pd_phase.id now needs updating:

A code for each crystal phase used to link with _pd_refln.phase_id.

doesn't really cut it anymore.

rowlesmr commented 1 year ago

Also:

Currently, PD_PHASE is a Loop category and contains only _pd_phase.block_id and _pd_phase.id. It is a Loop category, as a single diffractogram can contain multiple phases, and that requires multiple _pd_phase.block_id values. You can also list multiple _pd_phase.id values along with those _pd_phase.block_id values.

The equivalent datanames for diffractograms are _pd_block_diffractogram.id and _pd_diffractogram.id which belong to PD_BLOCK_DIFFRACTOGRAM (Loop) and PD_DIFFRACTOGRAM (Set), respectively.

There is a disparity here.

If a phase appears in multiple diffractograms, should we not be able to loop _pd_diffractogram.id?

jamesrhester commented 1 year ago

This datablock contains information about the phase Si and diffractogram TOF1 because of the given values of _pd_phase.id and _pd_diffractogram.id. To get all the information about this phase and diffractogram, we need to go over every datablock in the container (file, folder, server...) and collate all the blocks that contain _pd_phase.id Si or _pd_diffractogram.id TOF1. This is the normal way in which information is shared between blocks in CIF.

Yes, this is correct. I might not say "normal way..." but more "minimum requirement for linking information between blocks in CIF".

The pdCIF way of linking things is through block ids. _pd_phase_block_id says that there is phase-specific information in here belonging to the datablock labelled with the block id 2003-02-04T18:02|NISI_phase2|B_H_Toby||. Should there not also be a _pd_block_diffractogram_id with the value 2003-02-04T18:02|NISI_H_01|B_H_Toby|GPD?

Yes, that is correct.
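For comparison, the block-pointer route can be sketched as follows. This is a minimal illustration, assuming data blocks held as plain dictionaries (a hypothetical representation); the function names are invented for the example and the block ids are those of the seventh block above.

```python
# Hypothetical in-memory form: block name -> {data name: value}.
blocks = {
    "NISI_phase_2": {"_pd_block_id": "2003-02-04T18:02|NISI_phase2|B_H_Toby||"},
    "NISI_p_01": {"_pd_block_id": "2003-02-04T18:02|NISI_H_01|B_H_Toby|GPD"},
    "seventhblock": {
        "_pd_phase_block_id": "2003-02-04T18:02|NISI_phase2|B_H_Toby||",
        "_pd_block_diffractogram_id": "2003-02-04T18:02|NISI_H_01|B_H_Toby|GPD",
    },
}


def index_by_block_id(blocks):
    """Map each _pd_block_id value to the name of the block that declares it."""
    return {
        items["_pd_block_id"]: name
        for name, items in blocks.items()
        if "_pd_block_id" in items
    }


def resolve(blocks, block_name, pointer_name):
    """Follow a block pointer; return the target block name, or None."""
    index = index_by_block_id(blocks)
    return index.get(blocks[block_name].get(pointer_name))


phase_block = resolve(blocks, "seventhblock", "_pd_phase_block_id")
pattern_block = resolve(blocks, "seventhblock", "_pd_block_diffractogram_id")
```

With both pointers present, the seventh block can be tied to its phase block and its diffractogram block without inspecting _pd_phase.id or _pd_diffractogram.id at all.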

Also:

Currently, PD_PHASE is a Loop category and contains only _pd_phase.block_id and _pd_phase.id. It is a Loop category, as a single diffractogram can contain multiple phases, and that requires multiple _pd_phase.block_id values. You can also list multiple _pd_phase.id values along with those _pd_phase.block_id values.

The equivalent datanames for diffractograms are _pd_block_diffractogram.id and _pd_diffractogram.id which belong to PD_BLOCK_DIFFRACTOGRAM (Loop) and PD_DIFFRACTOGRAM (Set), respectively.

There is a disparity here.

If a phase appears in multiple diffractograms, should we not be able to loop _pd_diffractogram.id?

OK, so the fundamental principle I'm trying to adhere to here is that the Default presentation is one phase or one diffractogram per data block, so PD_PHASE is a Set category and PD_DIFFRACTOGRAM is also a Set category.

In this presentation, there will be only one value of _pd_phase.block_id in a data block, so PD_PHASE should be a Set category. Any data block (e.g. a summary data block) that wants to tabulate the phases and list the data blocks where they can be found should simply set _audit.schema to something that is not Default, and then any category with a key data name can be looped in that data block regardless of Set or Loop type. The information listed would duplicate information that could be gained by going through the collection of data blocks if the other data blocks adhere to the Default presentation (and I'm not saying they have to).

So PD_PHASE should be Set.
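A rough sketch of the rule, as a check that software might run on one data block. It assumes blocks are plain dictionaries in which a looped data name holds a list of values, that a missing _audit.schema is treated as Default, and that "NotDefault" is only a placeholder for whatever non-Default value would be used; none of these representations come from the dictionary itself.

```python
# Key data names of the two Set categories under discussion.
SET_CATEGORY_KEYS = ("_pd_phase.id", "_pd_diffractogram.id")


def set_violations(block):
    """Return Set-category key data names that are looped under Default.

    A looped data name is represented here as a list of values
    (an assumption of this sketch).
    """
    if block.get("_audit.schema", "Default") != "Default":
        return []  # Set/Loop distinction is not enforced outside Default
    return [
        name
        for name in SET_CATEGORY_KEYS
        if isinstance(block.get(name), list) and len(block[name]) > 1
    ]


default_block = {"_pd_diffractogram.id": "dp1", "_pd_phase.id": ["ph1", "ph2"]}
summary_block = {
    "_audit.schema": "NotDefault",  # placeholder value
    "_pd_phase.id": ["ph1", "ph2"],
}
```

Under these assumptions `default_block` is flagged because it loops _pd_phase.id, while `summary_block` is accepted because it declares a non-Default schema.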

rowlesmr commented 1 year ago

OK, so the fundamental principle I'm trying to adhere to here is that the Default presentation is one phase or one diffractogram per data block, so PD_PHASE is a Set category and PD_DIFFRACTOGRAM is also a Set category.

In this presentation, there will be only one value of _pd_phase.block_id in a data block, so PD_PHASE should be a Set category. Any data block (e.g. a summary data block) that wants to tabulate the phases and list the data blocks where they can be found should simply set _audit.schema to something that is not Default, and then any category with a key data name can be looped in that data block regardless of Set or Loop type. The information listed would duplicate information that could be gained by going through the collection of data blocks if the other data blocks adhere to the Default presentation (and I'm not saying they have to).

So PD_PHASE should be Set.

I'm not sure I like that approach. Now I need to either have a bunch of individual data blocks (which is error prone to produce), or use non-standard (non-default) CIF (and trust that software knows what to do) in order to list (for example) the QPA of some diffractograms:

I'm only presenting one diffractogram per block, it's just that they reference multiple phases.

#two diffraction patterns of the same specimen, ala the PD beamline at the Aussietron.
data_histogram1

_pd_diffractogram.id dp1
_pd_block.id the_first_diffpat

loop_
_pd_phase.id
_pd_phase.block_id
_pd_phase.mass_percent
ph1 Ni 73
ph2 Si 20
ph3 Fe  7

loop_
_pd_meas.2theta_scan
_pd_meas.counts_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
5.00 1234 1345 12
5.02 1246 1346 13
#...

data_histogram2

_pd_diffractogram.id dp2
_pd_block.id the_second_diffpat

loop_
_pd_phase.id
_pd_phase.block_id
_pd_phase.mass_percent
ph1 Ni 71
ph2 Si 21
ph3 Fe  8

loop_
_pd_meas.2theta_scan
_pd_meas.counts_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
5.00 1243 1354 10
5.02 1264 1355 11
#...

versus

#two diffraction patterns of the same specimen, ala the PD beamline at the Aussietron.
data_histogram1

_pd_diffractogram.id dp1
_pd_block.id the_first_diffpat

loop_
_pd_meas.2theta_scan
_pd_meas.counts_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
5.00 1234 1345 12
5.02 1246 1346 13
#...

data_histogram2

_pd_diffractogram.id dp2
_pd_block.id the_second_diffpat

loop_
_pd_meas.2theta_scan
_pd_meas.counts_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
5.00 1243 1354 10
5.02 1264 1355 11
#...

data_ph1_dp1
_pd_phase.id               ph1
_pd_diffractogram.id       dp1
_pd_phase.block_id         Ni
_pd_block_diffractogram.id the_first_diffpat
_pd_phase.mass_percent     73

data_ph2_dp1
_pd_phase.id               ph2
_pd_diffractogram.id       dp1
_pd_phase.block_id         Si
_pd_block_diffractogram.id the_first_diffpat
_pd_phase.mass_percent     20

data_ph3_dp1
_pd_phase.id               ph3
_pd_diffractogram.id       dp1
_pd_phase.block_id         Fe
_pd_block_diffractogram.id the_first_diffpat
_pd_phase.mass_percent      7

data_ph1_dp2
_pd_phase.id               ph1
_pd_diffractogram.id       dp2
_pd_phase.block_id         Ni
_pd_block_diffractogram.id the_second_diffpat
_pd_phase.mass_percent     71

data_ph2_dp2
_pd_phase.id               ph2
_pd_diffractogram.id       dp2
_pd_phase.block_id         Si
_pd_block_diffractogram.id the_second_diffpat
_pd_phase.mass_percent     21

data_ph3_dp2
_pd_phase.id               ph3
_pd_diffractogram.id       dp2
_pd_phase.block_id         Fe
_pd_block_diffractogram.id the_second_diffpat
_pd_phase.mass_percent      8
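The two layouts carry the same information, and going from the first to the second is mechanical. The sketch below illustrates that conversion in Python, assuming a histogram block held as a plain dictionary with its phase loop stored under a hypothetical "phase_loop" entry, and using _pd_block_diffractogram.id (the name given earlier in the thread) for the pointer back to the diffractogram block.

```python
def split_phase_loop(block):
    """Expand a looped phase table into one block per (phase, diffractogram)."""
    dp_id = block["_pd_diffractogram.id"]
    per_pair = {}
    for phase_id, phase_block_id, mass_percent in block["phase_loop"]:
        per_pair[f"{phase_id}_{dp_id}"] = {
            "_pd_phase.id": phase_id,
            "_pd_diffractogram.id": dp_id,
            "_pd_phase.block_id": phase_block_id,
            # pointer back to the block holding the diffraction pattern
            "_pd_block_diffractogram.id": block["_pd_block.id"],
            "_pd_phase.mass_percent": mass_percent,
        }
    return per_pair


histogram1 = {
    "_pd_diffractogram.id": "dp1",
    "_pd_block.id": "the_first_diffpat",
    # rows of (_pd_phase.id, _pd_phase.block_id, _pd_phase.mass_percent)
    "phase_loop": [("ph1", "Ni", 73), ("ph2", "Si", 20), ("ph3", "Fe", 7)],
}

per_pair_blocks = split_phase_loop(histogram1)
```

That the conversion is this simple cuts both ways: software can generate the many small blocks automatically, but a person writing them by hand has many more places to make a mistake.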

jamesrhester commented 1 year ago

I'm not sure I like that approach. Now I need to either have a bunch of individual data blocks (which is error prone to produce), or use non-standard (non-default) CIF (and trust that software knows what to do) in order to list (for example) the QPA of some diffractograms:

I understand your distaste. This is why I'm assuming that the powder CIF community will specify their own preferred way of grouping results into data blocks so that pdCIF-conversant software will only have to understand that one particular way. If that way is not the "Default" schema then "Set" and "Loop" category types are not a limitation and you can loop "Set" category items as long as a category key has been assigned in the dictionary.

Note that PD_PHASE being a Set category is essentially pre-defined because cif_core puts only a single "phase" into a single data block, i.e. one structure (cell/atom sites/space group).

rowlesmr commented 1 year ago

But there still is only one diffraction pattern or phase in the data block, it just happens to want to reference many other phases or diffraction patterns, respectively. This even negates the original use of _pd_phase_id for linking a _pd_refln_peak_id with a _pd_block_id. @briantoby any input on this?

Shouldn't we try and bake desired behaviour into the standard, and not require users to bypass it to do what they want to do?

Could there be a _pd_phase.ids and _pd_diffractogram.ids specifically for looping in this type of thing? But that then introduces another level of indirection.

briantoby commented 1 year ago

Don’t understand the issue

On Nov 22, 2022, at 8:30 AM, Matthew Rowles wrote:

But there still is only one diffraction pattern or phase in the data block, it just happens to want to reference many other phases or diffraction patterns, respectively. This even negates the original use of _pd_phase_id for linking a _pd_refln_peak_id with a _pd_block_id. @briantoby any input on this?

jamesrhester commented 1 year ago

But there still is only one diffraction pattern or phase in the data block, it just happens to want to reference many other phases or diffraction patterns, respectively. This even negates the original use of _pd_phase_id for linking a _pd_refln_peak_id with a _pd_block_id. @briantoby any input on this?

So the issue as I understand it is that a diffraction pattern data block might want to point to data blocks containing all of the phases that are in this diffraction pattern, and a phase data block might want to point to all of the data blocks that contain diffractograms that include this phase. In either case restricting _pd_phase or _pd_diffractogram to be single-valued in those data blocks would make this impossible.

Could there be a _pd_phase.ids and _pd_diffractogram.ids specifically for looping in this type of thing? But that then introduces another level of indirection.

We are forced to conclude that the uses of _pd_phase.id and _pd_phase.block_id are distinct and they should not be in the same category. How best to untangle? _pd_phase_block_id has historical precedence and is well entrenched in its DDL1 form. So how about we have a new category pd_phase_block to which only _pd_phase_block.id belongs? This is a Loop category. Then PD_PHASE is a separate Set category to hold overall information about the phase.

I don't think that _pd_block_diffractogram.id has the same problems as it is already distinct from pd_diffractogram as I have proposed it above.

Shouldn't we try and bake desired behaviour into the standard, and not require users to bypass it to do what they want to do?

Absolutely. The Default choice of Set categories is dictated by the history of the CIF core, reflecting the original decision as to which categories are looped and which are not. Most software written for the core dictionary thus expects a single space group, a single set of cell parameters, that the atomic positions are for a single compound, and that a single set of measurements were performed at a single wavelength and single set of environmental conditions.

As long as the other dictionaries (powder, magnetism, incommensurate, twinning, etc.) do not change that choice of Set categories, data names from all dictionaries can be freely mixed and software will correctly interpret the contents. An incorrect interpretation results if, for example, a single data block were to contain structural information from multiple phases, as an incorrect density would be calculated from the atom site list (not to mention bonds etc.).
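To make the density point concrete, here is a small worked calculation using the Ni cell from the example above (4 Ni atoms, cell volume 43.74194 Å^3) and the usual relation: density = (sum of atom count times molar mass) / (Avogadro constant times cell volume). The atomic masses and the Avogadro constant are standard values supplied for this sketch, not taken from the thread, and the "mixed" case is a deliberately wrong block that lists the Si atoms alongside the Ni ones.

```python
AVOGADRO = 6.02214076e23  # 1/mol
ATOMIC_MASS = {"NI": 58.6934, "SI": 28.0855}  # g/mol, standard values


def density(atoms_in_cell, cell_volume_A3):
    """Density in g/cm^3 from atom counts per cell and cell volume in A^3."""
    mass_per_mole_of_cells = sum(
        ATOMIC_MASS[symbol] * count for symbol, count in atoms_in_cell.items()
    )
    volume_cm3 = cell_volume_A3 * 1e-24  # 1 A^3 = 1e-24 cm^3
    return mass_per_mole_of_cells / (AVOGADRO * volume_cm3)


# One phase per block: the Ni atom list with the Ni cell.
ni_only = density({"NI": 4.0}, 43.74194)

# Two phases wrongly merged into one block: both atom lists, one cell.
mixed = density({"NI": 4.0, "SI": 8.0}, 43.74194)
```

The single-phase block gives roughly 8.91 g/cm^3, as expected for nickel, while the merged block gives nearly twice that, which is the kind of silent error that generic software cannot detect.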

Thus the Default layout is an overall standard for all of CIF, which gives us a baseline way to combine data described by different CIF dictionaries: think of a powder diffraction dataset containing diffractograms from neutron and synchrotron sources from a multi-phase sample for which one phase is a composite structure composed of two components and another phase is a magnetic structure. Although I have never done it myself, I am confident that there is a way to record this using the Default choice of Set categories, and that generic software (e.g. structure visualisation) will have no problems working with the relevant data blocks.

If we would like to mandate a particular, different, set of Set categories for powder data, the tool at our disposal is _audit.schema. By defining a value for this that is not Default, and requiring any data blocks that do not follow Default to set that value, it would be possible to validate a given set of data blocks for conformance to "the pdCIF requirements". The pdCIF dictionary would still define Set categories to be compatible with Default, because Set only has meaning for the Default schema. Unfortunately this mandating would not be fully machine-readable, as we haven't developed in that direction (but could).

briantoby commented 1 year ago

_pd_block_diffractogram_id and _pd_phase_block_id are already used in loops.

rowlesmr commented 1 year ago

We are forced to conclude that the uses of _pd_phase.id and _pd_phase.block_id are distinct and they should not be in the same category. How best to untangle? _pd_phase_block_id has historical precedence and is well entrenched in its DDL1 form. So how about we have a new category pd_phase_block to which only _pd_phase_block.id belongs? This is a Loop category. Then PD_PHASE is a separate Set category to hold overall information about the phase.

Sorry to be short, but it's bed time.

From the DDL1 dictionary, _pd_phase_id and _pd_phase_block_id are already used in loops with respect to _pd_refln.phase_id in order to link reflections to phases, as in:

#\#CIF_1.1
data_diffpatt
_pd_block_id    diffpatt_0
loop_
    _pd_phase_id
    _pd_phase_block_id
    _pd_phase_mass_%
1   Al2O3_0 72.11(12)
2   Si_0    27.89(12)
loop_
    _refln_index_h
    _refln_index_k
    _refln_index_l
    _pd_refln_phase_id
    _refln_d_spacing
   0    0    6 1    2.165404
   1    1    1 2    3.135621
#...

Here, the short length of the _pd_phase_id string is used to reduce visual clutter in the reflection loop (ie by _pd_refln_phase_id pointing to _pd_phase_id instead of _pd_phase_block_id). In refactoring this, there still needs to be a way of mimicking this behaviour.
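In code terms the short id is just a lookup key. The following sketch uses the values from the DDL1 example above; the tuple layout is an assumption of the illustration, not a parser's output.

```python
# Rows of the phase loop: (_pd_phase_id, _pd_phase_block_id)
phases = [("1", "Al2O3_0"), ("2", "Si_0")]

# Rows of the reflection loop: (hkl, _pd_refln_phase_id, _refln_d_spacing)
reflections = [
    ((0, 0, 6), "1", 2.165404),
    ((1, 1, 1), "2", 3.135621),
]

# The short phase id maps to the long block id of the phase block.
block_for_phase = dict(phases)

# Replace each short phase id by the block id it stands for.
resolved = [
    (hkl, block_for_phase[phase_id], d_spacing)
    for hkl, phase_id, d_spacing in reflections
]
```

Any refactoring needs to keep an equivalent of `block_for_phase` available inside the diffractogram block, otherwise each reflection row would have to carry the full block id.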

rowlesmr commented 1 year ago

How's this as a summary/suggestion:

We make PD_PHASE a Set category, with a single member _pd_phase.id. This data name exists to provide a unique key identifying a block containing information about a phase. This provides the "minimum requirement for linking information between blocks in CIF". This is as proposed by James, above. Note that this means that DDL1 _pd_phase_id and DDLM _pd_phase.id are not interchangeable (unless you fall back to _audit.schema).
We create PD_PHASE_BLOCK as a Loop category, with members _pd_phase_block.id and _pd_phase_block.phase_id. _pd_phase_block.id retains its current definition and use. _pd_phase_block.phase_id is a replacement data name for the DDL1 _pd_phase_id, inasmuch as it allows you to list multiple phases belonging to a histogram and link them with reflections, PO corrections and other things. This is an extension to James' above suggestion. This retains the equivalence of DDL1 _pd_phase_block_id and DDLM _pd_phase_block.id (to the user, at least).
We create PD_DIFFRACTOGRAM as a Set category, with a single member _pd_diffractogram.id. This data name exists to provide a unique key identifying a block containing information about a diffraction pattern. This provides the "minimum requirement for linking information between blocks in CIF". This is as proposed by James, above.
We retain PD_BLOCK_DIFFRACTOGRAM* as a Loop category, with member _pd_block_diffractogram.id and add _pd_block_diffractogram.diffractogram_id. _pd_block_diffractogram.id retains its current definition and use. _pd_block_diffractogram.diffractogram_id acts in the same way as _pd_phase_block.phase_id, inasmuch as it allows you to list multiple diffractograms in which a phase appears and link them with PO corrections and other things. This is an extension to James' above suggestion.

Next: What to do with _pd_phase.mass_percent and _pd_phase.name (which currently live in PD_PHASE as a Loop category)?

In my limited experience, _pd_phase.name** only seems to make sense as a single-valued data item providing a name in the block which gives the cell params and other "normal" details of a phase; it could probably stay in PD_PHASE as a Set category. Is there a way of easily searching the COD to see what is in there wrt phase.name?

_pd_phase.mass_percent probably really only makes sense to be used in the block containing the histogram in which various phases exist, and therefore, would be looped with _pd_phase_block.id (and/or _pd_phase_block.phase_id) to link the QPA to each phase. I don't think it sounds right to put it in PD_PHASE_BLOCK, as this is talking purely about the block, not the phase therein. Could we cheat a little and also create PD_PHASE_MASS and have _pd_phase_mass.percent? It retains the ability to change the "." to a "_" and be directly translatable...

I think this retains current capabilities and extends modern CIF linking to (at least a part of) pdCIF. @jamesrhester @briantoby ?

*Also, why is it PD_BLOCK_DIFFRACTOGRAM, and not PD_DIFFRACTOGRAM_BLOCK?
** _pd_phase.name should probably have contents of Text, and not Code.

jamesrhester commented 1 year ago

How's this as a summary/suggestion:

We make PD_PHASE a Set category, with a single member _pd_phase.id. This data name exists to provide a unique key identifying a block containing information about a phase. This provides the "minimum requirement for linking information between blocks in CIF". This is as proposed by James, above. Note that this means that DDL1 _pd_phase_id and DDLM _pd_phase.id are not interchangeable (unless you fall back to _audit.schema).

Agreed that we have to do this.

We create PD_PHASE_BLOCK as a Loop category, with members _pd_phase_block.id and _pd_phase_block.phase_id. _pd_phase_block.id retains its current definition and use. _pd_phase_block.phase_id is a replacement data name for the DDL1 _pd_phase_id, inasmuch as it allows you to list multiple phases belonging to a histogram and link them with reflections, PO corrections and other things. This is an extension to James' above suggestion. This retains the equivalence of DDL1 _pd_phase_block_id and DDLM _pd_phase_block.id (to the user, at least).

The way Set categories work is that every child data name of the Set category's key (_pd_phase.id) that is also a key data name of its own category takes the value of the parent data name, and so doesn't need to be stated explicitly. This would then be true of _pd_phase_block.phase_id if it were a key data name in the loop. Of course, we want the opposite, that is, multiple values of ...phase_id. However, if _pd_phase_block.id is the only key data name then this suggestion would work, and given that no block ids would ever be repeated in the loop (so it is a key), this is fine.
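To make the Set-category behaviour concrete, here is a rough Python sketch (not dictionary software, just an illustration). It assumes a block is a plain dict, that loops are lists of row dicts, and that the caller supplies which loop column is the child key; the function name and the "hkl" column are made up for the example.

```python
def fill_implicit_keys(block, set_key, child_keys):
    """Copy `block`, giving each listed loop row the Set value for its key column."""
    parent_value = block[set_key]
    filled = dict(block)
    for loop_name, key_name in child_keys.items():
        # A value stated explicitly in a row is kept as written.
        filled[loop_name] = [
            {key_name: parent_value, **row} for row in block.get(loop_name, [])
        ]
    return filled


block = {
    "_pd_phase.id": "alumina",
    # Illustrative reflection rows; "hkl" is a placeholder column name.
    "pd_refln": [{"hkl": "0 1 2"}, {"hkl": "1 0 4"}],
}
filled = fill_implicit_keys(
    block, "_pd_phase.id", {"pd_refln": "_pd_refln.phase_id"}
)
print(filled["pd_refln"])
```

Every row picks up the block's single _pd_phase.id. A column that is not a key, such as _pd_phase_block.phase_id in the arrangement described above, would simply not be listed in child_keys, so it stays free to take several values.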

We create PD_DIFFRACTOGRAM as a Set category, with a single member _pd_diffractogram.id. This data name exists to provide a unique key identifying a block containing information about a diffraction pattern. This provides the "minimum requirement for linking information between blocks in CIF". This is as proposed by James, above.

Yes, though note that it doesn't identify a block as such, it assigns a value to _pd_diffractogram.id which is the value for all child key data names of _pd_diffractogram.id in the same block.

We retain PD_BLOCK_DIFFRACTOGRAM* as a Loop category, with member _pd_block_diffractogram.id and add _pd_block_diffractogram.diffractogram_id. _pd_block_diffractogram.id retains its current definition and use. _pd_block_diffractogram.diffractogram_id acts in the same way as _pd_phase_block.phase_id, inasmuch as it allows you to list multiple diffractograms in which a phase appears and link them with PO corrections and other things. This is an extension to James' above suggestion.

Ok, so we do the same as for PD_PHASE_BLOCK, with PD_BLOCK_DIFFRACTOGRAM having single key data name _pd_block_diffractogram.id, then we can have multiple values of _pd_block_diffractogram.diffractogram_id.

Note in these solutions for PD_PHASE_BLOCK and PD_BLOCK_DIFFRACTOGRAM the phase/diffractogram identifiers are optional, which is correct, as the vital information to be conveyed is the block name, and the name of the phase/diffractogram can be found from the block that is pointed to if it is missing.

Next: What to do with _pd_phase.mass_percent and _pd_phase.name (which currently live in PD_PHASE as a Loop category)?

In my limited experience, _pd_phase.name** only seems to make sense as a single-valued data item providing a name in the block which gives the cell params and other "normal" details of a phase; it could probably stay in PD_PHASE as a Set category. Is there a way of easily searching the COD to see what is in there wrt phase.name?

Agree.

_pd_phase.mass_percent probably really only makes sense to be used in the block containing the histogram in which various phases exist, and therefore, would be looped with _pd_phase_block.id (and/or _pd_phase_block.phase_id) to link the QPA to each phase. I don't think it sounds right to put it in PD_PHASE_BLOCK, as this is talking purely about the block, not the phase therein. Could we cheat a little and also create PD_PHASE_MASS and have _pd_phase_mass.percent? It retains the ability to change the "." to a "_" and be directly translatable...

_pd_phase.mass_percent depends on both the phase and the diffractogram, so it belongs to some category that has both phase and diffractogram as key data names (noting that many categories are currently missing their implicit _pd_diffractogram.id key data name as it hasn't yet been accepted). If there are no obvious candidates then I agree that a PD_PHASE_MASS category could be created. This is really a separate issue to the current pull request so perhaps best moved to a new issue.
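As a sketch of what "both phase and diffractogram as key data names" means in practice, the Python below indexes mass percentages by a (diffractogram, phase) pair. The _pd_phase_mass.* names are the ones floated in this thread and are used here as assumptions; the row values and the function name are invented for illustration.

```python
def index_by_composite_key(rows):
    """Index mass percentages by the (diffractogram, phase) pair that keys each row."""
    table = {}
    for row in rows:
        key = (
            row["_pd_phase_mass.diffractogram_id"],
            row["_pd_phase_mass.phase_id"],
        )
        if key in table:
            # Two rows with the same pair would violate the category key.
            raise ValueError(f"duplicate (diffractogram, phase) key: {key}")
        table[key] = float(row["_pd_phase_mass.percent"])
    return table


# Made-up values: two phases quantified in each of two diffractograms.
rows = [
    {"_pd_phase_mass.diffractogram_id": "histo_1", "_pd_phase_mass.phase_id": "phase_a", "_pd_phase_mass.percent": "60.0"},
    {"_pd_phase_mass.diffractogram_id": "histo_1", "_pd_phase_mass.phase_id": "phase_b", "_pd_phase_mass.percent": "40.0"},
    {"_pd_phase_mass.diffractogram_id": "histo_2", "_pd_phase_mass.phase_id": "phase_a", "_pd_phase_mass.percent": "55.0"},
    {"_pd_phase_mass.diffractogram_id": "histo_2", "_pd_phase_mass.phase_id": "phase_b", "_pd_phase_mass.percent": "45.0"},
]
mass = index_by_composite_key(rows)
print(mass[("histo_2", "phase_a")])
```

The same phase can carry a different percentage in each diffractogram, which is why neither identifier alone is enough to key the loop.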

I think this retains current capabilities and extends modern CIF linking to (at least a part of) pdCIF. @jamesrhester @briantoby ?

.

*Also, why is it PD_BLOCK_DIFFRACTOGRAM, and not PD_DIFFRACTOGRAM_BLOCK? ** _pd_phase.name should probably have contents of Text, and not Code.

No idea on the historical naming choices, agree on Text not Code.

rowlesmr commented 1 year ago

I think I've added all the required data items. No promises on correct definitions, linking, category_keys....

Feel free to ignore/revert.

One thing I just noticed, _pd_phase.id must be unique within the entire CIF. _pd_phase_id only had to be unique within the block it was used.

jamesrhester commented 1 year ago

I think I've added all the required data items. No promises on correct definitions, linking, category_keys....

Feel free to ignore/revert.

I'll do some cosmetic editing but looks like a good start.

One thing I just noticed, _pd_phase.id must be unique within the entire CIF. _pd_phase_id only had to be unique within the block it was used.

If the same value of _pd_phase.id occurs somewhere else in the dataset (whether the data set is a single CIF file or a collection of files of varying formats), the contents of that block must be referring to the same phase. Picking up inconsistencies should be an easy task for validation.
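A minimal sketch of the kind of validation check meant here, in Python. It assumes each data block has been reduced to a dict of single-valued items and compares _pd_phase.name as a stand-in for "the same phase"; a real validator would compare more than the name, and the block names and values below are invented.

```python
def find_phase_conflicts(blocks, item="_pd_phase.name"):
    """List (phase_id, first_block, clashing_block) where `item` disagrees."""
    seen = {}
    conflicts = []
    for block_name, block in blocks.items():
        phase_id = block.get("_pd_phase.id")
        value = block.get(item)
        if phase_id is None or value is None:
            continue  # nothing to compare in this block
        if phase_id in seen and seen[phase_id][1] != value:
            conflicts.append((phase_id, seen[phase_id][0], block_name))
        else:
            seen.setdefault(phase_id, (block_name, value))
    return conflicts


# Hypothetical blocks: "a" and "b" reuse an id but disagree on the phase name.
blocks = {
    "a": {"_pd_phase.id": "p1", "_pd_phase.name": "corundum"},
    "b": {"_pd_phase.id": "p1", "_pd_phase.name": "fluorite"},
    "c": {"_pd_phase.id": "p2", "_pd_phase.name": "fluorite"},
}
print(find_phase_conflicts(blocks))
```

The check works the same whether the blocks come from one CIF file or from several files gathered into one data set, since it only looks at the identifier values.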

rowlesmr commented 1 year ago

I added a missing description for PD_PHASE_MASS. I think it looks OK now, @vaitkus, can you have a look?

rowlesmr commented 1 year ago

Probably in another PR, _pd_phase_mass.diffractogram_id and _pd_phase_mass.phase_id need dREL to get their respective values from _pd_diffractogram.id and _pd_phase.id.

rowlesmr commented 1 year ago

I think I'm happy with this. Do you want @vaitkus to review?

jamesrhester commented 1 year ago

Let's wait a bit for @vaitkus, if he's busy we'll merge and rely on a formal release process later on for any further review.

rowlesmr commented 1 year ago

Happy Winter holidays

😐

Either way, an additional set of eyes with the general knowledge of DDLm does not hurt either.

As you've just shown!

Some category names were slightly incorrect which resulted in name duplication. I suggested fixes directly in the PR. Not sure why these issues were not detected by the automated check, probably due to a slightly outdated version of cif_ddlm_dic_check being used (easy fix in the future).

The human-readable definitions of _pd_data.diffractogram_id, _pd_meas.diffractogram_id and _pd_proc.diffractogram_id all had the same text so I tried to tie it more to the categories that these items belong to. Changes suggested directly in the PR. Feel free to rephrase it in a more readable way.

I've committed these. And need to go back and change them for line length!

The content type of the _pd_diffractogram.id and the types of all items that point to it should probably be changed from Code to Text. This is necessary because according to the human-readable definition of the _pd_diffractogram.id data item and the associated dREL code, the value of this item can be used interchangeably with _pd_block.id which is of the Text content type. Due to this we may run into some incompatibilities down the line.

I'll wait for comment from @jamesrhester on this. One thing of note, the description of _pd_block.id says "Blank spaces may also not be used. Capitalization may be used within the ID code but should not be considered significant - searches for data-set ID names should be case-insensitive": This lines up more with Code, than Text; should the description be updated, or the content type?

The _definition.update date in the definitions of the _pd_block_diffractogram.id, PD_DATA, PD_CALC, PD_MEAS, PD_PROC should also be updated since these definitions were seemingly changed.

The change was that the appropriate _pd*.diffractogram_id was added to the category key. I've put in 2022-10-11, as that is the modified date for PD_DIFFRACTOGRAM.

Some of the human-readable descriptions now contain references to outdated data names that no longer exist. Specifically, the following changes should be made:

save_PD_BLOCK: _pd_calib.std_external_block_id -> _pd_calib_std.external_block_id.

save_PD_BLOCK: _pd_phase.block_id -> _pd_phase_block.id.

save_pd_calc_component.block_id: _pd_phase.block_id -> _pd_phase_block.id.

I've changed these and found two other _pd_phase.block_id ones. _pd_calib_std.external_block_id will possibly change again.

Alternatively, some alias may have accidentally been left out from the new definitions.

I haven't checked for these.

rowlesmr commented 1 year ago

I also have two questions that may require slightly more discussion:

I would really appreciate it if an example file illustrating the usage of data relationships introduced by this PR could be added (e.g. under examples/). Since several examples were already presented in the discussion of this PR, maybe one of them could be turned into a proper CIF file? It is not necessary to update the GitHub check to automatically validate these example files at this point, but at least having a working example would be extremely useful for testing and the general understanding of the concept.

Easy one first: Yes, we should work up a proper example. I've also mentioned this in #47, but as a more general one, as we need to wait for all the (current) additions (especially this one) to be merged properly before we can do that (large) task.

If one of the intents is to be able to distribute the same dataset in various alternative forms (e.g. as a single CIF file, as multiple files of potentially different data formats) maybe it would also make sense to introduce some kind of a _pd_dataset.uuid data item that would allow to easily link these separate files? If I correctly understand the current approach, the _pd_phase.id data item should generally be used for this purpose, however, I am unsure how well it would work outside of a specific context (e.g. if we encounter the same id in different files outside of an archive file, how sure can we be that they indeed describe the same dataset?). Use of a UUID would greatly reduce the probability of an id clash. The same can probably also be said for the _pd_block.id value format since it uses a timestamp as one of its components, but it does not seem that currently this format is enforced in the descriptions of _pd_diffractogram.id and related items (and potentially this format is not even desired for these fields due to its complexity/lack of brevity).

Potentially heretical thing coming.

From what I've gathered, @briantoby introduced _pd_block_id as a way to link between data blocks, as CIF (at that time) had no way for different data blocks to talk to each other, as everything was essentially single-phase, single-pattern data blocks. Block IDs were used to identify blocks containing either phase or diffractogram, or phase and diffractogram, hence the need for _pd_phase_block_id and _pd_block_diffractogram_id to differentiate between blocks containing phase or diffractogram information.

We now have _pd_phase.id* and _pd_diffractogram.id. If the construction of these ID values is brought in line with the current suggestion for _pd_block.id (ie more "uuid-like"), do we even need block IDs? I'm not knowledgeable enough to make the call, but could the move to CIF_2.0 be used to deprecate block IDs? To make the dictionary easier to maintain, could we split off deprecated data items into their own dictionary (cif_pow_deprecated.dic ?), such that they are still able to be used, but new additions to the dictionary would use _pd_phase.id and _pd_diffractogram.id to link between phase and diffractogram**.

* Yes, _pd_phase_id existed previously, but it was only there to link between _pd_phase_block_id and _pd_refln.phase_id.

** we will need to make something like _pd_phase.short_id and _pd_diffractogram.short_id, which have block-scope (rather than being unique) to link between things like _pd_refln.phase_id, _pd_pref_orient_March_Dollase.diffractogram_id, etc. so that we don't have to have huge, UUID-like values everywhere. In this example, _pd_phase.short_id would be an alias for _pd_phase_id. There would need to be a way to loop _pd_phase.short_id with _pd_phase.id in a datablock to link the two together.
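To illustrate the short-ID idea, here is a small Python sketch. _pd_phase.short_id is only a suggestion in this comment, not an existing data name, and the long identifier strings and function name are invented; the sketch just shows a per-block loop pairing short and long IDs, and short references being expanded through it.

```python
def expand_short_ids(mapping_rows, references):
    """Replace block-scoped short IDs with the long, globally unique IDs."""
    lookup = {
        row["_pd_phase.short_id"]: row["_pd_phase.id"] for row in mapping_rows
    }
    return [lookup[ref] for ref in references]


# Hypothetical loop pairing short and long identifiers within one data block.
mapping_rows = [
    {"_pd_phase.short_id": "1", "_pd_phase.id": "phase-3f2c9a-long-unique-id"},
    {"_pd_phase.short_id": "two", "_pd_phase.id": "phase-b71e04-long-unique-id"},
]
# Short references as they might appear in, e.g., a _pd_refln.phase_id column.
refln_phase_ids = ["1", "1", "two"]
print(expand_short_ids(mapping_rows, refln_phase_ids))
```

Because the lookup is built per block, the same short value could map to a different long ID in another block without any clash.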

jamesrhester commented 1 year ago

The content type of the _pd_diffractogram.id and the types of all items that point to it should probably be changed from Code to Text. This is necessary because according to the human-readable definition of the _pd_diffractogram.id data item and the associated dREL code, the value of this item can be used interchangeably with _pd_block.id which is of the Text content type. Due to this we may run into some incompatibilities down the line.

Yes, let's make it Text.

I agree that we should work up some full examples once we have finished a few of our other current ongoing changes. Meanwhile, hopefully the examples in this PR are sufficient to assess how the mechanism works.

If one of the intents is to be able to distribute the same dataset in various alternative forms (e.g. as a single CIF file, as multiple files of potentially different data formats) maybe it would also make sense to introduce some kind of a _pd_dataset.uuid data item that would allow to easily link these separate files?

As far as I can tell no such mechanism is guaranteed to be robust in all situations. For example, if calibration files are included in a data set, as they should be, then they will belong to all datasets that have used the same calibrations. Therefore every data block associated with the calibrations will need to have a new UUID inserted, that is, must be changed. There was a long discussion about some sort of way of linking all data blocks in a dataset together here, from which I drew the conclusion that the only universal way is for the context to dictate aggregation. This does not invalidate the use of block_id or a future dataset.uuid, just recognises the limitations.

In any case, discussion of a dataset.uuid data item could be raised as a separate issue in the cif_core repository but shouldn't be a blocker for the current issue.

We now have _pd_phase.id* and _pd_diffractogram.id. If the construction of these ID values is brought in line with the current suggestion for _pd_block.id (ie more "uuid-like"), do we even need block IDs? I'm not knowledgeable enough to make the call, but could the move to CIF_2.0 be used to deprecate block IDs? To make the dictionary easier to maintain, could we split off deprecated data items into their own dictionary (cif_pow_deprecated.dic ?), such that they are still able to be used, but new additions to the dictionary would use _pd_phase.id and _pd_diffractogram.id to link between phase and diffractogram**.

The block_id data names can only be deprecated, not removed, in order to maintain backwards compatibility. That said, as blocks do not exist in the relational model, I see these data names as non-ideal but in certain circumstances I guess are useful for validation. My intuition is that this way of linking blocks will become increasingly unwieldy for dictionary authors and software authors as pdCIF expands. Anyway, the best place for a discussion is a separate issue, not here.

rowlesmr commented 1 year ago

@vaitkus : if you're happy, I can merge.

vaitkus commented 1 year ago

Thank you for all the changes and for filing my questions as separate issues where relevant. It is indeed better to continue the discussions there.

I'll wait for comment from @jamesrhester on this. One thing of note, the description of _pd_block.id says "Blank spaces may also not be used. Capitalization may be used within the ID code but should not be considered significant - searches for data-set ID names should be case-insensitive": This lines up more with Code, than Text; should the description be updated, or the content type?

@rowlesmr , this is an extremely good point. Given the properties that you highlighted (case-insensitivity, no spaces) the Code content type becomes nearly mandatory to ensure the correct automated processing of _pd_block.id (e.g. the validator program looks at the content type of the data item and decides if case-sensitivity is important when resolving references or checking uniqueness). However, I see that changes have already been made to change everything to Text. Not sure if we should change _pd_block.id, _pd_diffractogram.id and the related data items to Code in this PR or in a separate one. @jamesrhester , what is your opinion on this?
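A rough Python sketch of why the content type matters to a validator. It assumes only what is said above: Code values are compared without regard to case, Text values exactly. The function names and the identifier strings are made up, and real validators do more than this.

```python
def normalise(value, content_type):
    """Fold case for Code values; leave Text values untouched."""
    return value.casefold() if content_type == "Code" else value


def unresolved_references(references, targets, content_type):
    """Return the references that match no target under the given content type."""
    known = {normalise(t, content_type) for t in targets}
    return [r for r in references if normalise(r, content_type) not in known]


# Invented identifiers that differ only in capitalisation.
targets = ["nisi_phase_1"]
references = ["NISI_Phase_1"]
print(unresolved_references(references, targets, "Code"))
print(unresolved_references(references, targets, "Text"))
```

Under Code the reference resolves; under Text the same pair is reported as a dangling pointer, which is the sort of incompatibility that changing the content type could introduce.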

I also noticed that the PR branch now contains the '#Untitled-4#' file, which was accidentally introduced and later removed from the master branch. Should this file also be explicitly deleted in this PR or will syncing with the master branch/merging the PR automatically remove it?

I agree that we should work up some full examples once we have finished a few of our other current ongoing changes. Meanwhile, hopefully the examples in this PR are sufficient to assess how the mechanism works.

I will of course not push on this, but IMHO having a separate examples file that describes the ongoing discussion, even if incomplete, proved quite useful during the development of the TopoCif dictionary since it made immediately clear when changes broke something that they were not supposed to affect.

Anyways, I copied the example from the discussions to my machine as a separate file and with some minor changes (introduction of data_ elements) got it working. I tried validating it against the dictionary in the PR branch and ran into some issues. This might be the intended behaviour related to the new approach, but I just want to make sure it is:

Multiple data blocks (e.g. NISI_overall, NISI_phase_1, etc.) contain both the _pd_block_diffractogram_id data item and the _pd_block_id data item. Since the dictionary now formally links _pd_block_diffractogram_id to _pd_block.id, all the values of _pd_block_diffractogram_id must match the values of _pd_block.id. However, in multiple cases this is not true or even not possible when more than one _pd_block_diffractogram_id value is provided, as is done in the NISI_overall data block. Is there a problem with the example, the data item relation, or is this simply a new schema/approach that allows PD files to disregard the formal relationship imposed by the _name.linked_item_id attribute?
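The failure can be reproduced with a few lines of Python. This is only a sketch of a linked-item check, assuming a block is a dict mapping each data name to its list of values; the block contents are invented stand-ins, not the actual values from the example file.

```python
def unmatched_links(block, child, parent):
    """Return child values that do not appear among the parent's values."""
    parent_values = set(block.get(parent, []))
    return [v for v in block.get(child, []) if v not in parent_values]


# Stand-in for an "overall" block: one block id, pointers to two diffractograms.
overall = {
    "_pd_block.id": ["overall_block"],
    "_pd_block_diffractogram.id": ["histogram_1_block", "histogram_2_block"],
}
print(unmatched_links(overall, "_pd_block_diffractogram.id", "_pd_block.id"))
```

With a single _pd_block.id in the block, any pointer to a different block is reported, and two distinct pointers can never both match it.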

I approve the PR anyways in case these questions run the risk of going too much off topic again.

rowlesmr commented 1 year ago

Multiple data blocks (e.g. NISI_overall, NISI_phase_1, etc.) contain both the _pd_block_diffractogram_id data item and the _pd_block_id data item. Since the dictionary now formally links _pd_block_diffractogram_id to _pd_block.id, all the values of _pd_block_diffractogram_id must match the values of _pd_block.id. However, in multiple cases this is not true or even not possible when more than one _pd_block_diffractogram_id value is provided, as is done in the NISI_overall data block. Is there a problem with the example, the data item relation, or is this simply a new schema/approach that allows PD files to disregard the formal relationship imposed by the _name.linked_item_id attribute?

We don't want PD to be special and ignore the _name.linked_item_id attribute.

The intent here is to enforce the requirement that _pd_block_diffractogram.id be a valid _pd_block_id and that the data block(s) pointed to by _pd_block_diffractogram.id contain a diffractogram pertinent to the block containing the pointer.

My interpretation of what you're saying is that a block containing a _pd_block_diffractogram.id and a _pd_block_id must have, by definition, the same value associated with both data names.

I think the way to do what we want to do is to just have the purpose as Encode:

save_pd_block_diffractogram.id

    _definition.id                '_pd_block_diffractogram.id'
    _alias.definition_id          '_pd_block_diffractogram_id'
    _definition.update            2022-10-11
    _description.text
;
    A block ID code (see _pd_block.id) that identifies #...
;
    _name.category_id             pd_block_diffractogram
    _name.object_id               id
    _type.purpose                 Encode
    _type.source                  Assigned
    _type.container               Single
    _type.contents                Text

save_

jamesrhester commented 1 year ago

Multiple data blocks (e.g. NISI_overall, NISI_phase_1, etc.) contain both the _pd_block_diffractogram_id data item and the _pd_block_id data item. Since the dictionary now formally links _pd_block_diffractogram_id to _pd_block.id, all the values of _pd_block_diffractogram_id must match the values of _pd_block.id. However, in multiple cases this is not true or even not possible when more than one _pd_block_diffractogram_id value is provided, as is done in the NISI_overall data block. Is there a problem with the example, the data item relation, or is this simply a new schema/approach that allows PD files to disregard the formal relationship imposed by the _name.linked_item_id attribute?

I initially changed the dREL to properly reflect the idea that if _pd_diffractogram.id is missing then _pd_block.id can be used. But this will also not work, as dREL operates in a space where all data blocks have been merged and so the correct _pd_block.id to use must be determined using only the value of key data names in common. Not only are there none of these, but a single data block can have multiple _pd_block.id values. So I've just deleted the dREL. Thanks @vaitkus for picking this up.