Closed jamesrhester closed 1 year ago
This one is my fault. I've been thinking recently about plotting pd data from CIF, and what would be good things to be able to see.
My initial idea of a solution to document the contribution from each phase is something like:
data_diffraction_pattern_info
loop_
_pd_phase_id
_pd_phase_block_id
1 long_unique_string_1
2 long_unique_string_2
3 long_unique_string_3
loop_
_pd_data_point_id
_pd_meas_2theta_scan
_pd_calc_intensity_net
1 5.00 0
2 5.02 6
…
loop_
_pd_data_point_id
_pd_phase_id
_pd_calc_phase_intensity_net
1 1 0
2 1 3
…
loop_
_pd_data_point_id
_pd_phase_id
_pd_calc_phase_intensity_net
1 2 0
2 2 1
…
loop_
_pd_data_point_id
_pd_phase_id
_pd_calc_phase_intensity_net
1 3 0
2 3 2
…
I think this is a good argument for the single-block CIF with _pd_phase.id. This would allow expansion by adding a new column for each phase rather than a new loop. In fact, the above is invalid unless each loop is put in a separate block, since each loop overwrites the previous data names.
Yes, @rowlesmr's suggestion cannot work because you may not duplicate data names within a block. If each of the loops over _pd_data_point_id were in separate data blocks, and each data block had a value of _pd_phase_id within it, then it would work. It sort of looks like that was the original intention, as there were block pointers at the top of the example.
Yeah, just noticed that. Multiple instances of a data name in a single block result in issues.
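To make the "no duplicate data names within a block" rule concrete, here is a minimal Python sketch that flags repeated data names in the text of one data block. The helper name `duplicate_data_names` is hypothetical, and the tokenizer is deliberately naive: it assumes no quoted strings or multi-line text fields contain `#` or tokens starting with `_`.

```python
from collections import Counter


def duplicate_data_names(block_text: str) -> list[str]:
    """Return the data names that occur more than once in one data block."""
    names = [
        token.lower()  # CIF data names are case-insensitive
        for line in block_text.splitlines()
        for token in line.split("#", 1)[0].split()  # drop trailing comments
        if token.startswith("_")
    ]
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Two of the per-phase loops from the first example, placed in a single block.
block = """
loop_
_pd_data_point_id
_pd_phase_id
_pd_calc_phase_intensity_net
1 1 0
2 1 3
loop_
_pd_data_point_id
_pd_phase_id
_pd_calc_phase_intensity_net
1 2 0
2 2 1
"""

print(duplicate_data_names(block))
```

All three data names are reported, which is why the loops either need separate data blocks or need to be merged into one loop.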
A modification of my example would be something like below. Each crystalline phase belongs to only one diffraction pattern, and therefore has a unique profile. Each diffraction pattern has many phases. I think everything knows about everything else.
data_overall_insitu_experiment
# Many experimental patterns
# Each experimental pattern collected at a different temperature, pressure, and/or time, but on the same instrument
# Each experimental pattern has many phases
# Each phase has only one experimental pattern
# Each phase has only one calculated profile
# Experiment probably done to report quantitative phase analysis
# insert common information here
loop_
_pd_phase_block_id
phase_1_pattern_1_unique_string
phase_2_pattern_1_unique_string
#...
loop_
_pd_block_diffractogram_id
pattern_1_unique_string
pattern_2_unique_string
#...
data_phase_1_pattern_1
_pd_block_id phase_1_pattern_1_unique_string
_pd_block_diffractogram_id pattern_1_unique_string
# crystal structure information would go here
loop_
_pd_data_point_id
_pd_calc_phase_intensity_net
1 0
2 3
#...
data_phase_2_pattern_1
_pd_block_id phase_2_pattern_1_unique_string
_pd_block_diffractogram_id pattern_1_unique_string
# crystal structure information would go here
loop_
_pd_data_point_id
_pd_calc_phase_intensity_net
1 0
2 1
#...
data_pattern_1
_pd_block_id pattern_1_unique_string
loop_
_pd_phase_id
_pd_phase_block_id
_pd_phase_mass_%
1 phase_1_pattern_1_unique_string 45.5
2 phase_2_pattern_1_unique_string 54.5
#time, temperature, pressure, other information
#hkl info goes here, too, probably.
loop_
_pd_data_point_id
_pd_meas_2theta_scan
_pd_meas_intensity_total
_pd_proc_ls_weight
_pd_calc_intensity_total
_pd_proc_intensity_bkg_calc
1 5.00 43.364 0.040297 25.962 25.962
2 5.01 38.007 0.050546 26.168 26.168
#...
# etc....
A more complicated example (taken from NISI.cif) is where each phase has multiple experimental patterns, and each pattern has multiple phases.
In this one:
The crystal structures know about their diffraction patterns through _pd_block_diffractogram_id.
The crystal structures know about their individual profiles through _pd_phase_block_id (is that the correct way to do it?).
The crystal structures don't know about each other.
The individual profiles know about their crystal structure through _pd_phase_block_id (is that the correct way to do it?).
The individual profiles of a crystal structure don't know about each other.
The diffraction patterns know about the crystal structures through _pd_phase_block_id.
The diffraction patterns have no knowledge of the individual phase profiles (should they?).
data_overall_structure_determination
# Many experimental patterns, each collected at the same temperature, pressure, and/or time, but on different instruments
# Each experimental pattern has many phases
# Each phase has many experimental patterns
# Each phase has many calculated profiles
# Experiment probably done to report crystal structure
# insert common information here
loop_
_pd_phase_block_id
phase_1_unique_string
phase_2_unique_string
loop_
_pd_block_diffractogram_id
xray_pattern_unique_string
cw_neutron_pattern_unique_string
data_phase_1
_pd_block_id phase_1_unique_string
loop_
_pd_block_diffractogram_id
xray_pattern_unique_string
cw_neutron_pattern_unique_string
loop_
_pd_phase_block_id
phase_1_xray_unique_string
phase_1_cw_unique_string
#crystal structure information
data_phase_1_xray
_pd_block_id phase_1_xray_unique_string
_pd_phase_block_id phase_1_unique_string
_pd_block_diffractogram_id xray_pattern_unique_string
loop_
_pd_data_point_id
_pd_calc_phase_intensity_net
1 0
2 1
#...
data_phase_1_cw
_pd_block_id phase_1_cw_unique_string
_pd_phase_block_id phase_1_unique_string
_pd_block_diffractogram_id cw_neutron_pattern_unique_string
loop_
_pd_data_point_id
_pd_calc_phase_intensity_net
1 0
2 3
#...
data_phase_2
# blah
data_phase_2_xray
# blah
data_phase_2_cw
# blah
data_xray_pattern
_pd_block_id xray_pattern_unique_string
_diffrn_radiation_wavelength 0.897654
loop_
_pd_phase_id
_pd_phase_block_id
1 phase_1_unique_string
2 phase_2_unique_string
loop_
_pd_data_point_id
_pd_meas_2theta_scan
_pd_meas_intensity_total
_pd_proc_ls_weight
_pd_calc_intensity_total
_pd_proc_intensity_bkg_calc
1 5.00 43.364 0.040297 25.962 25.962
2 5.01 38.007 0.050546 26.168 26.168
#...
loop_
_refln_index_h
_refln_index_k
_refln_index_l
_pd_refln_phase_id
_refln_observed_status
_refln_F_squared_meas
_refln_F_squared_calc
_refln_d_spacing
2 0 0 1 o 16.505 16.060 1.76172
3 1 1 2 o 4.854 5.087 1.63708
2 2 2 2 o 0.000 0.000 1.56738
4 0 0 2 o 10.301 9.812 1.35739
2 2 0 1 o 15.566 15.195 1.24572
#...
data_cw_pattern
_pd_block_id cw_neutron_pattern_unique_string
_diffrn_radiation_wavelength 1.987
loop_
_pd_phase_id
_pd_phase_block_id
1 phase_1_unique_string
2 phase_2_unique_string
loop_
_pd_data_point_id
_pd_meas_2theta_scan
_pd_meas_intensity_total
_pd_proc_ls_weight
_pd_calc_intensity_total
_pd_proc_intensity_bkg_calc
1 10.00 43.364 0.040297 25.962 25.962
2 10.10 38.007 0.050546 26.168 26.168
#...
loop_
_refln_index_h
_refln_index_k
_refln_index_l
_pd_refln_phase_id
_refln_observed_status
_refln_F_squared_meas
_refln_F_squared_calc
_refln_d_spacing
4 0 0 2 o 9.773 9.812 1.35739
3 3 1 2 o 4.799 4.801 1.24563
2 2 0 1 o 15.254 15.195 1.24572
#...
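The "who knows about whom" questions above come down to following `_pd_block_id` pointers. Below is a rough Python sketch, assuming the blocks have already been parsed into plain dictionaries (the `blocks` structure and all function names are hypothetical, not part of any CIF library). It also shows that a diffraction pattern can find its per-phase profiles by a reverse lookup, even when it carries no pointer to them itself.

```python
# Hypothetical, already-parsed blocks: block name -> {data name: value or list}.
blocks = {
    "phase_1": {
        "_pd_block_id": "phase_1_unique_string",
        "_pd_block_diffractogram_id": [
            "xray_pattern_unique_string",
            "cw_neutron_pattern_unique_string",
        ],
    },
    "phase_1_xray": {
        "_pd_block_id": "phase_1_xray_unique_string",
        "_pd_phase_block_id": ["phase_1_unique_string"],
        "_pd_block_diffractogram_id": ["xray_pattern_unique_string"],
        "_pd_calc_phase_intensity_net": [0, 1],
    },
    "xray_pattern": {
        "_pd_block_id": "xray_pattern_unique_string",
        "_pd_phase_block_id": ["phase_1_unique_string", "phase_2_unique_string"],
    },
}


def index_by_block_id(blocks):
    """Map each _pd_block_id value to the name of the block that declares it."""
    return {b["_pd_block_id"]: name for name, b in blocks.items()}


def resolve(blocks, block_name, pointer_name):
    """Follow the pointers in one block; unresolved pointers come back as None."""
    index = index_by_block_id(blocks)
    return [index.get(p) for p in blocks[block_name].get(pointer_name, [])]


def profiles_of_pattern(blocks, pattern_name):
    """Reverse lookup: per-phase profile blocks that point at this pattern."""
    pattern_id = blocks[pattern_name]["_pd_block_id"]
    return sorted(
        name
        for name, b in blocks.items()
        if pattern_id in b.get("_pd_block_diffractogram_id", [])
        and "_pd_calc_phase_intensity_net" in b
    )


print(resolve(blocks, "xray_pattern", "_pd_phase_block_id"))
print(profiles_of_pattern(blocks, "xray_pattern"))
```

The second phase pointer resolves to `None` here only because the toy `blocks` dictionary omits the phase 2 block. In a real file that would signal a dangling pointer.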
Maybe my previous examples were a little too complex
Here I propose the following new data names
In this one: The crystal structures know about their diffraction patterns through _pd_block_diffractogram_id. The crystal structures know about their individual profiles through _pd_profile_block_id. The crystal structures don't know about each other.
The individual profiles know about their diffraction pattern through _pd_block_diffractogram_id. The individual profiles of a crystal structure don't know about each other. The individual profiles know about their crystal structure through _pd_phase_block_id.
The diffraction patterns don't know about each other. The diffraction patterns know about their individual profiles through _pd_profile_block_id. The diffraction patterns know about their crystal structures through _pd_phase_block_id.
Anyway, I don't really know what I'm doing here, so I'll stop for now.
data_STR1_block
_pd_block_id STR1
loop_
_pd_block_diffractogram_id
XRAY
NEUTRON
loop_
_pd_profile_block_id
STR1_XRAY
STR1_NEUTRON
loop_
_refln_d_spacing
2.3
3.4
4.5
5.6
#other crystal structure information
data_STR2_block
_pd_block_id STR2
loop_
_pd_block_diffractogram_id
XRAY
NEUTRON
loop_
_pd_profile_block_id
STR2_XRAY
STR2_NEUTRON
loop_
_refln_d_spacing
2.35
3.45
4.55
5.65
#other crystal structure information
data_XRAY_block
_pd_block_id XRAY
loop_
_pd_phase_block_id
_pd_profile_block_id
STR1 STR1_XRAY
STR2 STR2_XRAY
loop_
_pd_meas_2theta_scan
_pd_meas_counts_total
_pd_calc_intensity_total
_pd_proc_intensity_bkg_calc
1 2 3 4
2 3 4 5
#etc
data_NEUTRON_block
_pd_block_id NEUTRON
loop_
_pd_phase_block_id
_pd_profile_block_id
STR1 STR1_NEUTRON
STR2 STR2_NEUTRON
loop_
_pd_meas_time_of_flight
_pd_proc_d_spacing
_pd_meas_counts_total
_pd_calc_intensity_total
_pd_proc_intensity_bkg_calc
1 2 3 4 5
2 3 4 5 6
#etc
data_STR1_XRAY_block
_pd_block_id STR1_XRAY
loop_
_pd_block_diffractogram_id
_pd_phase_block_id
XRAY STR1
loop_
_pd_meas_2theta_scan
_pd_proc_profile_intensity_total
1 2
2 3
#etc
data_STR1_NEUTRON_block
_pd_block_id STR1_NEUTRON
loop_
_pd_block_diffractogram_id
_pd_phase_block_id
NEUTRON STR1
loop_
_pd_proc_d_spacing
_pd_proc_profile_intensity_total
1 2
2 3
#etc
data_STR2_XRAY_block
_pd_block_id STR2_XRAY
loop_
_pd_block_diffractogram_id
_pd_phase_block_id
XRAY STR2
loop_
_pd_meas_2theta_scan
_pd_proc_profile_intensity_total
1 2
2 3
#etc
data_STR2_NEUTRON_block
_pd_block_id STR2_NEUTRON
loop_
_pd_block_diffractogram_id
_pd_phase_block_id
NEUTRON STR2
loop_
_pd_proc_d_spacing
_pd_proc_profile_intensity_total
1 2
2 3
#etc
It is not clear to me how the intensity information would be stored. As a reflection table? As I recall (perhaps incorrectly), the reflection table allows a phase id to be included, which means that the reflection table can be included in the dataset block. This seems like a cleaner way to handle things than setting up a new block structure.
OTOH, there is the need to set up for n*m sets of profile descriptions (where there are n phases and m datasets). It might still be better to use a looped variable for that, where a phase ID would be included in a table by dataset (not good to put them in a phase block, since the description used might vary by dataset type). This would be valuable if the definitions available for profile information were to be expanded.
Brian (T.)
As a reflection table?
Yes, you can store reflections from individual phases together in a single table when you include _pd_refln_phase_id:
loop_
_refln_index_h
_refln_index_k
_refln_index_l
_pd_refln_phase_id
_refln_d_spacing
1 2 3 a 3.4
1 4 8 b 3.6
1 7 9 b 3.8
1 4 1 a 6.6
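For anyone reading such a combined table back, splitting it per phase is a one-pass grouping on `_pd_refln_phase_id`. A small Python sketch, assuming the loop has already been parsed into row tuples (the `rows` list and the function name are illustrative only):

```python
from collections import defaultdict

# Rows of the loop above: (h, k, l, _pd_refln_phase_id, _refln_d_spacing)
rows = [
    (1, 2, 3, "a", 3.4),
    (1, 4, 8, "b", 3.6),
    (1, 7, 9, "b", 3.8),
    (1, 4, 1, "a", 6.6),
]


def reflections_by_phase(rows):
    """Group (hkl, d-spacing) pairs by the phase id carried in each row."""
    grouped = defaultdict(list)
    for h, k, l, phase_id, d_spacing in rows:
        grouped[phase_id].append(((h, k, l), d_spacing))
    return dict(grouped)


print(reflections_by_phase(rows))
```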
OTOH, there is the need to set up for n*m sets of profile descriptions (where there are n phases and m datasets).
Yes, this is clunky.
It might still be better to use a looped variable for that where a phase ID would be included in a table by dataset
Does "dataset" mean "data block containing a diffraction pattern"? If so, there would need to be a bunch more keywords, but it would cut down on the number of blocks. You would need a profile version of every possible ordinate you could use as X and Y (TOF, 2theta_meas, 2theta_corrected, d_spacing..., intensity, counts, net, total...).
This would definitely mimic a reflection table, just for every point in the diffraction pattern.
It could look something like:
data_XRAY_diffraction_pattern_block
_pd_block_id XRAY
loop_
_pd_phase_id
_pd_phase_block_id
a STR1
b STR2
loop_
_pd_meas_2theta_scan
_pd_meas_counts_total
_pd_calc_intensity_total
_pd_proc_intensity_bkg_calc
1.00 2 7 1
1.02 3 7 1
1.04 4 9 3
#etc
loop_
_pd_profile_meas_2theta_scan
_pd_profile_phase_id
_pd_profile_intensity_net
1.00 a 4
1.00 b 2
1.02 a 4
1.04 b 2
#etc
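A plotting program would typically want that long-format loop back as one column per phase. The following Python sketch does the pivot, under the assumption that rows are matched on the 2theta value exactly as written in the file (kept as a string here to avoid float comparison; a dedicated point id would be more robust). The names `rows` and `to_wide` are hypothetical.

```python
# Rows of the long-format profile loop: (2theta as written, phase id, net intensity)
rows = [
    ("1.00", "a", 4),
    ("1.00", "b", 2),
    ("1.02", "a", 4),
    ("1.04", "b", 2),
]


def to_wide(rows, phases):
    """Pivot (2theta, phase, net) rows into one row per 2theta value."""
    by_point = {}
    for two_theta, phase_id, net in rows:
        by_point.setdefault(two_theta, {})[phase_id] = net
    return [
        # Missing (2theta, phase) combinations are reported as None.
        (two_theta, *(by_point[two_theta].get(p) for p in phases))
        for two_theta in sorted(by_point, key=float)
    ]


for row in to_wide(rows, ["a", "b"]):
    print(row)
```

The `None` entries make it visible that, in this layout, nothing forces every phase to be tabulated at every point.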
I think the time has come to figure out general principles for presenting complicated data. These principles would apply to PD as well as modulated + composite and any other complex dataset. The plan is to work these out for powder by imagining complicated scenarios and making sure they work. The following is a simple summary of what I've come up with so far. Note this is all in terms of DDLm dictionaries; DDL1 could never cope properly with the demands of any reasonably complex dataset. NB The use of block pointers addresses a separate problem that needn't complicate things here.
Data names in a Set category may only take a single value in a single data block.
[...] _audit.schema corresponds to the Set categories defined in the core + powder dictionaries.
[...] _audit.schema from the default value can change the categories that are Set categories in a particular data block.
[...] Set categories and thereby define how things are distributed over data blocks.
[...] _audit.schema for data blocks where we want to collect information from multiple data blocks.
As I understand it, the way in which powder would like to split things up is to have information specific to a particular phase in separate data blocks. Therefore, in DDLm terms, pd_phase is a Set category. This flows through to all "child" data names of _pd_phase.id, e.g. _pd_profile.phase_id must also only take a single value in a single data block, so you can't loop _pd_profile as in the previous example, and the same goes for _pd_refln.phase_id.
Cif_core specifies that diffrn is a Set category, so different experimental conditions/radiations should also be in separate data blocks. I think this means that there is one diffractogram per data block as well.
Now I gather that a "summary block" is desirable, where selected information found in the other blocks is collated. This would be where block pointers would be included, but it should be the case that the same information could be obtained by just reading in all of the other data blocks. In any case, the summary block would need to e.g. loop _pd_phase.id and _diffrn.id, which means they are no longer Set categories. The way to write such a block would be to set _audit.schema to something like Powder Summary (which we can define) and then loop to our heart's content.
I think this all started because @rowlesmr wanted to record the contributions of each phase to the calculated diffraction pattern. In the scheme posited above, this would require a separate tabulation in each data block corresponding to a particular diffraction pattern + particular phase, as well as a tabulation of the overall fit in each data block corresponding to a particular diffraction pattern (with no phase-specific information). This may seem vaguely wasteful of space due to the repetition of the 2 theta values, but the alternative would be to define a further _audit.schema that allowed phases to be looped but not diffrn.
So my question is, does the above scheme cover all situations that you've encountered? Have I perhaps missed something else that should be separated into another data block?
I am afraid that I do not understand the meaning of “
Etc.”
So I am just not following the gist of what you are saying.
I now understand what is wanted to provide partial patterns by phase. From a logistics perspective one really wants all the partials in a single loop. What one really needs is a way to say a CIF name gets N values not 1 for every row in the table. I think star might have a quoting or grouping mechanism that allows this even if CIF does not.
Brian
Data names in a Set category may only take a single value in a single data block.
Apologies for the lack of clarity. In DDLm dictionaries, categories are classified as Set or Loop. Datanames in a Set category may only have one value per data block (something like list = no in DDL1), so if there are in fact many values (e.g. many phases) then having those phase_ids in a Set category forces those phases to be listed in separate data blocks. Classifying categories between Set and Loop enables us to define how to present complex data unambiguously. So what I'm trying to pin down is exactly how we would like to do that. Note that the single value restriction applies only to the "topmost" data names, in our case _pd_phase.id. Child data names (the ones that draw from its values) do not have to belong to Set categories.
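As a rough illustration of the Set/Loop rule, the Python sketch below checks whether a parsed data block gives more than one value to a Set-category key. Everything here is an assumption for illustration: the `SET_KEYS_BY_SCHEMA` table, the "default" label, and treating "Powder Summary" as a schema under which neither key is a Set key are placeholders for whatever the dictionaries end up defining.

```python
# Hypothetical table: which "topmost" data names are Set keys under each schema.
SET_KEYS_BY_SCHEMA = {
    "default": {"_pd_phase.id", "_diffrn.id"},
    "Powder Summary": set(),  # assumed: summary blocks may loop both ids
}


def set_violations(block, schema="default"):
    """Return Set-category keys that take more than one value in this block."""
    set_keys = SET_KEYS_BY_SCHEMA[schema]
    return {
        name: sorted(set(values))
        for name, values in block.items()
        if name in set_keys and len(set(values)) > 1
    }


# A block that loops over phases and diffraction conditions, as a summary would.
summary = {
    "_pd_phase.id": ["a", "b", "c"],
    "_diffrn.id": ["XRAY", "NEUTRON"],
}

print(set_violations(summary))                    # both keys flagged
print(set_violations(summary, "Powder Summary"))  # nothing flagged
```

Only the topmost keys are checked, in line with the note that the single-value restriction applies to `_pd_phase.id` itself.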
I now understand what is wanted to provide partial patterns by phase. From a logistics perspective one really wants all the partials in a single loop. What one really needs is a way to say a CIF name gets N values not 1 for every row in the table. I think star might have a quoting or grouping mechanism that allows this even if CIF does not.
The only way to do this in a single loop in even our most flexible interpretation of the relational model is to have a separate column labelling the phase this calculated intensity belongs to. So for two phases you would have what @rowlesmr proposed:
loop_
_pd_profile_meas_2theta_scan
_pd_profile_phase_id
_pd_profile_intensity_net
1.00 a 4
1.00 b 2
1.02 a 4
1.04 b 2
#etc
If that is what you would prefer then we can do that. I don't understand why having the partial pattern grouped together in a separate data block with the per phase, per histogram information is less practical though.
What do you mean by "logistically" when wanting the partials all in one loop?
If they are all in one loop, you probably don't need the complexity of linking them to the structures and diffractograms, as you could just stick it in the diffractogram block and piggyback off the linking that is already there. If each profile is in its own block, you do need to link everything, but you get the simplicity of "this block is just for that phase in that other diffractogram".
In both cases, the total number of datapoints you're adding is the same, as you still need to repeat each datapoint in the measured data for each profile you want to record.
.
I should explain my "clunky" comment. Ideally, you could have a single loop that gives columns for 2theta, meas_intensity, calc_intensity, and then one column per individual profile, but that would necessitate either repeating the profile intensity dataname in a loop, or having an arbitrary number of datanames to hold profile_1, profile_2... intensities.
The clunkiness arises from having to repeat 2theta values in different loops or blocks that already exist.
Here is my thinking on this: the goal of a loop_ structure is to bring together information that is related and shares a common ordinate. It is more difficult to relate such data when spread out over multiple loops and even harder when spread across blocks. The partial structure factor is definitely such a quantity, since in the end one probably wants to be able to see the partials superimposed or at least relate them, so I would really want them in a single loop. I really hate the idea of breaking up data across blocks that is logically and structurally linked. Now, something like 20 years after the introduction of multiple blocks for related information in pdCIF, do we yet have any software that assembles multiple blocks?
While less than ideal, here is one way to accommodate partials in the current syntax:
loop_
_pd_profile_meas_2theta_scan
_pd_profile_intensity_partials
5.00 "4 2 0"
5.02 "4 3 0"
5.04 "3 10 0"
loop_
_pd_profile_partials_phase_assignment
a
b
c
Another would be this:
loop_
_pd_profile_meas_2theta_scan
_pd_profile_intensity_partialA
_pd_profile_intensity_partialB
_pd_profile_intensity_partialC
5.00 4 2 0
5.02 4 3 0
5.04 3 10 0
Both have their disadvantages. Then again one could get inventive with CIF syntax and do something like this:
loop_
_pd_profile_meas_2theta_scan
_pd_profile_intensity_partial[ABC]
5.00 4 2 0
5.02 4 3 0
5.04 3 10 0
Or
loop_
_pd_profile_meas_2theta_scan
_pd_profile_intensity_partials
5.00 {4 2 0}
5.02 {4 3 0}
5.04 {3 10 0}
I would argue that a goal of CIF is to keep together all the information that shares a structure (using that term from a database perspective). One would really not want to encourage partials to be tabulated with different data ranges, step sizes etc., but why not if they are in logically disconnected structures?
Brian
Now, something like 20 years after the introduction of multiple blocks for related information in pdCIF, do we yet have any software that assembles multiple blocks?
pip install pdCIFplotter
:)
(Just on that, Dave Billings should be emailing you and James about what has just started in the CPD)
.
I've only looked at the pictures in "DDLm: A New Dictionary Definition Language". Is it possible to have vectors, where their length is defined by another data item?
#using a mixture of old and new syntax, as I don't know how to upgrade...
data_XRAY_diffraction_pattern_block
_pd_block_id XRAY
loop_
_pd_phase_id
_pd_phase_block_id
a STR1
b STR2
c STR3
loop_
_pd_meas_2theta_scan
_pd_meas_counts_total
_pd_calc_intensity_total
_pd_proc_intensity_bkg_calc
_pd_profile_intensity_net #length is defined as number of rows in _pd_phase_id ( or in _pd_phase_block_id)
5.00 120 120 100 [7, 8, 5]
5.02 123 121 100 [7, 8, 6]
#...
I am reminded of a quote I read once in a database book that I've never been able to find again: the relational model is always second-best, meaning that in any given situation you can find a more efficient, streamlined way to represent data, but the relational model will still be second-best when the situation changes, while your original streamlined approach is now much worse.
Anyway.
Matthew's suggestion of using CIF2 vectors would be workable, with another vector defined somewhere as per one of Brian's suggestions above to give the order of phases. There is no need to define a length for a CIF2 vector.
So, I've slightly expanded Matthew's example below. How does it look?
Notes on the example:
pd_phases is a per-diffraction-pattern category for information about phases in general.
[...] _pd_phases_presentation_order item
data_XRAY_diffraction_pattern_block
_pd_block_id XRAY
_pd_phases_presentation_order [a b c]
loop_
_pd_phase_id
_pd_phase_block_id
a STR1
b STR2
c STR3
loop_
_pd_meas_2theta_scan
_pd_meas_counts_total
_pd_calc_intensity_total
_pd_proc_intensity_bkg_calc
_pd_profile_phase_intensity_net
5.00 120 120 100 [7 8 5]
5.02 123 121 100 [7 8 6]
#...
Is it not possible to automatically define _pd_phases_presentation_order [a b c] from loop_ _pd_phase_id a b c?
It would be easier to maintain the CIF file if I only need to write down the phases in one place. Although, I do recall from somewhere (pycifrw docs?) that row order isn't guaranteed in CIFs...
No, the order of rows is very deliberately not significant. I understand your concerns with writing down the phases in more than one place; this is a key concern of the relational model, which aims to minimise duplication of information. The "ideal" relational approach in our case would have every separate phase in a separate data block, with no "summary block", meaning you really would only write the phases down once, and then shuffle the data around after reading it in, to match your problem of the day.
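Here is a small Python sketch of why the explicit order item is needed: the phase loop may be read back in any row order, so the position in each per-point list is tied to a phase only through the presentation order. The variable and function names are illustrative, and the values are taken from the example above.

```python
# Value of the proposed presentation-order item, as a plain list.
presentation_order = ["a", "b", "c"]

# The phase loop, deliberately in a different row order: (phase id, block id).
phase_rows = [("c", "STR3"), ("a", "STR1"), ("b", "STR2")]


def label_components(net_values, order):
    """Attach a phase id to each entry of one per-point list of net intensities."""
    if len(net_values) != len(order):
        raise ValueError("expected one net intensity per phase in the order list")
    return dict(zip(order, net_values))


block_of = dict(phase_rows)                # phase id -> block id, order-independent
by_phase = label_components([7, 8, 5], presentation_order)
by_block = {block_of[phase]: net for phase, net in by_phase.items()}

print(by_phase)
print(by_block)
```

Shuffling `phase_rows` leaves both results unchanged, whereas reordering `presentation_order` would change which phase each intensity is assigned to.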
Relevant to this issue is http://comcifs.github.io/accepted/multi-block-principles. The core dictionary combined with that document and PD Loop/Set dictionary decisions dictates how multi-wavelength/sample/diffraction condition/histogram data are distributed over multiple data blocks. I suggest we work together on a document that lays out the principles for PD; happy to draft a first attempt and put it up at comcifs.github.io for discussion. It would be great to have the PD commission involved as well; that can happen once a draft is up for discussion.
I've now drafted a document for ongoing discussion: https://github.com/COMCIFS/comcifs.github.io/blob/master/draft/powder_data_presentation.md
How about something like this?
data_XRAY_diffraction_pattern_block
_pd_block_id XRAY
_pd_phases.profile_presentation_order [a b c]
loop_
_pd_phase.id
_pd_phase.block_id
a STR1
b STR2
c STR3
loop_
_pd_meas.2theta_scan
_pd_meas.counts_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
_pd_phase.profile_intensity_net
5.00 120 120 100 [7 8 5]
5.02 123 121 100 [7 8 6]
#...
New data items:
_pd_phases.profile_presentation_order
_pd_phase.profile_intensity_net
_pd_phase.profile_intensity_total
As per previous discussions, I proposed that there would always be something like _pd_calc.phase_id to offer the option of tabulating the calculated contribution in separate per-phase blocks. I realise now that this idea of mine is fundamentally wrong, because pdCIF has been set up to make it possible to tabulate measured and calculated intensities in a single loop, and there is no particular phase that measured intensities (in general!) belong to. Therefore there cannot be a specific phase associated with this loop, and the intensities-in-a-list proposal is compatible with current pdCIF.
Therefore, any per-phase calculated intensity loop must be in a different (new) category, let's call it pd_calc_components, where the calculated intensities are listed for a single phase. By defining things this way, it is possible for the above component-intensities-in-a-list proposal and the per-phase listing proposal to be compatible and co-exist.
So that was a long-winded way of saying, yes, I have no objections to this proposal, as long as pd_calc_components exists.
I only know enough to be dangerous, so questions:
isn't that what _pd_phases.profile_presentation_order is doing? mapping a _pd_phase.profile_intensity_total to a _pd_phase_id to a _pd_phase_block_id?
and PD_CALC_COMPONENTS needs to be a child of PD_DATA so everything can be looped nicely? Isn't that just shifting the issue down the line one step?
.
or is it something like:
_pd_phases.profile_presentation_order is a matrix of _pd_phase.id values, and
_pd_phases.profile_intensity_net|total is a matrix of _pd_calc_components.profile_intensity_net|total
such that the order of values given in _pd_phases.profile_intensity_net matches the order of phases given in _pd_phases.profile_presentation_order
.
or is it to do with a summary block listing all of the histograms, phases, component profiles, and the like?
.
Example time!
component-intensities-in-a-list:
data_XRAY_diffraction_pattern_block
_pd_block_id XRAY
_pd_phases.profile_presentation_order [a b c]
loop_
_pd_phase.id
_pd_phase.block_id
a STR1
b STR2
c STR3
loop_
_pd_meas.2theta_scan
_pd_meas.counts_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
_pd_phases.profile_intensity_net
5.00 120 120 100 [7 8 5]
5.02 123 121 100 [7 8 6]
#...
Per-phase listing:
data_summary
#things go here
data_XRAY_diffraction_pattern_block
_pd_block_id XRAY
loop_
_pd_phase.id
_pd_phase.block_id
a STR1
b STR2
c STR3
loop_
_pd_meas.2theta_scan
_pd_meas.counts_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
5.00 120 120 100
5.02 123 121 100
#...
data_STR1XRAY_component_block
_pd_block.id STR1_XRAY
_pd_phase.block_id STR1
_pd_block.diffractogram_id XRAY
loop_
_pd_meas.2theta_scan
_pd_calc_components.profile_intensity_net
5.00 7
5.02 7
#...
data_STR2XRAY_component_block
_pd_block_id STR2_XRAY
_pd_phase.block_id STR2
_pd_block.diffractogram_id XRAY
loop_
_pd_meas.2theta_scan
_pd_proc.intensity_bkg_calc
_pd_calc_components.profile_intensity_total
5.00 100 108
5.02 100 108
#...
data_STR3XRAY_component_block
_pd_block_id STR3_XRAY
_pd_phase.block_id STR3
_pd_block.diffractogram_id XRAY
loop_
_pd_meas.2theta_scan
_pd_calc_components.profile_intensity_net
5.00 5
5.02 6
#...
So this is not to do with the summary block. By having the pd_calc_components block we can ensure that the machine-readable part of the dictionary is able to capture as many links as possible between data names, which in turn means that as much as possible of the dictionary can be interpreted and manipulated automatically using the relational model. The profile_presentation_order approach does capture the same relationships, but only if a programmer reads the text descriptions and implements the link between position in the list and phase; that is, the relationships are expressed outside of the relational model despite being expressible within the relational model. I know this is a bit of an abstract point, but experience shows that keeping as close as possible to the relational model keeps us robust against future changes.
Small point: lists (square-bracket-delimited values) are a CIF2 feature so any CIF reading software expecting CIF1 format is likely to fail rather than skipping over the value. Perhaps a more pedestrian reason for pd_calc_components as well as an incentive to handle CIF2?
I've written out some dREL below to assure myself that not having pd_calc_components is not going to render the data files somehow unable to be processed relationally. All seems fine so I can drop that pd_calc_components requirement for now and we can simply add it in future if it becomes desirable for some relational reason. For now dropping it just means that pure dictionary-based software that wants all information to do with a phase will not access any per-phase per-point information, and there is the CIF2 thing I mentioned above.
Also, CIF allows the use of massive image arrays of numbers instead of the pure relational approach of a table of x,y positions and pixel intensity. So it is not like using an array to save space is new.
I've written out some dREL showing the precise relationships between these categories. Note how dREL forces us to specify exactly how total intensity is calculated (i.e. whether or not scale factors are used).
# dREL pseudo-code for handling profile_presentation_order type information
# A Category method for populating a pd_calc_components category from profile_intensity_net information
loop pd as pd_calc { # loop over the rows of pd_calc
    for phase_num in 1:len(pd.profile_intensity_net) {
        pd_calc_components.(point_id = pd.point_id,
                            phase_id = pd_phases.profile_presentation_order[phase_num],
                            profile_intensity_net = pd.profile_intensity_net[phase_num]
                           )
    }
}
# dREL pseudo code for total intensity: attached to _pd_calc.intensity_total
# Called for every row in pd_calc
t = 0
loop pcc as pd_calc_components {
    t = t + pcc.profile_intensity_net # Is this right? Do we need a scale factor? Background?
}
pd_calc.intensity_total = t
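As a rough cross-check of the two dREL methods above, here is a plain Python sketch of the same idea. The dictionaries and lists are hypothetical in-memory stand-ins for the CIF categories (not any real CIF library API); the phase ids and intensities are taken from the example data earlier in this thread. No scale factor or background is applied, which is exactly the open question in the dREL comment.

```python
# Hypothetical stand-in for pd_phases.profile_presentation_order
presentation_order = ["a", "b", "c"]

# Hypothetical stand-in for pd_calc: one row per data point, with the
# per-phase net intensities stored positionally in a list.
pd_calc = [
    {"point_id": 1, "profile_intensity_net": [7, 8, 5]},
    {"point_id": 2, "profile_intensity_net": [7, 8, 6]},
]


def expand_components(pd_calc_rows, order):
    """Turn positional lists into one relational row per point per phase."""
    rows = []
    for row in pd_calc_rows:
        values = row["profile_intensity_net"]
        if len(values) != len(order):
            raise ValueError("list length must match the presentation order")
        for phase_id, value in zip(order, values):
            rows.append(
                {
                    "point_id": row["point_id"],
                    "phase_id": phase_id,
                    "profile_intensity_net": value,
                }
            )
    return rows


def net_sum(components, point_id):
    """Sum the per-phase net intensities for one data point (no scale, no bkg)."""
    return sum(
        c["profile_intensity_net"] for c in components if c["point_id"] == point_id
    )


pd_calc_components = expand_components(pd_calc, presentation_order)
```

Once the rows exist, "everything about phase b" is a simple filter on phase_id, with no need to know the list position.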
It is indeed possible to write dREL for the profile_presentation_order case, skipping pd_calc_components:
# dREL pseudo-code for calculating net total intensity, this is called for each row of pd_calc
t = 0
for i in 1:len(pd_calc.profile_intensity_net) {
    t = t + pd_calc.profile_intensity_net[i]
}
pd_calc.intensity_total = t
If we need to access the scale for a particular phase we get instead:
# dREL pseudo-code for calculating net total intensity scaled by phase scale
t = 0
for i in 1:len(pd_calc.profile_intensity_net) {
    ph = pd_phases.profile_presentation_order[i]
    scale = pd_xxx.scale[ph] # Don't actually record the scale?
    t = t + pd_calc.profile_intensity_net[i] * scale
}
pd_calc.intensity_total = t
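The same positional lookup can be sketched in Python to show where the link between list position and phase lives when pd_calc_components is skipped. The function name and the scale values below are invented purely for illustration (the thread has not settled where a per-phase scale would be recorded); the net intensities and phase ids come from the example data.

```python
def scaled_total(net_by_position, presentation_order, scale_by_phase):
    """Sum per-phase net intensities, each multiplied by its phase's scale.

    The phase is recovered from the position in the list, so the caller
    must supply the presentation order alongside the data.
    """
    total = 0.0
    for position, net in enumerate(net_by_position):
        phase_id = presentation_order[position]
        total += net * scale_by_phase[phase_id]
    return total


order = ["a", "b", "c"]
scales = {"a": 1.0, "b": 0.5, "c": 2.0}  # made-up scale factors
result = scaled_total([7, 8, 5], order, scales)
```

Note that the code only works because the programmer knows to pair the list with the presentation order; nothing in the data itself enforces it.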
I know this is a bit of an abstract point, but experience shows that keeping as close as possible to the relational model keeps us robust against future changes.
This sounds like a good reason to put it in.
Small point: lists (square-bracket-delimited values) are a CIF2 feature so any CIF reading software expecting CIF1 format is likely to fail rather than skipping over the value. Perhaps a more pedestrian reason for pd_calc_components as well as an incentive to handle CIF2?
I know the parser I'm fiddling around with writing for CIF1 just fails when it gets a '['. Pedestrian, but still legitimate.
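For what it's worth, a CIF1-only reader could at least bail out early with a clear message, because CIF2 files are required to start with the magic code `#\#CIF_2.0`. A minimal sketch, assuming the file contents are already decoded to a string (the function name is made up):

```python
CIF2_MAGIC = "#\\#CIF_2.0"  # literal text: #\#CIF_2.0


def looks_like_cif2(text):
    """Return True if the first line carries the CIF2 magic code."""
    # A byte-order mark may precede the magic code, so strip it first.
    lines = text.lstrip("\ufeff").splitlines()
    return bool(lines) and lines[0].startswith(CIF2_MAGIC)
```

A CIF1 parser could call this before tokenising and report "CIF2 not supported" instead of failing on the first '['.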
_pd_calc.intensity_total includes bkg and normalisation, and so is specified on the same scale as the observed intensities. _pd_calc.intensity_net does not contain bkg or normalisation and so is specified on the same scale as _pd_proc.intensity_net.
So I think, strictly,
# dREL pseudo code for total intensity: attached to _pd_calc.intensity_total
# Called for every row in pd_calc
t = pd_proc.intensity_bkg_calc # I don't know if this is legitimate, but it's what I want to do.
loop pcc as pd_calc_components {
    t += pcc.profile_intensity_total - pd_proc.intensity_bkg_calc
}
pd_calc.intensity_total = t
^ With this definition, overlaying _pd_meas.intensity_total and _pd_calc_components.profile_intensity_total means they'll line up and overlap nicely; there is no bkg offset, the intensities are on the same scale...
# dREL pseudo code for net intensity: attached to _pd_calc.intensity_net
# Called for every row in pd_calc
t = 0
loop pcc as pd_calc_components {
    t += pcc.profile_intensity_net
}
pd_calc.intensity_net = t
^ This definition requires that the bkg and normalisation corrections are identical for each _pd_calc_components.profile_intensity_net.
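To check the two definitions against the example numbers used earlier in this thread (bkg 100, per-phase net intensities 7, 8, 5 at the first point), here is a small Python sketch. It assumes every component total carries the same background and that no normalisation is applied; the function names are made up for illustration.

```python
def calc_intensity_total(bkg, component_totals):
    """Start from the background, then add each component's total minus bkg."""
    total = bkg
    for component_total in component_totals:
        total += component_total - bkg
    return total


def calc_intensity_net(component_nets):
    """Net intensity is the plain sum of the per-phase net intensities."""
    return sum(component_nets)


bkg = 100
nets = [7, 8, 5]
# Assumption: each per-phase total is its net intensity plus the shared bkg.
totals = [bkg + n for n in nets]

total_point_1 = calc_intensity_total(bkg, totals)
net_point_1 = calc_intensity_net(nets)
```

With these inputs the total comes out at 120, which agrees with the _pd_calc.intensity_total value in the first row of the example loop, and the net at 20.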
I think that the scale factor you're looking for should be _pd_proc.intensity_norm; _pd_proc.intensity_net doesn't go into detail on where to enumerate the "correction and normalization factors" used.
Still need to add _pd_calc_component.phase_id and _pd_calc_component.diffractogram_id data names to indicate in a machine-readable way that the information in pd_calc_component is per phase, per diffractogram.
Still need to add _pd_calc_component.phase_id and _pd_calc_component.diffractogram_id data names to indicate in a machine-readable way that the information in pd_calc_component is per phase, per diffractogram.
They are there.
Currently the calculated intensity _pd_calc_intensity_net is for the sum of all phases. It has been suggested that seeing the calculated contribution of each phase would also be useful for plotting. The sketch of a solution involves adding a child data name of phase_id to the pd_proc category.