COMCIFS / Powder_Dictionary

CIF definitions for powder diffraction

Allow per-phase calculated intensity #3

Closed jamesrhester closed 1 year ago

jamesrhester commented 2 years ago

Currently the calculated intensity _pd_calc_intensity_net is for the sum of all phases. It has been suggested that seeing the calculated contribution of each phase would also be useful for plotting. The sketch of a solution involves adding a child data name of phase_id to the pd_proc category.

rowlesmr commented 2 years ago

This one is my fault. I've been thinking recently about plotting pd data from CIF, and what would be good things to be able to see.

My initial idea of a solution to document the contribution from each phase is something like:

data_diffraction_pattern_info

loop_
_pd_phase_id
_pd_phase_block_id
1          long_unique_string_1
2          long_unique_string_2
3          long_unique_string_3

loop_
_pd_data_point_id
_pd_meas_2theta_scan
_pd_calc_intensity_net
1          5.00     0
2          5.02     6
…

loop_
_pd_data_point_id
_pd_phase_id
_pd_calc_phase_intensity_net
1          1          0
2          1          3
…

loop_
_pd_data_point_id
_pd_phase_id
_pd_calc_phase_intensity_net
1          2          0
2          2          1
…

loop_
_pd_data_point_id
_pd_phase_id
_pd_calc_phase_intensity_net
1          3          0
2          3          2
…

briantoby commented 2 years ago

I think this is a good argument for the single-block CIF with _pd_phase.id. This would allow expansion by adding a new column for each phase rather than a new loop. In fact, the above is invalid unless each loop is put in a separate block, since each loop overwrites the previous data names.

jamesrhester commented 2 years ago

Yes, @rowlesmr 's suggestion cannot work because you may not duplicate data names within a block. If each of the loops over _pd_data_point_id were in separate data blocks, and each data block had a value of _pd_phase_id within it, then it would work. It sort of looks like that was the original intention, as there were block pointers at the top of the example.
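The duplication problem can be illustrated with a small Python sketch. It is only an illustration: `duplicate_data_names` is a hypothetical helper, not part of any CIF library, and it assumes a simplified tokenisation (whitespace-separated tokens, no quoted strings containing spaces, no semicolon text fields).

```python
from collections import Counter


def duplicate_data_names(cif_text):
    """Return {block_name: [data names that occur more than once in that block]}.

    Simplifying assumption: tokens are split on whitespace, so quoted values
    containing spaces and semicolon-delimited text fields are not handled.
    """
    counts = {}
    block = None
    for line in cif_text.splitlines():
        line = line.split("#", 1)[0]  # naive comment stripping
        for token in line.split():
            if token.lower().startswith("data_"):
                block = token[5:]
                counts[block] = Counter()
            elif token.startswith("_") and block is not None:
                # CIF data names are case-insensitive
                counts[block][token.lower()] += 1
    return {
        name: sorted(tag for tag, n in tags.items() if n > 1)
        for name, tags in counts.items()
        if any(n > 1 for n in tags.values())
    }


example = """
data_diffraction_pattern_info
loop_
_pd_data_point_id
_pd_phase_id
_pd_calc_phase_intensity_net
1 1 0
2 1 3
loop_
_pd_data_point_id
_pd_phase_id
_pd_calc_phase_intensity_net
1 2 0
2 2 1
"""

print(duplicate_data_names(example))
```

Putting each loop in its own data block makes the report empty, which matches the fix described above.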

rowlesmr commented 2 years ago

Yeah, just noticed that. Multiple instances of a data name in a single block result in issues.

A modification of my example would be something like below. Each crystalline phase belongs to only one diffraction pattern, and therefore has a unique profile. Each diffraction pattern has many phases. I think everything knows about everything else.

data_overall_insitu_experiment
    # Many experimental patterns
    # Each experimental pattern collected at a different temperature, pressure, and/or time, but on the same instrument
    # Each experimental pattern has many phases
    # Each phase has only one experimental pattern
    # Each phase has only one calculated profile
    # Experiment probably done to report quantitative phase analysis

    # insert common information here

    loop_
    _pd_phase_block_id
    phase_1_pattern_1_unique_string
    phase_2_pattern_1_unique_string
    #...

    loop_
    _pd_block_diffractogram_id
    pattern_1_unique_string
    pattern_2_unique_string
    #...

data_phase_1_pattern_1
    _pd_block_id    phase_1_pattern_1_unique_string
    _pd_block_diffractogram_id  pattern_1_unique_string

    # crystal structure information would go here

    loop_
    _pd_data_point_id
    _pd_calc_phase_intensity_net
    1   0
    2   3
    #...

data_phase_2_pattern_1
    _pd_block_id    phase_2_pattern_1_unique_string
    _pd_block_diffractogram_id  pattern_1_unique_string

    # crystal structure information would go here

    loop_
    _pd_data_point_id
    _pd_calc_phase_intensity_net
    1   0
    2   1
    #...

data_pattern_1  
    _pd_block_id    pattern_1_unique_string

    loop_
    _pd_phase_id
    _pd_phase_block_id
    _pd_phase_mass_%
    1   phase_1_pattern_1_unique_string 45.5
    2   phase_2_pattern_1_unique_string 54.5

    #time, temperature, pressure, other information
    #hkl info goes here, too, probably.

    loop_
    _pd_data_point_id
    _pd_meas_2theta_scan
    _pd_meas_intensity_total
    _pd_proc_ls_weight
    _pd_calc_intensity_total
    _pd_proc_intensity_bkg_calc
    1   5.00    43.364  0.040297    25.962  25.962  
    2   5.01    38.007  0.050546    26.168  26.168  
    #...

# etc....   

A more complicated example (taken from NISI.cif) is where each phase has multiple experimental patterns, and each pattern has multiple phases.

In this one: The crystal structures know about their diffraction patterns through _pd_block_diffractogram_id. The crystal structures know about their individual profiles through _pd_phase_block_id (is that the correct way to do it?). The crystal structures don't know about each other. The individual profiles know about their crystal structure through _pd_phase_block_id (is that the correct way to do it?). The individual profiles of a crystal structure don't know about each other. The diffraction patterns know about the crystal structures through _pd_phase_block_id. The diffraction patterns have no knowledge of the individual phase profiles (should they?).

data_overall_structure_determination
    # Many experimental patterns, each collected at the same temperature, pressure, and/or time, but on different instruments
    # Each experimental pattern has many phases
    # Each phase has many experimental patterns
    # Each phase has many calculated profiles
    # Experiment probably done to report crystal structure

    # insert common information here

    loop_
    _pd_phase_block_id
    phase_1_unique_string
    phase_2_unique_string

    loop_
    _pd_block_diffractogram_id
    xray_pattern_unique_string
    cw_neutron_pattern_unique_string

data_phase_1
    _pd_block_id    phase_1_unique_string

    loop_
    _pd_block_diffractogram_id
    xray_pattern_unique_string
    cw_neutron_pattern_unique_string

    loop_
    _pd_phase_block_id
    phase_1_xray_unique_string
    phase_1_cw_unique_string

    #crystal structure information

data_phase_1_xray   
    _pd_block_id    phase_1_xray_unique_string
    _pd_phase_block_id  phase_1_unique_string
    _pd_block_diffractogram_id  xray_pattern_unique_string

    loop_
    _pd_data_point_id
    _pd_calc_phase_intensity_net
    1   0
    2   1
    #...

data_phase_1_cw 
    _pd_block_id    phase_1_cw_unique_string
    _pd_phase_block_id  phase_1_unique_string
    _pd_block_diffractogram_id  cw_neutron_pattern_unique_string

    loop_
    _pd_data_point_id
    _pd_calc_phase_intensity_net
    1   0
    2   3
    #...

data_phase_2
# blah
data_phase_2_xray   
# blah
data_phase_2_cw 
# blah

data_xray_pattern
    _pd_block_id    xray_pattern_unique_string

    _diffrn_radiation_wavelength 0.897654

    loop_
    _pd_phase_id
    _pd_phase_block_id
    1   phase_1_unique_string
    2   phase_2_unique_string

    loop_
    _pd_data_point_id
    _pd_meas_2theta_scan
    _pd_meas_intensity_total
    _pd_proc_ls_weight
    _pd_calc_intensity_total
    _pd_proc_intensity_bkg_calc
    1   5.00    43.364  0.040297    25.962  25.962  
    2   5.01    38.007  0.050546    26.168  26.168  
    #...

    loop_
    _refln_index_h
    _refln_index_k
    _refln_index_l
    _pd_refln_phase_id
    _refln_observed_status
    _refln_F_squared_meas
    _refln_F_squared_calc
    _refln_d_spacing
    2   0   0  1 o  16.505  16.060  1.76172
    3   1   1  2 o   4.854   5.087  1.63708
    2   2   2  2 o   0.000   0.000  1.56738
    4   0   0  2 o  10.301   9.812  1.35739
    2   2   0  1 o  15.566  15.195  1.24572
    #...

data_cw_pattern
    _pd_block_id    cw_neutron_pattern_unique_string

    _diffrn_radiation_wavelength 1.987

    loop_
    _pd_phase_id
    _pd_phase_block_id
    1   phase_1_unique_string
    2   phase_2_unique_string

    loop_
    _pd_data_point_id
    _pd_meas_2theta_scan
    _pd_meas_intensity_total
    _pd_proc_ls_weight
    _pd_calc_intensity_total
    _pd_proc_intensity_bkg_calc
    1   10.00   43.364  0.040297    25.962  25.962  
    2   10.10   38.007  0.050546    26.168  26.168  
    #...

    loop_
    _refln_index_h
    _refln_index_k
    _refln_index_l
    _pd_refln_phase_id
    _refln_observed_status
    _refln_F_squared_meas
    _refln_F_squared_calc 
    _refln_d_spacing
    4   0   0  2 o   9.773   9.812  1.35739
    3   3   1  2 o   4.799   4.801  1.24563
    2   2   0  1 o  15.254  15.195  1.24572
    #...

rowlesmr commented 2 years ago

Maybe my previous examples were a little too complex

Here I propose the following new data names

In this one: The crystal structures know about their diffraction patterns through _pd_block_diffractogram_id. The crystal structures know about their individual profiles through _pd_profile_block_id. The crystal structures don't know about each other.

The individual profiles know about their diffraction pattern through _pd_block_diffractogram_id. The individual profiles of a crystal structure don't know about each other. The individual profiles know about their crystal structure through _pd_phase_block_id.

The diffraction patterns don't know about each other. The diffraction patterns know about their individual profiles through _pd_profile_block_id. The diffraction patterns know about their crystal structures through _pd_phase_block_id.

Anyway, I don't really know what I'm doing here, so I'll stop for now.

data_STR1_block
    _pd_block_id STR1

    loop_
    _pd_block_diffractogram_id
    XRAY
    NEUTRON

    loop_
    _pd_profile_block_id
    STR1_XRAY
    STR1_NEUTRON

    loop_
    _refln_d_spacing
    2.3
    3.4
    4.5
    5.6 
    #other crystal structure information

data_STR2_block
    _pd_block_id STR2

    loop_
    _pd_block_diffractogram_id
    XRAY
    NEUTRON

    loop_
    _pd_profile_block_id
    STR2_XRAY
    STR2_NEUTRON

    loop_
    _refln_d_spacing
    2.35
    3.45
    4.55
    5.65
    #other crystal structure information

data_XRAY_block
    _pd_block_id XRAY

    loop_
    _pd_phase_block_id
    _pd_profile_block_id
    STR1 STR1_XRAY
    STR2 STR2_XRAY

    loop_
    _pd_meas_2theta_scan
    _pd_meas_counts_total
    _pd_calc_intensity_total
    _pd_proc_intensity_bkg_calc
    1 2 3 4
    2 3 4 5
    #etc

data_NEUTRON_block
    _pd_block_id NEUTRON

    loop_
    _pd_phase_block_id
    _pd_profile_block_id
    STR1 STR1_NEUTRON
    STR2 STR2_NEUTRON

    loop_
    _pd_meas_time_of_flight
    _pd_proc_d_spacing
    _pd_meas_counts_total
    _pd_calc_intensity_total
    _pd_proc_intensity_bkg_calc
    1 2 3 4 5
    2 3 4 5 6
    #etc

data_STR1_XRAY_block
    _pd_block_id STR1_XRAY

    loop_
    _pd_block_diffractogram_id
    _pd_phase_block_id
    XRAY STR1

    loop_
    _pd_meas_2theta_scan
    _pd_proc_profile_intensity_total
    1 2
    2 3
    #etc

data_STR1_NEUTRON_block
    _pd_block_id STR1_NEUTRON

    loop_
    _pd_block_diffractogram_id
    _pd_phase_block_id
    NEUTRON STR1

    loop_
    _pd_proc_d_spacing
    _pd_proc_profile_intensity_total
    1 2
    2 3
    #etc

data_STR2_XRAY_block
    _pd_block_id STR2_XRAY

    loop_
    _pd_block_diffractogram_id
    _pd_phase_block_id
    XRAY STR2

    loop_
    _pd_meas_2theta_scan
    _pd_proc_profile_intensity_total
    1 2
    2 3
    #etc

data_STR2_NEUTRON_block
    _pd_block_id STR2_NEUTRON

    loop_
    _pd_block_diffractogram_id
    _pd_phase_block_id
    NEUTRON STR2

    loop_
    _pd_proc_d_spacing
    _pd_proc_profile_intensity_total
    1 2
    2 3
    #etc

briantoby commented 2 years ago

It is not clear to me how the intensity information would be stored. As a reflection table? As I recall (perhaps incorrectly), the reflection table allows a phase id to be included, which means that the reflection table can be included in the dataset block. This seems like a cleaner way to handle things than setting up a new block structure.

OTOH, there is the need to set up for n*m sets of profile descriptions (where there are n phases and m datasets). It might still be better to use a looped variable for that, where a phase ID would be included in a table by dataset (not good to put them in a phase block, since the description used might vary by dataset type). This would be valuable if the definitions available for profile information were to be expanded.

Brian (T.)

rowlesmr commented 2 years ago

> As a reflection table?

Yes, you can store reflections from individual phases together in a single table when you include _pd_refln_phase_id

loop_
_refln_index_h
_refln_index_k
_refln_index_l
_pd_refln_phase_id
_refln_d_spacing
1 2 3 a 3.4
1 4 8 b 3.6
1 7 9 b 3.8
1 4 1 a 6.6
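As a rough illustration of how a reader might use such a combined table, the Python sketch below groups the rows of the loop above by their phase id, for example to draw one set of tick marks per phase. The function name `d_spacings_by_phase` is hypothetical, and the rows are assumed to have been parsed already.

```python
from collections import defaultdict

# Rows copied from the loop above: (h, k, l, phase_id, d_spacing)
reflections = [
    (1, 2, 3, "a", 3.4),
    (1, 4, 8, "b", 3.6),
    (1, 7, 9, "b", 3.8),
    (1, 4, 1, "a", 6.6),
]


def d_spacings_by_phase(rows):
    """Group d-spacings by the value of _pd_refln_phase_id."""
    grouped = defaultdict(list)
    for _h, _k, _l, phase_id, d_spacing in rows:
        grouped[phase_id].append(d_spacing)
    # Row order in a CIF loop carries no meaning, so sort for display.
    return {phase: sorted(values) for phase, values in grouped.items()}


print(d_spacings_by_phase(reflections))
```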

> OTOH, there is the need to set up for n*m sets of profile descriptions (where there are n phases and m datasets).

Yes, this is clunky.

> It might still be better to used a looped variable for that where a phase ID would be included in a table by dataset

Does "dataset" mean "data block containing a diffraction pattern"? If so, there would need to be a bunch more keywords, but it would cut down on the number of blocks. You would need a profile version of every possible ordinate you could use as X and Y (TOF, 2theta_meas, 2theta_corrected, d_spacing..., intensity, counts, net, total...).

This would definitely mimic a reflection table, just for every point in the diffraction pattern.

It could look something like:

data_XRAY_diffraction_pattern_block
_pd_block_id XRAY

loop_
_pd_phase_id
_pd_phase_block_id
a STR1 
b STR2 

loop_
_pd_meas_2theta_scan
_pd_meas_counts_total
_pd_calc_intensity_total
_pd_proc_intensity_bkg_calc
1.00 2 7 1
1.02 3 7 1
1.04 4 9 3
#etc

loop_
_pd_profile_meas_2theta_scan
_pd_profile_phase_id
_pd_profile_intensity_net
1.00 a 4
1.00 b 2
1.02 a 4
1.04 b 2
#etc

jamesrhester commented 2 years ago

I think the time has come to figure out general principles for presenting complicated data. These principles would apply to PD as well as modulated + composite and any other complex dataset. The plan is to work these out for powder by imagining complicated scenarios and making sure they work. The following is a simple summary of what I've come up with so far. Note that this is all in terms of DDLm dictionaries; DDL1 could never cope properly with the demands of any reasonably complex dataset. NB The use of block pointers addresses a separate problem that needn't complicate things here.

Key information:

  1. Data names in a Set category may only take a single value in a single data block.
  2. The default value of _audit.schema corresponds to the Set categories defined in the core + powder dictionaries
  3. A change in the value of data name _audit.schema from the default value can change the categories that are Set categories in a particular data block

Tasks:

  1. To define which categories in the powder CIF dictionary are Set categories and thereby define how things are distributed over data blocks
  2. To define a non-default value of _audit.schema for data blocks where we want to collect information from multiple data blocks.

As I understand it, the way in which powder would like to split things up is to have information specific to a particular phase in separate data blocks. Therefore, in DDLm terms, pd_phase is a Set category. This flows through to all "child" data names of _pd_phase.id: for example, _pd_profile.phase_id must also take only a single value in a single data block, so you can't loop _pd_profile as in the previous example, and the same goes for _pd_refln.phase_id.

Cif_core specifies that diffrn is a Set category, so different experimental conditions/radiations should also be in separate data blocks. I think this means that there is one diffractogram per data block as well.

Now I gather that a "summary block" is desirable, where selected information found in the other blocks is collated. This would be where block pointers would be included, but it should be the case that the same information could be obtained by just reading in all of the other data blocks. In any case, the summary block would need to e.g. loop _pd_phase.id and _diffrn.id which means they are no longer Set categories. The way to write such a block would be to set _audit.schema to something like Powder Summary (which we can define) and then loop to our heart's content.

I think this all started because @rowlesmr wanted to record the contributions of each phase to the calculated diffraction pattern. In the scheme posited above, this would require a separate tabulation in each data block corresponding to a particular diffraction pattern + particular phase, as well as a tabulation of the overall fit in each data block corresponding to a particular diffraction pattern (with no phase-specific information). This may seem vaguely wasteful of space due to the repetition of the 2 theta values, but the alternative would be to define a further _audit.schema that allowed phases to be looped but not diffrn.

So my question is, does the above scheme cover all situations that you've encountered? Have I perhaps missed something else that should be separated into another data block?

briantoby commented 2 years ago

I am afraid that I do not understand the meaning of "Data names in a Set category may only take a single value in a single data block", etc.

So I am just not following the gist of what you are saying.

I now understand what is wanted to provide partial patterns by phase. From a logistics perspective one really wants all the partials in a single loop. What one really needs is a way to say a CIF name gets N values not 1 for every row in the table. I think star might have a quoting or grouping mechanism that allows this even if CIF does not.

Brian

jamesrhester commented 2 years ago

Apologies for the lack of clarity. In DDLm dictionaries, categories are classified as Set or Loop. Datanames in a Set category may only have one value per data block (something like list = no in DDL1), so if there are in fact many values (e.g. many phases) then having those phase_ids in a Set category forces those phases to be listed in separate data blocks. Classifying categories between Set and Loop enables us to define how to present complex data unambiguously. So what I'm trying to pin down is exactly how we would like to do that. Note that the single value restriction applies only to the "topmost" data names, in our case _pd_phase.id. Child data names (the ones that draw from its values) do not have to belong to Set categories.
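To make the Set restriction concrete, here is a minimal Python sketch. It is not DDLm tooling: `set_category_violations` is a hypothetical helper, and data blocks are assumed to be already parsed into plain dictionaries mapping each data name to its list of values.

```python
def set_category_violations(blocks, set_names):
    """Report Set data names that take more than one value in a data block.

    blocks:    {block_name: {data_name: [values]}} (assumed pre-parsed)
    set_names: the "topmost" data names of Set categories, e.g. _pd_phase.id
    """
    violations = []
    for block_name, items in blocks.items():
        for name in set_names:
            values = set(items.get(name, []))
            if len(values) > 1:
                violations.append((block_name, name, sorted(values)))
    return violations


blocks = {
    # one phase per block: allowed when pd_phase is a Set category
    "phase_a": {"_pd_phase.id": ["a"]},
    "phase_b": {"_pd_phase.id": ["b"]},
    # two phases looped in a single block: not allowed under that schema
    "everything": {"_pd_phase.id": ["a", "b"]},
}

print(set_category_violations(blocks, ["_pd_phase.id"]))
```

Under a different _audit.schema the list passed as `set_names` would change, which is how a summary block could loop phases without being flagged.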

> I now understand what is wanted to provide partial patterns by phase. From a logistics perspective one really wants all the partials in a single loop. What one really needs is a way to say a CIF name gets N values not 1 for every row in the table. I think star might have a quoting or grouping mechanism that allows this even if CIF does not.

The only way to do this in a single loop in even our most flexible interpretation of the relational model is to have a separate column labelling the phase this calculated intensity belongs to. So for two phases you would have what @rowlesmr proposed:

loop_
_pd_profile_meas_2theta_scan
_pd_profile_phase_id
_pd_profile_intensity_net
1.00 a 4
1.00 b 2
1.02 a 4
1.04 b 2
#etc

If that is what you would prefer then we can do that. I don't understand why having the partial pattern grouped together in a separate data block with the per phase, per histogram information is less practical though.
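For readers wondering how the single-loop form would be consumed, the Python sketch below regroups the (2theta, phase id, intensity) rows into one curve per phase, ready for plotting. The function name `partial_patterns` is hypothetical and the rows are assumed to be already parsed from the loop.

```python
from collections import defaultdict

# Rows as in the loop above: (2theta, phase_id, net intensity)
rows = [
    (1.00, "a", 4),
    (1.00, "b", 2),
    (1.02, "a", 4),
    (1.04, "b", 2),
]


def partial_patterns(rows):
    """Return {phase_id: [(2theta, intensity), ...]} with each curve sorted by 2theta."""
    curves = defaultdict(list)
    for two_theta, phase_id, intensity in rows:
        curves[phase_id].append((two_theta, intensity))
    # Row order in a loop is not significant, so sort explicitly.
    return {phase: sorted(points) for phase, points in curves.items()}


print(partial_patterns(rows))
```

The same regrouping would be needed whichever layout is chosen; the layouts differ only in whether the grouping key is a column value or a data block.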

rowlesmr commented 2 years ago

What do you mean by "logistically" when wanting the partials all in one loop?

If they are all in one loop, you probably don't need the complexity of linking them to the structures and diffractograms, as you could just stick it in the diffractogram block and piggyback off the linking that is already there. If each profile is in its own block, you do need to link everything, but you get the simplicity of "this block is just for that phase in that other diffractogram".

In both cases, the total number of datapoints you're adding is the same, as you still need to repeat each datapoint in the measured data for each profile you want to record.

.

I should explain my "clunky" comment. Ideally, you could have a single loop that gives columns for 2theta, meas_intensity, calc_intensity, and then one column per individual profile, but that would either necessitate repeating the profile intensity dataname in a loop, or having an arbitrary number of datanames to hold profile_1, profile_2... intensities

The clunkiness arises from having to repeat 2theta values in different loops or blocks that already exist.

briantoby commented 2 years ago

Here is my thinking on this: the goal of a loop_ structure is to bring together information that is related and shares a common ordinate. It is more difficult to relate such data when spread out over multiple loops and even harder when spread across blocks. The partial structure factor is definitely such a quantity, since in the end one probably wants to be able to see the partials superimposed or at least relate them, so I would really want them in a single loop. I really hate the idea of breaking up data across blocks that is logically and structurally linked. Now, something like 20 years after the introduction of multiple blocks for related information in pdCIF, do we yet have any software that assembles multiple blocks?

While less than ideal, here is one way to accommodate partials in the current syntax:

loop_
_pd_profile_meas_2theta_scan
_pd_profile_intensity_partials
5.00 "4 2 0"
5.02 "4 3 0"
5.04 "3 10 0"

loop_
_pd_profile_partials_phase_assignment
a
b
c

Another would be this

loop_
_pd_profile_meas_2theta_scan
_pd_profile_intensity_partialA
_pd_profile_intensity_partialB
_pd_profile_intensity_partialC
5.00 4 2 0
5.02 4 3 0
5.04 3 10 0

Both have their disadvantages. Then again one could get inventive with CIF syntax and do something like this:

loop_
_pd_profile_meas_2theta_scan
_pd_profile_intensity_partial[ABC]
5.00 4 2 0
5.02 4 3 0
5.04 3 10 0

Or

loop_
_pd_profile_meas_2theta_scan
_pd_profile_intensity_partials
5.00 {4 2 0}
5.02 {4 3 0}
5.04 {3 10 0}

I would argue that a goal of CIF is to keep together all the information that shares a structure (using that term from a database perspective). One would really not want to encourage partials to be tabulated with different data ranges, step sizes etc., but why not if they are in logically disconnected structures?

Brian

rowlesmr commented 2 years ago

> Now, something like 20 years after the introduction of multiple blocks for related information in pdCIF, do we yet have any software that assembles multiple blocks?

pip install pdCIFplotter :)

(Just on that, Dave Billings should be emailing you and James about what has just started in the CPD)

.

I've only looked at the pictures in "DDLm: A New Dictionary Definition Language". Is it possible to have vectors, where their length is defined by another data item?

#using a mixture of old and new syntax, as I don't know how to upgrade...
data_XRAY_diffraction_pattern_block
_pd_block_id XRAY

loop_
_pd_phase_id
_pd_phase_block_id
a STR1 
b STR2 
c STR3

loop_ 
_pd_meas_2theta_scan 
_pd_meas_counts_total
_pd_calc_intensity_total
_pd_proc_intensity_bkg_calc
_pd_profile_intensity_net #length is defined as number of rows in _pd_phase_id ( or in _pd_phase_block_id)
5.00 120 120 100 [7, 8, 5]
5.02 123 121 100 [7, 8, 6]
#...

jamesrhester commented 2 years ago

I am reminded of a quote I read once in a database book that I've never been able to find again: the relational model is always second-best, meaning that in any given situation you can find a more efficient, streamlined way to represent data, but the relational model will still be second-best when the situation changes, while your original streamlined approach is now much worse.

Anyway.

Matthew's suggestion of using CIF2 vectors would be workable, with another vector defined somewhere as per one of Brian's suggestions above to give the order of phases. There is no need to define a length for a CIF2 vector.

So, I've slightly expanded Matthew's example below. How does it look?

Notes on the example:

  1. New category pd_phases is a per-diffraction-pattern category for information about phases in general.
  2. When there are multiple diffraction patterns that have been fit, each separate diffraction pattern would need a _pd_phases_presentation_order item
  3. We would define dREL routines within the dictionary that define the use of these new datanames to be equivalent to presenting each phase's partials in an appropriate per-phase-per-diffraction-pattern block, so the per-phase-per-diffraction-pattern approach would remain an option.
  4. CIF1 format files would not be able to use the vector notation.

data_XRAY_diffraction_pattern_block
_pd_block_id XRAY
_pd_phases_presentation_order [a b c]

loop_
_pd_phase_id
_pd_phase_block_id
a STR1 
b STR2 
c STR3

loop_ 
_pd_meas_2theta_scan 
_pd_meas_counts_total
_pd_calc_intensity_total
_pd_proc_intensity_bkg_calc
_pd_profile_phase_intensity_net
5.00 120 120 100 [7, 8, 5]
5.02 123 121 100 [7, 8, 6]
#...

rowlesmr commented 2 years ago

Is it not possible to automatically define _pd_phases_presentation_order [a b c] from loop_ _pd_phase_id a b c?

It would be easier to maintain the CIF file if I only need to write down the phases in one place. Although, I do recall from somewhere (pycifrw docs?) that row order isn't guaranteed in CIFs...

jamesrhester commented 2 years ago

No, the order of rows is very deliberately not significant. I understand your concerns with writing down the phases in more than one place; this is a key concern of the relational model, which aims to minimise duplication of information. The "ideal" relational approach in our case would have every separate phase in a separate data block, with no "summary block", meaning you really would only write the phases down once, and then shuffle the data around after reading it in, to match your problem of the day.
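A short Python sketch shows why an explicit order item is needed. The phase loop is modelled as a dictionary whose rows have been deliberately shuffled, and each list of partials is labelled from the presentation order alone. The name `label_partials` is hypothetical, and the data values are those of the example earlier in the thread.

```python
# Value of _pd_phases_presentation_order from the example above
presentation_order = ["a", "b", "c"]

# The phase loop, rows deliberately shuffled: loop row order must not matter
phase_blocks = {"c": "STR3", "a": "STR1", "b": "STR2"}

# (2theta, meas counts, calc total, calc background, per-phase partials)
rows = [
    (5.00, 120, 120, 100, [7, 8, 5]),
    (5.02, 123, 121, 100, [7, 8, 6]),
]


def label_partials(rows, order):
    """Attach a phase id to every partial intensity using the explicit order."""
    labelled = []
    for two_theta, _meas, _calc, _bkg, partials in rows:
        if len(partials) != len(order):
            raise ValueError(f"expected {len(order)} partials at 2theta={two_theta}")
        labelled.append((two_theta, dict(zip(order, partials))))
    return labelled


for two_theta, by_phase in label_partials(rows, presentation_order):
    print(two_theta, {phase_blocks[p]: i for p, i in by_phase.items()})
```

In these sample rows the background plus the summed partials happens to equal the calculated total (100 + 7 + 8 + 5 = 120), although the thread does not state whether that identity is intended.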

jamesrhester commented 1 year ago

Relevant to this issue is http://comcifs.github.io/accepted/multi-block-principles. The core dictionary combined with that document and PD Loop/Set dictionary decisions dictates how multi-wavelength/sample/diffraction condition/histogram data are distributed over multiple data blocks. I suggest we work together on a document that lays out the principles for PD; I am happy to draft a first attempt and put it up at comcifs.github.io for discussion. It would be great to have the PD commission involved as well, which can happen once a draft is up for discussion.

jamesrhester commented 1 year ago

I've now drafted a document for ongoing discussion: https://github.com/COMCIFS/comcifs.github.io/blob/master/draft/powder_data_presentation.md

rowlesmr commented 1 year ago

How about something like this?

data_XRAY_diffraction_pattern_block
_pd_block_id XRAY
_pd_phases.profile_presentation_order [a b c]

loop_
_pd_phase.id
_pd_phase.block_id
a STR1 
b STR2 
c STR3

loop_ 
_pd_meas.2theta_scan 
_pd_meas.counts_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
_pd_phase.profile_intensity_net
5.00 120 120 100 [7, 8, 5]
5.02 123 121 100 [7, 8, 6]
#...

New data items:

jamesrhester commented 1 year ago

As per previous discussions, I proposed that there would always be something like _pd_calc.phase_id to offer the option of tabulating the calculated contribution in separate per-phase blocks. I realise now that this idea of mine is fundamentally wrong. pdCIF has been set up to make it possible to tabulate measured and calculated intensities in a single loop, and there is no particular phase that measured intensities (in general!) belong to. There cannot, therefore, be a specific phase associated with this loop, and so the intensities-in-a-list proposal is compatible with current pdCIF.

Therefore, any per-phase calculated intensity loop must be in a different (new) category, let's call it pd_calc_components, where the calculated intensities are listed for a single phase. By defining things this way, it is possible for the above component-intensities-in-a-list proposal and the per-phase listing proposal to be compatible and co-exist.

So that was a long-winded way of saying, yes, I have no objections to this proposal, as long as pd_calc_components exists.

rowlesmr commented 1 year ago

I only know enough to be dangerous, so questions:

isn't that what _pd_phases.profile_presentation_order is doing? mapping a _pd_phase.profile_intensity_total to a _pd_phase_id to a _pd_phase_block_id?

and PD_CALC_COMPONENTS needs to be a child of PD_DATA so everything can be looped nicely? Isn't that just shifting the issue down the line one step?

.

or is it something like:

_pd_phases.profile_presentation_order is a matrix of _pd_phase.id values, and _pd_phases.profile_intensity_net|total is a matrix of _pd_calc_components.profile_intensity_net|total

such that the order of values given in _pd_phases.profile_intensity_net matches the order of phases given in _pd_phases.profile_presentation_order

.

or is it to do with a summary block listing all of the histograms, phases, component profiles, and the like?

.

Example time!

component-intensities-in-a-list:

data_XRAY_diffraction_pattern_block
_pd_block_id XRAY
_pd_phases.profile_presentation_order [a b c]

loop_
_pd_phase.id
_pd_phase.block_id
a STR1 
b STR2 
c STR3

loop_ 
_pd_meas.2theta_scan 
_pd_meas.counts_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
_pd_phases.profile_intensity_net
5.00 120 120 100 [7 8 5]
5.02 123 121 100 [7 8 6]
#...

Per-phase listing

data_summary
#things go here

data_XRAY_diffraction_pattern_block
_pd_block_id XRAY

loop_
_pd_phase.id
_pd_phase.block_id
a STR1 
b STR2 
c STR3

loop_ 
_pd_meas.2theta_scan 
_pd_meas.counts_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
5.00 120 120 100 
5.02 123 121 100 
#...

data_STR1XRAY_component_block
_pd_block.id STR1_XRAY
_pd_phase.block_id STR1
_pd_block.diffractogram_id XRAY
loop_ 
_pd_meas.2theta_scan 
_pd_calc_components.profile_intensity_net
5.00 7
5.02 7
#...

data_STR2XRAY_component_block
_pd_block_id STR2_XRAY
_pd_phase.block_id STR2
_pd_block.diffractogram_id XRAY
loop_ 
_pd_meas.2theta_scan 
_pd_proc.intensity_bkg_calc
_pd_calc_components.profile_intensity_total
5.00 100 108
5.02 100 108
#...

data_STR3XRAY_component_block
_pd_block_id STR3_XRAY
_pd_phase.block_id STR3
_pd_block.diffractogram_id XRAY
loop_ 
_pd_meas.2theta_scan 
_pd_calc_components.profile_intensity_net
5.00  5
5.02  6
#...
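
To check that the two presentations carry the same information, here is a minimal Python sketch that regroups the per-phase component values into the list form. The numbers are copied by hand from the example blocks; the dictionary layout and the function name `to_list_form` are illustrative assumptions, not part of any CIF library.

```python
# Per-phase net intensities keyed by phase id, then by 2theta.
# Phase b was tabulated as totals (108) next to a background of 100,
# so its net contribution is taken as 108 - 100 = 8.
components = {
    "a": {5.00: 7, 5.02: 7},
    "b": {5.00: 108 - 100, 5.02: 108 - 100},
    "c": {5.00: 5, 5.02: 6},
}
presentation_order = ["a", "b", "c"]  # plays the role of profile_presentation_order


def to_list_form(components, order):
    """Return {2theta: [net intensity of each phase, in the given order]}."""
    points = sorted(components[order[0]])
    return {tth: [components[phase][tth] for phase in order] for tth in points}


list_form = to_list_form(components, presentation_order)
print(list_form)
```

The result has one list per data point, ordered by phase, which is the shape of the list-valued column in the component-intensities-in-a-list example.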

jamesrhester commented 1 year ago

So this is not to do with the summary block. By having the pd_calc_components category we can ensure that the machine-readable part of the dictionary captures as many links as possible between data names, which in turn means that as much as possible of the dictionary can be interpreted and manipulated automatically using the relational model. The profile_presentation_order approach does capture the same relationships, but only if a programmer reads the text descriptions and implements the link between position in the list and phase. That is, the relationships are expressed outside of the relational model despite being expressible within it. I know this is a bit of an abstract point, but experience shows that keeping as close as possible to the relational model keeps us robust against future changes.

Small point: lists (square-bracket-delimited values) are a CIF2 feature so any CIF reading software expecting CIF1 format is likely to fail rather than skipping over the value. Perhaps a more pedestrian reason for pd_calc_components as well as an incentive to handle CIF2?
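
As a rough illustration of the CIF1/CIF2 point, the Python sketch below checks for the CIF2 magic code at the start of a file and parses only the simplest kind of list value (flat, unquoted, whitespace-separated). The function names are hypothetical and this is nowhere near a full CIF2 parser; it only shows where a CIF1-oriented reader could report the problem instead of failing obscurely.

```python
CIF2_MAGIC = "#\\#CIF_2.0"  # the characters  #\#CIF_2.0  that open a CIF2 file


def looks_like_cif2(text):
    """True if the text starts with the CIF2 magic code (optional BOM ignored)."""
    return text.lstrip("\ufeff").startswith(CIF2_MAGIC)


def parse_simple_list(value):
    """Parse a flat, unquoted CIF2 list such as '[a b c]' into a list of strings.

    Nested lists, tables and quoted strings are deliberately rejected.
    """
    value = value.strip()
    if not (value.startswith("[") and value.endswith("]")):
        raise ValueError("not a list value: %r" % value)
    inner = value[1:-1]
    if any(ch in inner for ch in "[]{}'\""):
        raise ValueError("only flat, unquoted lists are handled in this sketch")
    return inner.split()


print(parse_simple_list("[a b c]"))
```

A reader that only understands CIF1 could call `looks_like_cif2` first and stop with a clear message when it returns true.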

I've written out some dREL below to assure myself that not having pd_calc_components is not going to render the data files somehow unable to be processed relationally. All seems fine so I can drop that pd_calc_components requirement for now and we can simply add it in future if it becomes desirable for some relational reason. For now dropping it just means that pure dictionary-based software that wants all information to do with a phase will not access any per-phase per-point information, and there is the CIF2 thing I mentioned above.

Also, CIF allows the use of massive image arrays of numbers instead of the pure relational approach of a table of x,y positions and pixel intensity. So it is not like using an array to save space is new.

I've written out some dREL showing the precise relationships between these categories. Note how dREL forces us to explicitly specify exactly how total intensity is calculated (ie whether or not scale factors are used).

# dREL pseudo-code for handling profile_presentation_order type information
# A Category method for populating a pd_calc_components category from profile_intensity_net information

loop pd as pd_calc {   # loop over the rows of pd_calc
    for phase_num in 1:len(pd.profile_intensity_net) {
        pd_calc_components.(point_id = pd.point_id,
                            phase_id = pd_phases.profile_presentation_order[phase_num],
                            profile_intensity_net = pd.profile_intensity_net[phase_num]
        )
    }
}

# dREL pseudo code for total intensity: attached to _pd_calc.intensity_total
# Called for every row in pd_calc

t = 0
loop pcc as pd_calc_components {
    t = t + pcc.profile_intensity_net   #Is this right? Do we need a scale factor? Background?
    }
pd_calc.intensity_total = t
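
For anyone who wants to try the idea outside dREL, here is a hedged Python sketch of the same two steps: expanding list-valued rows into one row per point and phase, then summing the per-phase values for a point. The in-memory layout, the `point_id` values and the function names are assumptions for illustration. Unlike the pseudo-code, the sum is explicitly restricted to the components of a single data point.

```python
# Hypothetical in-memory form of the list-style loop from the example.
presentation_order = ["a", "b", "c"]
pd_calc = [
    {"point_id": 1, "profile_intensity_net": [7, 8, 5]},
    {"point_id": 2, "profile_intensity_net": [7, 8, 6]},
]


def expand_components(rows, order):
    """Create one (point, phase) row per list element."""
    return [
        {"point_id": row["point_id"], "phase_id": phase, "profile_intensity_net": value}
        for row in rows
        for phase, value in zip(order, row["profile_intensity_net"])
    ]


def summed_net(components, point_id):
    """Sum the per-phase net intensities belonging to one data point."""
    return sum(
        c["profile_intensity_net"] for c in components if c["point_id"] == point_id
    )


components = expand_components(pd_calc, presentation_order)
print(summed_net(components, 1), summed_net(components, 2))
```

No scale factor or background is applied here, so the questions raised in the pseudo-code comment remain open.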

It is indeed possible to write dREL for the profile_presentation_order case, skipping pd_calc_components:

# dREL pseudo-code for calculating net total intensity, this is called for each row of pd_calc
t = 0
for v in pd_calc.profile_intensity_net {   # iterate over the per-phase values themselves
    t = t + v
}
pd_calc.intensity_total = t

If we need to access the scale for a particular phase we get instead:

# dREL pseudo-code for calculating net total intensity scaled by phase scale
t = 0
for i in 1:len(pd_calc.profile_intensity_net) {   # index into the list, as in the category method above
    ph = pd_phases.profile_presentation_order[i]
    scale = pd_xxx.scale[ph]  # Don't actually record the scale?
    t = t + pd_calc.profile_intensity_net[i] * scale
}
pd_calc.intensity_total = t
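
The scaled sum can also be sketched in Python, pairing each list element with its phase through the presentation order. The per-phase scale values here are invented purely for illustration, since no data name for them has been settled on, and `scaled_total` is a hypothetical helper.

```python
def scaled_total(intensities, order, scale_by_phase):
    """Sum per-phase net intensities, each multiplied by its phase scale.

    intensities    -- list of per-phase values for one data point
    order          -- phase ids in the same order as the list
    scale_by_phase -- mapping of phase id to scale (hypothetical values)
    """
    if len(intensities) != len(order):
        raise ValueError("list length does not match the presentation order")
    return sum(value * scale_by_phase[phase] for phase, value in zip(order, intensities))


order = ["a", "b", "c"]
scales = {"a": 1.0, "b": 2.0, "c": 0.5}  # made-up numbers, not from any data set
print(scaled_total([7, 8, 5], order, scales))
```

The length check makes explicit the one constraint the list approach relies on: every list must have exactly one entry per phase in the presentation order.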

rowlesmr commented 1 year ago

I know this is a bit of an abstract point, but experience shows that keeping as close as possible to the relational model keeps us robust against future changes.

This sounds like a good reason to put it in.

Small point: lists (square-bracket-delimited values) are a CIF2 feature so any CIF reading software expecting CIF1 format is likely to fail rather than skipping over the value. Perhaps a more pedestrian reason for pd_calc_components as well as an incentive to handle CIF2?

I know the parser I'm fiddling around with writing for CIF1 just fails when it gets a '['. Pedestrian, but still legitimate.

.

_pd_calc.intensity_total includes bkg and normalisation, and so is specified on the same scale as the observed intensities. _pd_calc.intensity_net does not contain bkg or normalisation and so is specified on the same scale as _pd_proc.intensity_net.

so I think, strictly,

# dREL pseudo code for total intensity: attached to _pd_calc.intensity_total
# Called for every row in pd_calc

t = pd_proc.intensity_bkg_calc  #I don't know if this is legitimate, but it's what I want to do.
loop pcc as pd_calc_components {
    t += pcc.profile_intensity_total - pd_proc.intensity_bkg_calc
    }
pd_calc.intensity_total = t

^ With this definition, overlaying _pd_meas.intensity_total and _pd_calc_components.profile_intensity_total means they'll line up and overlap nicely: there is no bkg offset and the intensities are on the same scale.

# dREL pseudo code for net intensity: attached to _pd_calc.intensity_net
# Called for every row in pd_calc

t = 0
loop pcc as pd_calc_components {
    t += pcc.profile_intensity_net
    }
pd_calc.intensity_net = t

^ This definition requires that the bkg and normalisation corrections are identical for each _pd_calc_components.profile_intensity_net
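
A quick numerical check of these two definitions, as a Python sketch using the first data point of the earlier example (background 100, per-phase net values 7, 8 and 5). It assumes every component shares the same background and that no normalisation is applied; the variable and function names are illustrative only.

```python
bkg = 100.0
component_net = {"a": 7.0, "b": 8.0, "c": 5.0}
# Each component "total" carries the full background, so it overlays the measured data.
component_total = {phase: net + bkg for phase, net in component_net.items()}


def calc_total(bkg, component_total):
    """Start from the background and add each component's excess over it."""
    t = bkg
    for value in component_total.values():
        t += value - bkg
    return t


def calc_net(component_net):
    """Sum the background-free per-phase contributions."""
    return sum(component_net.values())


print(calc_total(bkg, component_total), calc_net(component_net))
```

Under those assumptions the total comes out at 120, the value tabulated for _pd_calc.intensity_total at 5.00 in the example, and the net at 20.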

.

I think that the scale factor you're looking for should be _pd_proc.intensity_norm; _pd_proc.intensity_net doesn't go into detail on where to enumerate the "correction and normalization factors" used.

rowlesmr commented 1 year ago

Still need to add _pd_calc_component.phase_id and _pd_calc_component.diffractogram_id data names to indicate in a machine-readable way that the information in pd_calc_component is per phase, per diffractogram.

rowlesmr commented 1 year ago

Still need to add _pd_calc_component.phase_id and _pd_calc_component.diffractogram_id data names to indicate in a machine-readable way that the information in pd_calc_component is per phase, per diffractogram.

They are there.