The keys of `PD_CALC`, `PD_MEAS` and `PD_PROC`

COMCIFS / Powder_Dictionary

CIF definitions for powder diffraction

The keys of `PD_CALC`, `PD_MEAS` and `PD_PROC` #159

Open vaitkus opened 1 year ago

vaitkus commented 1 year ago

This is a bit of a technical question related to the parent-child relationships between looped categories.

The looped PD_DATA category is intended to function as "a 'container' category that is defined in order to allow raw, processed, and calculated data points in a diffraction data set to be optionally tabulated together". This is reflected by the fact that the looped PD_CALC, PD_MEAS and PD_PROC categories have it as their parent category. Now I have the following questions:

1. This allows the same point to have properties from all three categories, i.e. the same point can be described using items from PD_CALC, PD_MEAS and PD_PROC. Is this the intention or could this lead to some data anomalies?
2. All four categories have composite keys in the form of [ _cat.point_id, _cat.diffractogram_id]. All _cat.point_id data items (e.g. _pd_calc.point_id) are properly directly linked to the key of the parent category (_pd_data.point_id). However, all _cat.diffractogram_id items, including the one from the parent PD_DATA category, are linked to the _pd_diffractogram.id. Strictly from a formal point of view -- is this allowed (i.e. software should figure this out) or should _pd_calc.diffractogram_id, _pd_meas.diffractogram_id and _pd_proc.diffractogram_id actually be linked to _pd_data.diffractogram_id (and thus only transitively to _pd_diffractogram.id via _pd_data.diffractogram_id)?

briantoby commented 1 year ago

This is not directly answering your question, but I’ll comment on my motivation here way back in the distant past. In the most common case, one collects a diffraction pattern and fits those points. One loop. In less common cases, one collects a pattern at too fine a point spacing and for fitting, some of the observed points are merged together into a processed pattern that is used for fitting. Two loops: one for observed data & one for processed & calc. I wanted to use the same data names for obs & calc in both cases, database normalization (which was added later as a CIF goal) be damned.

On Jun 22, 2023, at 8:40 AM, Antanas Vaitkus @.***> wrote:


jamesrhester commented 1 year ago

@vaitkus' suggestion (2) is better; indeed, the *.diffractogram_id data names should point back to the PD_DATA category. This will just make it easier for checking software, I think, as mathematically they all come down to the same thing.
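To illustrate why the two linking schemes are equivalent for software that follows links to their end, here is a minimal Python sketch. The two dictionaries are hand-written stand-ins for the links a DDLm dictionary records (via `_name.linked_item_id`), and `link_chain` is a hypothetical helper, not part of any existing CIF library.

```python
# Hand-written stand-ins for the dictionary links (assumption: a real
# checker would read these from the dictionary itself).
CURRENT_LINKS = {
    "_pd_data.diffractogram_id": "_pd_diffractogram.id",
    "_pd_calc.diffractogram_id": "_pd_diffractogram.id",
    "_pd_meas.diffractogram_id": "_pd_diffractogram.id",
    "_pd_proc.diffractogram_id": "_pd_diffractogram.id",
}

PROPOSED_LINKS = {
    "_pd_data.diffractogram_id": "_pd_diffractogram.id",
    "_pd_calc.diffractogram_id": "_pd_data.diffractogram_id",
    "_pd_meas.diffractogram_id": "_pd_data.diffractogram_id",
    "_pd_proc.diffractogram_id": "_pd_data.diffractogram_id",
}


def link_chain(name, links):
    """Follow links from `name` until an item with no further link is reached."""
    chain = [name]
    seen = {name}
    while chain[-1] in links:
        nxt = links[chain[-1]]
        if nxt in seen:
            raise ValueError(f"circular link involving {nxt}")
        chain.append(nxt)
        seen.add(nxt)
    return chain


for child in ("_pd_calc", "_pd_meas", "_pd_proc"):
    name = f"{child}.diffractogram_id"
    # Both schemes end at the same item; the proposed one passes through PD_DATA.
    print(" -> ".join(link_chain(name, CURRENT_LINKS)))
    print(" -> ".join(link_chain(name, PROPOSED_LINKS)))
```

Under the proposed scheme a checker that only looks one link deep still finds the parent category key, which is the practical benefit mentioned above.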

The current parent-child arrangement of these categories was created (by me) as a way to use the parent-child relationships in DDLm to properly fit these categories into a relational scheme, so that data names from apparently different categories could be looped together, as Brian T wanted. The only "wrinkle" is that the PD_DATA category could be completely empty, while the child categories are populated, potentially violating the rule that child data names draw from the values of their parent data names. So the new additional rule is that, if the parent data names are completely missing, then their possible values are drawn from the union of their child data name values. This rule was created especially for pdCIF, and so far hasn't been used anywhere else, but the situation that Brian describes could, in principle, arise again.
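The fallback rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: keys are modelled as `(point_id, diffractogram_id)` tuples, and both function names are hypothetical rather than taken from any validation tool.

```python
def allowed_parent_keys(parent_keys, child_key_sets):
    """Return the key values that child rows may draw from.

    If the parent (PD_DATA) key values are present, they are the allowed set.
    If they are completely missing, the allowed set is the union of the key
    values found in the child loops.
    """
    if parent_keys:
        return set(parent_keys)
    allowed = set()
    for keys in child_key_sets:
        allowed |= set(keys)
    return allowed


def dangling_child_keys(parent_keys, child_key_sets):
    """Return child keys that have no matching parent key."""
    allowed = allowed_parent_keys(parent_keys, child_key_sets)
    return {key for keys in child_key_sets for key in keys if key not in allowed}


# Keys are (point_id, diffractogram_id) pairs.
calc_keys = [("1", "DIFFRACT"), ("2", "DIFFRACT")]
meas_keys = [("1", "DIFFRACT"), ("2", "DIFFRACT"), ("3", "DIFFRACT")]

# PD_DATA absent: the union of child keys stands in, so nothing dangles.
print(dangling_child_keys([], [calc_keys, meas_keys]))

# PD_DATA present but incomplete: point 3 in PD_MEAS has no parent row.
print(dangling_child_keys([("1", "DIFFRACT"), ("2", "DIFFRACT")],
                          [calc_keys, meas_keys]))
```

Note that with an empty parent category the check can never fail, which is the intended effect of the rule.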

vaitkus commented 1 year ago

> @vaitkus' suggestion (2) is better; indeed, the *.diffractogram_id data names should point back to the PD_DATA category. This will just make it easier for checking software, I think, as mathematically they all come down to the same thing.

Ok, I'll create a PR for that.

> The current parent-child arrangement of these categories was created (by me) as a way to use the parent-child relationships in DDLm to properly fit these categories into a relational scheme, so that data names from apparently different categories could be looped together, as Brian T wanted. The only "wrinkle" is that the PD_DATA category could be completely empty, while the child categories are populated, potentially violating the rule that child data names draw from the values of their parent data names. So the new additional rule is that, if the parent data names are completely missing, then their possible values are drawn from the union of their child data name values. This rule was created especially for pdCIF, and so far hasn't been used anywhere else, but the situation that Brian describes could, in principle, arise again.

I wonder if there won't be a slight problem here due to people incorrectly assuming that the *.point_id values can be independently assigned in each category loop while in reality they share the same namespace. What I mean is that people will naturally want to number points sequentially from 1 in each category loop, and in this way may inadvertently merge properties from several points into one.

Consider the following example in which the values from separate categories get incorrectly merged due to reused point IDs:

loop_
_pd_calc.point_id
_pd_calc.diffractogram_id
_pd_calc.item_1
_pd_calc.item_2
1 DIFFRACT 5 7
2 DIFFRACT 6 3
# ...

loop_
_pd_meas.point_id
_pd_meas.diffractogram_id
_pd_meas.item_a
1 DIFFRACT a
2 DIFFRACT a
3 DIFFRACT c
# ...

loop_
_pd_proc.point_id
_pd_proc.diffractogram_id
_pd_proc.item_x
1 DIFFRACT x
2 DIFFRACT y
3 DIFFRACT z
# ...

Joint PD_DATA loop:

loop_
_pd_data.point_id
_pd_data.diffractogram_id
_pd_calc.item_1
_pd_calc.item_2
_pd_meas.item_a
_pd_proc.item_x
1 DIFFRACT 5 7 a x 
2 DIFFRACT 6 3 a y
3 DIFFRACT ? ? c z

I am not saying that anything should be redesigned here, but maybe a disclaimer of some sort on the shared point identifier namespace should be added?
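The merging in the example above can be reproduced with a short Python sketch. It assumes each loop has already been read into a list of dictionaries (no CIF parser is involved), and `merge_on_key` is a hypothetical helper written for this illustration; the item names are the placeholders from the example.

```python
KEY_NAMES = ("point_id", "diffractogram_id")


def merge_on_key(loops):
    """Join rows from several loops on (point_id, diffractogram_id).

    Rows sharing a key are treated as describing the same point, which is
    what a shared point identifier namespace implies. Missing values are
    filled with "?".
    """
    merged = {}   # key -> {item name: value}
    columns = []  # non-key item names in order of first appearance
    for rows in loops:
        for row in rows:
            key = (row["point_id"], row["diffractogram_id"])
            target = merged.setdefault(key, {})
            for name, value in row.items():
                if name in KEY_NAMES:
                    continue
                if name not in columns:
                    columns.append(name)
                target[name] = value
    table = [
        [key[0], key[1]] + [values.get(name, "?") for name in columns]
        for key, values in merged.items()
    ]
    return columns, table


calc = [
    {"point_id": "1", "diffractogram_id": "DIFFRACT", "_pd_calc.item_1": "5", "_pd_calc.item_2": "7"},
    {"point_id": "2", "diffractogram_id": "DIFFRACT", "_pd_calc.item_1": "6", "_pd_calc.item_2": "3"},
]
meas = [
    {"point_id": "1", "diffractogram_id": "DIFFRACT", "_pd_meas.item_a": "a"},
    {"point_id": "2", "diffractogram_id": "DIFFRACT", "_pd_meas.item_a": "a"},
    {"point_id": "3", "diffractogram_id": "DIFFRACT", "_pd_meas.item_a": "c"},
]
proc = [
    {"point_id": "1", "diffractogram_id": "DIFFRACT", "_pd_proc.item_x": "x"},
    {"point_id": "2", "diffractogram_id": "DIFFRACT", "_pd_proc.item_x": "y"},
    {"point_id": "3", "diffractogram_id": "DIFFRACT", "_pd_proc.item_x": "z"},
]

columns, table = merge_on_key([calc, meas, proc])
for row in table:
    print(" ".join(row))
```

If the author of the file meant calculated point 1 and measured point 1 to be different points, nothing in the data reveals it; the join silently combines them.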

rowlesmr commented 1 year ago

> I wonder if there won't be a slight problem here due to people incorrectly assuming that the *.point_id values can be independently assigned in each category loop while in reality they share the same namespace.

Yes. I think the NiSi example CIF does this: https://web.archive.org/web/20131219003229/http://www.ccp14.ac.uk/ccp/ccp14/ftp-mirror/briantoby/pub/cryst/cif/NISI.cif

There isn't a one-to-one correspondence in point_id, and it isn't just an offset; you can see that the last peaks in both datasets correspond, and so on, just not linearly with point id.

I think this ties in with a previous correspondence I had with James re the number of permissible PD_DATA loops in a block, which I think is summarised here.

In particular, how can we deal with measured and processed diffractograms where there isn't a one-to-one correspondence of data points? If we're averaging, smoothing, splining, or otherwise altering data points (i.e. changing the data points such that there isn't a one-to-one correspondence), can they be considered to be the same diffractogram (as in, have identical _pd_diffractogram.id values)? Merely subsetting a measured dataset to get a processed one is already taken care of with ..

Maybe there could be _pd_proc_overall.source_diffractogram_id to point

save_pd_proc_overall.source_diffractogram_id

    _definition.id                '_pd_proc_overall.source_diffractogram_id'
    _definition.update            2023-06-24
    _description.text
;
    The original diffractogram (see _pd_diffractogram.id) from which the 
    current processed data were taken.

    This is used to refer to the original, measured data from which the
    current data has been created. It is to be used when there is not a 
    one-to-one correspondence between data points.

    Detail the processing steps utilised in _pd_proc.info_special_details.
;
    _name.category_id             pd_proc_overall
    _name.object_id               source_diffractogram_id
    _type.purpose                 Encode
    _type.source                  Related
    _type.container               Single  # Maybe List to have more than one diff_id?
    _type.contents                Text

save_

This only allows a proc dataset to be derived from a single meas dataset, whereas in practice it could be derived from more than one. But it is a starting point (maybe it could have a List of diff_id values).

rowlesmr commented 1 year ago

This is just saying that there are no calc items for that data point, which may be entirely legit; consider an excluded region. Strictly, the data points excluded should be given a . value, in that case.

loop_
_pd_data.point_id
_pd_data.diffractogram_id
_pd_calc.item_1
_pd_calc.item_2
_pd_meas.item_a
_pd_proc.item_x
1 DIFFRACT 5 7 a x 
2 DIFFRACT 6 3 a y
3 DIFFRACT ? ? c z

But yes, your concern is definitely a legitimate one and one that I've seen in the wild.

rowlesmr commented 1 year ago

> I wonder if there won't be a slight problem here due to people incorrectly assuming that the *.point_id values can be independently assigned in each category loop while in reality they share the same namespace.

Maybe add some text to the descriptions of `_pd_calc|meas|proc.point_id`?

The current descriptions are (essentially):

Arbitrary label identifying a calculated|measured|processed data point. Used to identify a specific entry in a list of values forming the calculated|measured|processed diffractogram. The role of this identifier may be adopted by _pd_data.point_id if measured, processed, and/or calculated intensity values are combined in a single list.

Could be changed to something like:

Arbitrary label identifying a calculated|measured|processed data point. Used to identify a specific entry in a loop of values forming the calculated|measured|processed diffractogram. Note that identical values of _pd_calc.point_id, _pd_meas.point_id, and _pd_proc.point_id refer to the same point, and thus provide a way of indicating that points in disparate loops are equivalent. The role of this identifier may be adopted by _pd_data.point_id if measured, processed, and/or calculated intensity values are combined in a single loop.

For reference, the description of _pd_calc_component.point_id (in a different category) is

Arbitrary label identifying a calculated component data point. Used to identify a specific entry in a list of values forming the calculated component diffractogram.

The value of _pd_calc_component.point_id must be the same as the value of _pd_data.point_id given to the equivalent data point in the measured/processed/calculated diffractogram to which this component belongs.