Open vaitkus opened 1 year ago
This is not directly answering your question, but I’ll comment on my motivation here way back in the distant past. In the most common case, one collects a diffraction pattern and fits those points. One loop. In less common cases, one collects a pattern at too fine a point spacing and for fitting, some of the observed points are merged together into a processed pattern that is used for fitting. Two loops: one for observed data & one for processed & calc. I wanted to use the same data names for obs & calc in both cases, database normalization (which was added later as a CIF goal) be damned.
On Jun 22, 2023, at 8:40 AM, Antanas Vaitkus @.***> wrote:
This is a bit of a technical question related to the parent-child relationships between looped categories.
The looped PD_DATA category is intended to function as "a 'container' category that is defined in order to allow raw, processed, and calculated data points in a diffraction data set to be optionally tabulated together". This is reflected by the fact that the looped PD_CALC, PD_MEAS and PD_PROC categories have it as their parent category. Now I have the following questions:
@vaitkus' suggestion (2) is better: indeed, the *.diffractogram_id data names should point back to the PD_DATA category. This will just make it easier for checking software, I think, as mathematically they all come down to the same thing.
The current parent-child arrangement of these categories was created (by me) as a way to use the parent-child relationships in DDLm to properly fit these categories into a relational scheme, so that data names from apparently different categories could be looped together, as Brian T wanted. The only "wrinkle" is that the PD_DATA category could be completely empty, while the child categories are populated, potentially violating the rule that child data names draw from the values of their parent data names. So the new additional rule is that, if the parent data names are completely missing, then their possible values are drawn from the union of their child data name values. This rule was created especially for pdCIF, and so far hasn't been used anywhere else, but the situation that Brian describes could, in principle, arise again.
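The fallback rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the dictionary layout are hypothetical, keys are modelled as (point_id, diffractogram_id) tuples, and an empty parent list stands in for "parent data names completely missing".

```python
def allowed_key_values(parent_keys, child_keys_by_category):
    """Return the set of key values that child data names may use.

    If the parent loop supplies keys, they are authoritative. If the
    parent data names are completely missing (modelled here as an empty
    list), fall back to the union of the keys found in the child loops.
    """
    if parent_keys:
        return set(parent_keys)
    allowed = set()
    for keys in child_keys_by_category.values():
        allowed.update(keys)
    return allowed


def dangling_keys(parent_keys, child_keys_by_category):
    """List (category, key) pairs that have no counterpart in the parent."""
    allowed = allowed_key_values(parent_keys, child_keys_by_category)
    return [
        (category, key)
        for category, keys in child_keys_by_category.items()
        for key in keys
        if key not in allowed
    ]


# Hypothetical child loops, keyed by (point_id, diffractogram_id).
children = {
    "pd_calc": [("1", "DIFFRACT"), ("2", "DIFFRACT")],
    "pd_meas": [("1", "DIFFRACT"), ("2", "DIFFRACT"), ("3", "DIFFRACT")],
}

# PD_DATA absent: every child key is acceptable by construction.
no_parent = dangling_keys([], children)

# PD_DATA present but lacking point 3: the pd_meas row is flagged.
with_parent = dangling_keys([("1", "DIFFRACT"), ("2", "DIFFRACT")], children)
```

With the parent loop absent nothing can be flagged, which is exactly why the rule avoids a violation when PD_DATA is empty.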
Ok, I'll create a PR for that.
I wonder if there won't be a slight problem here due to people incorrectly assuming that the *.point_id values can be independently assigned in each category loop, while in reality they share the same namespace. What I mean is that people will naturally want to number points sequentially from 1 in each category loop, and in this way may inadvertently merge properties from several points into one.
Consider the following example in which the values from separate categories get incorrectly merged due to reused point IDs:
loop_
_pd_calc.point_id
_pd_calc.diffractogram_id
_pd_calc.item_1
_pd_calc.item_2
1 DIFFRACT 5 7
2 DIFFRACT 6 3
# ...
loop_
_pd_meas.point_id
_pd_meas.diffractogram_id
_pd_meas.item_a
1 DIFFRACT a
2 DIFFRACT a
3 DIFFRACT c
# ...
loop_
_pd_proc.point_id
_pd_proc.diffractogram_id
_pd_proc.item_x
1 DIFFRACT x
2 DIFFRACT y
3 DIFFRACT z
# ...
Joint PD_DATA loop:
loop_
_pd_data.point_id
_pd_data.diffractogram_id
_pd_calc.item_1
_pd_calc.item_2
_pd_meas.item_a
_pd_proc.item_x
1 DIFFRACT 5 7 a x
2 DIFFRACT 6 3 a y
3 DIFFRACT ? ? c z
I am not saying that anything should be redesigned here, but maybe a disclaimer of some sort on the shared point identifier namespace should be added?
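To make the merging behaviour concrete, here is a minimal Python sketch of how software might combine the three loops above by their shared key. The function name and the row layout are assumptions made for illustration; only the data names and values come from the example.

```python
def merge_loops(loops):
    """Join category loops on the shared (point_id, diffractogram_id) key.

    Rows from different categories that carry the same key end up in
    the same merged entry, whether or not the author intended that.
    """
    merged = {}
    for category, rows in loops.items():
        for row in rows:
            key = (row["point_id"], row["diffractogram_id"])
            target = merged.setdefault(key, {})
            for name, value in row.items():
                if name not in ("point_id", "diffractogram_id"):
                    target[f"_{category}.{name}"] = value
    return merged


def row(point_id, **items):
    """Build one loop row for the single diffractogram in the example."""
    return {"point_id": point_id, "diffractogram_id": "DIFFRACT", **items}


loops = {
    "pd_calc": [row("1", item_1="5", item_2="7"),
                row("2", item_1="6", item_2="3")],
    "pd_meas": [row("1", item_a="a"), row("2", item_a="a"),
                row("3", item_a="c")],
    "pd_proc": [row("1", item_x="x"), row("2", item_x="y"),
                row("3", item_x="z")],
}

merged = merge_loops(loops)
point_1 = merged[("1", "DIFFRACT")]
point_3 = merged[("3", "DIFFRACT")]
```

Point 1 collects values from all three categories, while point 3 has no calculated items, which is what the `?` entries in the joint loop represent.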
I wonder if there won't be a slight problem here due to people incorrectly assuming that the *.point_id values can be independently assigned in each category loop while in reality they share the same namespace.
Yes. I think the NiSi example cif does this, https://web.archive.org/web/20131219003229/http://www.ccp14.ac.uk/ccp/ccp14/ftp-mirror/briantoby/pub/cryst/cif/NISI.cif
There isn't a one-to-one correspondence in point_id, and it isn't just an offset; you can see that the last peak in both datasets corresponds, and so on, just not linearly with point_id.
I think this ties in with a previous correspondence I had with James re the number of permissible PD_DATA loops in a block, which I think is summarised here.
In particular, how can we deal with measured and processed diffractograms where there isn't a one-to-one correspondence of data points? If we're averaging, smoothing, splining, or otherwise altering data points (i.e. changing the data points such that there isn't a one-to-one correspondence), can they be considered to be the same diffractogram (as in, have identical _pd_diffractogram.id values)? Merely subsetting a measured dataset to get a processed one is already taken care of with '.'.
Maybe there could be a _pd_proc_overall.source_diffractogram_id to point back to the original, measured diffractogram:
save_pd_proc_overall.source_diffractogram_id
_definition.id '_pd_proc_overall.source_diffractogram_id'
_definition.update 2023-06-24
_description.text
;
The original diffractogram (see _pd_diffractogram.id) from which the
current processed data were taken.
This is used to refer to the original, measured data from which the
current data has been created. It is to be used when there is not a
one-to-one correspondence between data points.
Detail the processing steps utilised in _pd_proc.info_special_details.
;
_name.category_id pd_proc_overall
_name.object_id source_diffractogram_id
_type.purpose Encode
_type.source Related
_type.container Single # Maybe List to have more than one diff_id?
_type.contents Text
save_
This only allows a proc dataset to be derived from a single meas dataset, whereas in practice it could be more, but it is a starting point. (Maybe it could have a List of diff_id values.)
This is just saying that there are no calc items for that data point, which may be entirely legit; consider an excluded region. Strictly, the excluded data points should be given a '.' value in that case.
loop_
_pd_data.point_id
_pd_data.diffractogram_id
_pd_calc.item_1
_pd_calc.item_2
_pd_meas.item_a
_pd_proc.item_x
1 DIFFRACT 5 7 a x
2 DIFFRACT 6 3 a y
3 DIFFRACT ? ? c z
But yes, your concern is definitely a legitimate one and one that I've seen in the wild.
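As a rough illustration of the '?' versus '.' distinction mentioned above, the following Python sketch fills the missing columns of a joint row. It assumes the writer already knows which points lie in an excluded region; the function name and the `excluded` set are hypothetical.

```python
UNKNOWN = "?"       # CIF placeholder: value is not known
INAPPLICABLE = "."  # CIF placeholder: value is not applicable


def joint_row(key, merged, columns, excluded):
    """Return the values for one joint-loop row, filling any gaps.

    Missing items are written as '.' when the point is in an excluded
    region (assumed to be known to the caller) and as '?' otherwise.
    """
    values = merged.get(key, {})
    filler = INAPPLICABLE if key in excluded else UNKNOWN
    return [values.get(column, filler) for column in columns]


columns = ["_pd_calc.item_1", "_pd_calc.item_2",
           "_pd_meas.item_a", "_pd_proc.item_x"]
merged = {("3", "DIFFRACT"): {"_pd_meas.item_a": "c", "_pd_proc.item_x": "z"}}
key = ("3", "DIFFRACT")

as_unknown = joint_row(key, merged, columns, excluded=set())
as_excluded = joint_row(key, merged, columns, excluded={key})
```

Here `as_unknown` reproduces the row shown in the joint loop, and `as_excluded` is the stricter form for a point in an excluded region.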
I wonder if there won't be a slight problem here due to people incorrectly assuming that the *.point_id values can be independently assigned in each category loop while in reality they share the same namespace.
Maybe add some text to the descriptions of _pd_calc|meas|proc.point_id?
The current descriptions are (essentially):
Arbitrary label identifying a calculated|measured|processed data point. Used to identify a specific entry in a list of values forming the calculated|measured|processed diffractogram. The role of this identifier may be adopted by _pd_data.point_id if measured, processed, and/or calculated intensity values are combined in a single list.
Could be changed to something like:
Arbitrary label identifying a calculated|measured|processed data point. Used to identify a specific entry in a loop of values forming the calculated|measured|processed diffractogram. Note that identical values of _pd_calc.point_id, _pd_meas.point_id, and _pd_proc.point_id refer to the same point, and thus provide a way of indicating that points in disparate loops are equivalent. The role of this identifier may be adopted by _pd_data.point_id if measured, processed, and/or calculated intensity values are combined in a single loop.
For reference, the description of _pd_calc_component.point_id (in a different category) is:
Arbitrary label identifying a calculated component data point. Used to identify a specific entry in a list of values forming the calculated component diffractogram.
The value of _pd_calc_component.point_id must be the same as the value of _pd_data.point_id given to the equivalent data point in the measured/processed/calculated diffractogram to which this component belongs.
This is a bit of a technical question related to the parent-child relationships between looped categories.
The looped PD_DATA category is intended to function as "a 'container' category that is defined in order to allow raw, processed, and calculated data points in a diffraction data set to be optionally tabulated together". This is reflected by the fact that the looped PD_CALC, PD_MEAS and PD_PROC categories have it as their parent category. Now I have the following questions:
1. … PD_CALC, PD_MEAS and PD_PROC. Is this the intention or could this lead to some data anomalies?
2. … [_cat.point_id, _cat.diffractogram_id]. All _cat.point_id data items (e.g. _pd_calc.point_id) are properly directly linked to the key of the parent category (_pd_data.point_id). However, all _cat.diffractogram_id items, including the one from the parent PD_DATA category, are linked to _pd_diffractogram.id. Strictly from a formal point of view -- is this allowed (i.e. software should figure this out) or should _pd_calc.diffractogram_id, _pd_meas.diffractogram_id and _pd_proc.diffractogram_id actually be linked to _pd_data.diffractogram_id (and thus only transitively to _pd_diffractogram.id via _pd_data.diffractogram_id)?
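The two linking arrangements in question (2) can be compared with a small Python sketch. The link tables below are a hypothetical in-memory representation of which data name each item points to; they are not taken from the dictionary files, and only the data names come from this thread.

```python
def ultimate_parent(name, links):
    """Follow link targets from a data name until an unlinked name is reached."""
    seen = set()
    while name in links:
        if name in seen:
            raise ValueError(f"cyclic link involving {name}")
        seen.add(name)
        name = links[name]
    return name


CHILDREN = ["_pd_calc.diffractogram_id",
            "_pd_meas.diffractogram_id",
            "_pd_proc.diffractogram_id"]

# Arrangement described in the question: every item links straight
# to _pd_diffractogram.id.
direct_links = {name: "_pd_diffractogram.id" for name in CHILDREN}
direct_links["_pd_data.diffractogram_id"] = "_pd_diffractogram.id"

# Alternative from question (2): children link to the PD_DATA key,
# which in turn links to _pd_diffractogram.id.
via_parent_links = {name: "_pd_data.diffractogram_id" for name in CHILDREN}
via_parent_links["_pd_data.diffractogram_id"] = "_pd_diffractogram.id"

resolved_direct = {n: ultimate_parent(n, direct_links) for n in CHILDREN}
resolved_via_parent = {n: ultimate_parent(n, via_parent_links) for n in CHILDREN}
```

Both arrangements resolve every child item to _pd_diffractogram.id; the difference is only whether checking software has to infer the relationship to PD_DATA or can read it directly from the link.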