GSIM Structure Group definition / explanatory text update

InKyungChoi commented 2 years ago

Please see this Google Doc for the feedback from the Metadata Glossary team. I would like to draw attention to:

Data Resource: I propose to delete
- "organized" (because there is nothing to show that it is "organized"; compare with the definition of Data Set, "organized collection of data", which makes sense in that case because it has a Data Structure)
- "Data Resources are.." (because they don't have to be "used by stat activities" nor "for production of information". How about just adding examples like "Example: collection of labour survey data from 2020 to 2021"?)

Object	Definition	Explanatory Text
Data Resource	~~organized~~ collection of stored information made of one or more Data Sets.	~~Data Resources are collections of data that are used by a statistical activity to produce information.~~ Data Resource is a specialization of an Information Resource.

Dimensional Data Point: see this feedback from metadata glossary team
Dimensional Data Set and Unit Data Set: Need examples
Information Resource: I propose to delete
- "organized" (for the same reason as 1)
- "statistical content" (does it have to be "statistical"?)

Object	Definition	Explanatory Text
Information Set	~~organized~~ collections of ~~statistical content~~information	Statistical organizations collect, process, analyze and disseminate Information Sets, which contain data (Data Sets), referential metadata (Referential Metadata Sets), or potentially other types of statistical content, which could be included in additional types of Information Set.

Logical Record: I am very confused with the definition...

Object	Definition	Explanatory Text
Logical Record	Describes a type of Unit Data Record for one Unit Type within a Unit Data Set.	Examples: household, person, or dwelling record.

New definitions and explanatory texts are proposed for all referential metadata-related classes here: https://github.com/UNECE/GSIMRevision/issues/6
Unit Data Point: see this feedback from the Metadata Glossary team

FrancineK commented 2 years ago

Just a remark concerning "organized": I would not delete it altogether. This page: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210000101&request_locale=en, is an information set which has a data set and a referential metadata set. It is organized.

InKyungChoi commented 2 years ago

(from Oct 26 meeting notes #30)

Dimensional Data Point and Unit Data Point (see this google doc)

Action: @FlavioRizzolo @dgillman4909 to review the explanatory text and update (for Dimensional Data Point, the second sentence “there may be multiple values…” is not really needed here; the third sentence “the different values represent…” is also not necessary)

InKyungChoi commented 2 years ago

Regarding examples of Dimensional Data Set and Unit Data Set:

I suggest

Example of Unit Data Set**: a collection of Unit Records (1212123, 48, American, United Kingdom), (1212111, 38, Hungarian, United Kingdom), (1212317, 51, Canadian, Mexico) for three people where each record has the social security number, age, citizenship and the country of birth.
Example of Dimensional Data Set: a collection of dimensional data (Mexico, 130.3), (United Kingdom, 67.3), (Italy, 59.1) where the first item specifies the name of the country and the second item specifies the population in millions.

** I followed the example of Unit Data Record, which is "For example (1212123, 48, American, United Kingdom) specifies the age (48) in years, the current citizenship (American), and the country of birth (United Kingdom) for a person with social security number 1212123". But a Data Set is actually an aggregation of Data Points, which are just placeholders (cells), not the actual data in them. Should we then add in the explanatory text that the examples are instantiated versions of a Data Set?

(@FrancineK, somehow I don't see the examples in the Specification document)

FlavioRizzolo commented 1 year ago

A couple of proposals here with slightly updated definitions and new explanatory texts (to be completed):

Object | Definition | Explanatory Text
-- | -- | --
Data Point | Container for a single value of an Instance Variable | A Data Point is a cell or a placeholder for a value (datum) it may contain (note that a data point could be empty).

Object | Definition | Explanatory Text
-- | -- | --
Dimensional Data Point | Container for a single value of an Instance Variable partially identified by a set of dimensions | A Dimensional Data Point is uniquely identified by the combination of exactly one value for each of the dimensions (represented as Identifier Components) and exactly one measure (represented as Measure Component) or descriptive attribute (represented as Attribute Component). A Dimensional Data Point could contain a value about a Unit or a Population. The Unit might be de-identified, in which case no link to the Unit itself can be directly established.

Object | Definition | Explanatory Text
-- | -- | --
Unit Data Point | Container for a single value of an Instance Variable about a Unit | A Unit Data Point is uniquely identified by the combination of exactly one value from each Identifier Component. The Unit might be de-identified, in which case no link to the Unit itself can be directly established.

Now, this note about de-identified information means that the Unit associated with the Data Point is not required, so we need to change the cardinality to 0..1. The constraint associated with the Dimensional Data Point stating that the relationship to either Unit or Population must exist needs to be deleted, since neither is required in the de-identified scenario. (BTW, that constraint can only be found in the full Structures Group diagram; we need to move those types of constraints into the text, perhaps the explanatory notes.)
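To make the 0..1 cardinality concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the class and field names (`Unit`, `DataPoint`, `is_linked_to_unit`) are hypothetical and are not taken from the GSIM specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Unit:
    """Hypothetical stand-in for a GSIM Unit (e.g. a person)."""
    identifier: str


@dataclass
class DataPoint:
    """Container for a single value of an Instance Variable."""
    instance_variable: str          # variable name only, for brevity
    value: Optional[object] = None  # a Data Point may be empty
    unit: Optional[Unit] = None     # 0..1: absent when de-identified

    def is_empty(self) -> bool:
        return self.value is None

    def is_linked_to_unit(self) -> bool:
        return self.unit is not None


# An identified Data Point: the Unit can be reached directly.
identified = DataPoint("age", 48, Unit("1212123"))

# A de-identified Data Point: the value is kept, the link is not.
de_identified = DataPoint("age", 48)

# An empty placeholder: neither a value nor a Unit is present yet.
placeholder = DataPoint("age")
```

With the Unit optional, a "Unit or Population must exist" constraint could no longer be enforced at construction time, which is the reason given above for deleting it.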

FlavioRizzolo commented 1 year ago

We need to revise the associations between Data Point and Instance Variable (diagram: https://user-images.githubusercontent.com/11507347/201910583-042493c0-55f6-44fb-bf69-c5d84a7c7ed5.png):

I propose to replace them with just one called "is described by", since the current ones are redundant at best and wrong at worst. There is no need to qualify the association with "identifier", "attribute" or "measure", since that is taken care of elsewhere: the Instance Variable is associated with some Represented Variable, which in turn plays the role of an Identifier, Attribute or Measure Component in a particular Data Structure. By having a qualified association directly to the Instance Variable we bypass the Data Structure and therefore fix a given Instance Variable into a specific role (identifier, attribute or measure). That defeats the purpose of having a separate Data Structure that captures those semantics and provides the flexibility to change them when necessary.
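The argument can be sketched in Python as follows. This is an illustrative sketch only: the class names mirror the GSIM terms used above, but the fields, the `role_of` helper and the two example structures are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Role(Enum):
    IDENTIFIER = "identifier"
    MEASURE = "measure"
    ATTRIBUTE = "attribute"


@dataclass(frozen=True)
class RepresentedVariable:
    name: str


@dataclass(frozen=True)
class InstanceVariable:
    name: str
    represented_variable: RepresentedVariable


@dataclass
class DataStructure:
    """Assigns a component role to each Represented Variable it uses."""
    name: str
    roles: Dict[RepresentedVariable, Role] = field(default_factory=dict)

    def role_of(self, variable: InstanceVariable) -> Role:
        return self.roles[variable.represented_variable]


@dataclass
class DataPoint:
    # A single, unqualified association; no role is stored on the Data Point.
    instance_variable: InstanceVariable


country = RepresentedVariable("country")
country_of_birth = InstanceVariable("country_of_birth", country)

# The same variable is an identifier in one structure and an attribute in another.
population_table = DataStructure("population_by_country", {country: Role.IDENTIFIER})
person_file = DataStructure("person_records", {country: Role.ATTRIBUTE})

point = DataPoint(country_of_birth)
```

Because the role is looked up through the Data Structure, the same Data Point definition can be reused when a variable changes role, which a role-qualified association would prevent.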

dgillman4909 commented 1 year ago

Flavio,

I am OK with creating just one relationship. But shouldn’t the verb be “populates” rather than “describes”? It is through the IV that a Data Point is given some data. The IV doesn’t describe a Data Point. One Data Point, in a reusable structure, can have many IVs associated with it. If we go with “describes”, this will mean the description of a Data Point changes over time or across collections. I suppose there is an argument in favor of that, but why open ourselves up to the need to explain this? It seems to me “populates” expresses what is happening more directly. And I am open to suggestions for verbs other than “populates”, as long as a similar idea is being expressed.

Dan


FlavioRizzolo commented 1 year ago

I think you are right, Dan, "populates" is better.

dgillman4909 commented 1 year ago

Flavio,

I like the definition for Data Point. But, I am very much against the idea of using dimensions as a way to identify a cell in an n-cube. The combination of names of the categories, one from each dimension, that names a cell should not be used to identify one. Names can change, even though the underlying structure remains. The combination of the meaning of the categories, one from each dimension, IS part of the meaning of each cell.

Think of the difference between a code in NAICS versus the name of each category. The codes are identifying. The names help convey the meaning. The same should happen in multi-dimensional cells.

So, the definitions of Dimensional Data Point and Unit Data Point need to change. The main issue as I see it is whether multi-dimensional data can be represented in wide, long, or key-value structures. I think they can. I am afraid we are mixing a couple of structural issues together. One is the way variables are mapped to a structure. The other is the usual way multi-dimensional versus unit record data are laid out. We need to model the first. The second is a choice for the user.

It seems to me the details of multi-dimensionality are the problem. The n-cube structure is a natural way to handle those data, but it is not the only way. N-cubes must have Dimensional Data Points. One doesn’t use an n-cube for organizing unit record data, I think. However, you can organize multi-dimensional data in any of the other structures if you want to.
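As an illustration of that last point, the following Python sketch lays out the same small set of multi-dimensional cells in long, key-value and wide form. The three entries are the population figures proposed as an example elsewhere in this thread; the function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Long layout: one row per cell (country, year, population in millions).
long_rows: List[Tuple[str, int, float]] = [
    ("Mexico", 2021, 130.3),
    ("United Kingdom", 2021, 67.33),
    ("Italy", 2022, 60.24),
]


def to_key_value(rows) -> Dict[Tuple[str, int], float]:
    """Key-value layout: the dimension values form the key of each cell."""
    return {(country, year): value for country, year, value in rows}


def to_wide(rows) -> Dict[str, Dict[int, Optional[float]]]:
    """Wide layout: one row per country, one column per year."""
    cells = to_key_value(rows)
    years = sorted({year for _, year, _ in rows})
    countries = sorted({country for country, _, _ in rows})
    return {
        country: {year: cells.get((country, year)) for year in years}
        for country in countries
    }


def wide_to_long(wide) -> List[Tuple[str, int, float]]:
    """Back to the long layout, skipping cells that hold no value."""
    return sorted(
        (country, year, value)
        for country, by_year in wide.items()
        for year, value in by_year.items()
        if value is not None
    )


key_value = to_key_value(long_rows)
wide = to_wide(long_rows)
```

The round trip loses nothing, which suggests the layout is a presentation choice rather than a property of the data. The wide form also surfaces empty cells (for example Mexico in 2022) that the long form simply omits.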

Therefore, it seems to me that having an explicit difference (any versus multi-dimensional) is forcing us into a corner. The BLS model is explicitly about the semantics of the data, not the structure. The structure follows.

I wish I could be more precise and present a solution.

Dan


FlavioRizzolo commented 1 year ago

I couldn't agree more, Dan. I've struggled with this since the notion was introduced almost 10 years ago, and couldn't come up with a good definition in all this time.

I think you hinted at what the problem is: this is an artificial distinction. I don't see any reason to have a distinction between Unit and Dimensional Data Point. Why not just remove those subclasses entirely and keep only Data Point? Let's think about that option.

InKyungChoi commented 1 year ago

(from Nov 16 meeting notes https://github.com/UNECE/GSIMRevision/discussions/33) Here is the updated version

@FlavioRizzolo - would you want to update Data Point explanatory text using explanatory texts from Unit Data Point and Dimensional Data Point?

If we indeed remove Unit Data Set and Dimensional Data Set:

1) I would like to propose the following definition and explanatory text for Data Set

definition: organised collection of data (no change)
explanatory text: Examples of Data Sets could be observation registers, time series, longitudinal data, survey data, rectangular data sets, event-history data, tables, data tables, cubes, registers, hypercubes, and matrixes. A broader term for Data Set could be data. A narrower term for Data Set could be data element, data record, cell, field. Data Set can be Unit Data Set or Dimensional Data Set (and perhaps use examples here)?

2) Add a new attribute "Type" with controlled vocabulary (Unit Data Set, Dimensional Data Set)

3) There are two attributes under Dimensional Data Set (Reporting Begin/End), I don't understand why these attributes are here.... if no one disagrees, we can drop these.
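Proposal 2 could look roughly like the Python sketch below. It assumes the controlled vocabulary contains exactly the two labels named above; the class, attribute and helper names (`DataSetType`, `data_set_type`, `is_valid_label`) are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class DataSetType(Enum):
    """Controlled vocabulary for the proposed "Type" attribute."""
    UNIT = "Unit Data Set"
    DIMENSIONAL = "Dimensional Data Set"


@dataclass
class DataSet:
    name: str
    data_set_type: DataSetType
    records: List[Tuple] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject values that are not drawn from the controlled vocabulary.
        if not isinstance(self.data_set_type, DataSetType):
            raise ValueError(f"unknown Data Set type: {self.data_set_type!r}")


def is_valid_label(label: str) -> bool:
    """True if the label belongs to the controlled vocabulary."""
    return label in {member.value for member in DataSetType}


people = DataSet(
    "people",
    DataSetType.UNIT,
    [(1212123, 48, "American", "United Kingdom")],
)

# Looking up a member by its label; an unknown label raises ValueError.
parsed = DataSetType("Dimensional Data Set")
```

Replacing the two subclasses with an attribute keeps the distinction available for documentation purposes without building it into the class hierarchy.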

FlavioRizzolo commented 1 year ago

I agree with 2, 3 and the definition in 1. In the explanatory text, I don't understand the second and third sentences around "broader" and "narrower"; I think they are confusing and might be a legacy from a long time ago. For narrower, we could say that data sets could be further partitioned/organized into data elements, data records, cells, fields, etc., but I'm not even sure that's necessary given that those artifacts are not in the model.

I propose the following:

Explanatory text: Data Sets could be used to organize a wide variety of content, including observation registers, time series, longitudinal data, survey data, rectangular data sets, event-history data, tables, data tables, registers, data cubes, data warehouses/marts and matrixes. An example of a population unit Data Set (microdata) could be a collection of three Data Records (1212123, 48, American, United Kingdom), (1212111, 38, Hungarian, United Kingdom), and (1212317, 51, Canadian, Mexico), each containing the social security number, age, citizenship and country of birth of an individual. An example of a population dimensional Data Set (aggregate) could be a collection of three entries (Mexico, 2021, 130.3), (United Kingdom, 2021, 67.33), and (Italy, 2022, 60.24), each containing the name of the country, year of interest and population of the country in millions.

FlavioRizzolo commented 1 year ago

The way Logical Record is defined, and linked in the model, applies only to unit data. In that case either (i) Data Record is only about units, and therefore we need another association directly from Data Set to Data Point for dimensional data, or (ii) Data Record is about both unit and dimensional data, and therefore "isStructuredBy" Logical Record is optional (since it only applies to unit data).

InKyungChoi commented 1 year ago

(from Nov 16 meeting notes https://github.com/UNECE/GSIMRevision/discussions/34) Here is the updated version

FlavioRizzolo commented 1 year ago

To be modelled in EA

InKyungChoi commented 1 year ago

Final question before finalisation: I think we are still missing the definition and explanatory text of "Data Record" (previously "Unit Data Record") - how about this (adapted from the original Unit Data Record)?

Unit Data Record (in GSIM v1.2)
Object	Group	Definition	Explanatory Text
Unit Data Record	Structures	Contains the specific values (as a collection of Unit Data Points) related to a given Unit as defined in a Logical Record.	For example (1212123, 48, American, United Kingdom) specifies the age (48) in years on the 1st of January 2012, the current citizenship (American), and the country of birth (United Kingdom) for a person with social security number 1212123.

Data Record (in GSIM v2.0)
Object	Group	Definition	Explanatory Text
~~Unit~~ Data Record	Structures	container for the specific values (as a collection of ~~Unit~~ Data Points) related to a given Unit or Population ~~as defined in a Logical Record.~~	For example (1212123, 48, American, United Kingdom) specifies the age (48) in years on the 1st of January 2012, the current citizenship (American), and the country of birth (United Kingdom) for a person with social security number 1212123. For the case of unit data, it is structured by Logical Record.

dgillman4909 commented 1 year ago

Flavio -

Going back to DataPoint versus UnitDP and DimensionalDP, I agree having just the one class, DataPoint, is the right way to go. In a comment above, I said that NCubes only take DimensionalDataPoints, and I'd like to retract that. I am working with a longitudinal survey at US BLS, and there's an application for NCubes to account for variables that repeat within each wave, where the set repeats from wave to wave. The "measure" in this case is the variable, and the dimensions are time (the waves) and the number of repetitions that variable can take. In the National Longitudinal Survey at BLS, multiple variables exist to account for each job a person can hold at once. NLS allows for 9 jobs and the 1997 cohort is now up to 19 waves. This is 181 variables of each kind, a monstrosity to say the least. An NCube (defined as we do in DDI-CDI, which can take more than one measure for each cell) allows us to account for all of them.
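A rough Python sketch of that idea follows. It is not the DDI-CDI NCube definition, only an assumption-laden illustration: the measure names, the `NCube` class and its methods are hypothetical, and the dimension sizes are kept small for readability.

```python
from typing import Dict, Optional, Tuple

# Hypothetical measures recorded for every job in every wave.
MEASURES = ("employer_id", "hours_per_week")

Cell = Dict[str, Optional[object]]


class NCube:
    """Cells keyed by (wave, job_number); each cell holds several measures."""

    def __init__(self, waves: int, jobs_per_wave: int) -> None:
        self.waves = waves
        self.jobs_per_wave = jobs_per_wave
        self.cells: Dict[Tuple[int, int], Cell] = {}

    def put(self, wave: int, job: int, measure: str, value: object) -> None:
        # Both dimension values must fall inside the cube.
        if not (1 <= wave <= self.waves and 1 <= job <= self.jobs_per_wave):
            raise KeyError((wave, job))
        if measure not in MEASURES:
            raise KeyError(measure)
        # A new cell starts with every measure empty (None).
        cell = self.cells.setdefault((wave, job), dict.fromkeys(MEASURES))
        cell[measure] = value

    def get(self, wave: int, job: int, measure: str) -> Optional[object]:
        return self.cells.get((wave, job), {}).get(measure)


cube = NCube(waves=3, jobs_per_wave=2)
cube.put(1, 2, "hours_per_week", 20)
```

One variable definition per measure then covers every wave and job slot, instead of a separately named variable for each combination.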

InKyungChoi commented 1 year ago

Decision made regarding the Data Record (see #43):

Action: to use “collection of Data Points related to a given Unit or Population” for the definition, to use “….(as proposed in https://github.com/UNECE/GSIMRevision/issues/28#issuecomment-1478044278l).. For the case of unit data, it can be structured by Logical Record” for the explanatory text
