frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense

Add pattern supporting use of value labels, categoricals and factors #844

Closed pschumm closed 10 months ago

pschumm commented 1 year ago

This pull request adds a pattern for supporting the use of value labels and categoricals (sometimes called factors), as requested by @rufuspollock here. Comments and feedback are welcome.

rufuspollock commented 1 year ago

Great to see the thorough review and looking forward to seeing this in - deferring to @roll on that (@roll i leave it to you when you want to merge).

pschumm commented 1 year ago

Great to see the thorough review and looking forward to seeing this in - deferring to @roll on that (@roll i leave it to you when you want to merge).

Thanks; I'll make the changes to resolve the final issue raised by @peterdesmet above, and then you can merge anytime.

roll commented 1 year ago

Thanks a lot, everyone! Just let me know when it's ready. It's really exciting to see this collaborative effort happening :heart:

pschumm commented 1 year ago

Thanks a lot, everyone! Just let me know when it's ready. It's really exciting to see this collaborative effort happening ❤️

Ok @roll, I've completed all of the suggested changes and I think this is now ready to merge. Thanks!

roll commented 1 year ago

Thanks a lot, everyone! Just let me know when it's ready. It's really exciting to see this collaborative effort happening ❤️

Ok @roll, I've completed all of the suggested changes and I think this is now ready to merge. Thanks!

Thanks! Let's wait a few days for more comments and merge

peterdesmet commented 1 year ago

I'm just realizing that the use of enum (and thus enumLabels and enumOrdered) is not necessarily restricted to Table Schema, but could as well be used in e.g. a profile (see example). Should this pattern indicate that?

pschumm commented 1 year ago

I'm just realizing that the use of enum (and thus enumLabels and enumOrdered) [...] could as well be used in e.g. a profile [...]. Should this pattern indicate that?

I must admit that my thinking on this had been restricted to use when specifying statistical models, so while I see your point I can't immediately think of a use case. If you'd like to suggest some text, I'd be glad to add it to the pattern.

peterdesmet commented 1 year ago

@pschumm I have included a suggestion for the use in profiles at https://github.com/frictionlessdata/specs/pull/844#discussion_r1337356052. Since I don't have edit rights to this branch, I had to make suggestions via a review. I also suggested some other minor changes and corrections to the text.

khusmann commented 1 year ago

Hi all, I'm new to this party but wanted to add a +1 to this direction y'all are going. I recently stumbled on frictionless and have been considering / wanting to use the standard in some of the social science & education research data collection efforts I'm a part of, but was seeing the lack of explicit support for value labels as a big barrier. So I'm excited to see the momentum in this thread. Thank you all for your work on this!

I want to add a couple of considerations here from a social science and edu research perspective that I haven’t seen mentioned yet (but I’m new & still catching up, so sorry if this has already been discussed!)

Consideration 1: I think one of the big reasons platforms like REDCap, SAS, SPSS et al. prefer data in encoded form is that the numeric values of the ordinal scales are often substantively meaningful, especially for ordinal items designed to be combined into composite measures. In these cases, knowing only the ordering of the item labels does not “provide all of the information necessary” to use the item in practice.

For example, say we have the Likert scale physical_health as described in the proposal with levels "Poor", "Fair", "Good", "Very Good" and "Excellent", along with related variables mental_health and social_health rated on the same Likert scale. If we are interested in a composite measure called health, defined as the average of these three vars, we need the numeric mappings to calculate it. And now we have ambiguity: Likert scales are not always indexed starting at 1... a scale for "exercise_frequency", for example, ranging from "Never" to "Every day", may begin with 0. Also, in general, the value mappings aren't necessarily evenly spaced… sometimes they are chosen to reflect a substantive non-linearity in levels when converted to numeric values (although I don't see this as often in the wild).

This means on a day-to-day practical level the categorical/factor implementations in pandas & R end up being pretty awkward to use, because it’s easy to lose the scale info. So I see social/edu researchers using these features a lot less than one might expect, in favor of keeping all their ordinal values numeric. But then, of course, you get a bunch of magic numbers in your code instead of labels, which isn’t great either… Some of this can be helped with the R labelled package, which gives you value labels a la SAS/SPSS/Stata, but most often I see people just making do with numeric types.

In my work, I’ve found it useful to represent schema definitions of enum levels as triplets: unique labels (str), with an associated value (int), and text (str). For example:

POOR:
    value: 0
    text: "I am feeling poor today"
FAIR:
    value: 1
    text: "I am feeling fair today"
GOOD:
    value: 2
    text: "I am feeling good today"
VERY_GOOD:
    value: 3
    text: "I am feeling very good today"
EXCELLENT:
    value: 4
    text: "I am feeling excellent today"

This gives me the most flexibility. I can convert to a values representation when I'm calculating a composite measure, I can use the label representation as a unique human-readable identifier in scripts (e.g. filter(physical_health == "VERY_GOOD")), and I can also access the exact wording used by the item in the survey (which can get quite long in some cases). That said, even just the simple value-label map as outlined in the proposal gets me 2/3 of the way there, and for now I can probably manage the text representations independently with custom props in my app… But I wanted to share the representation I'm using just in case it is useful to the current discussion.
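
For instance, a rough sketch (names and data here are purely illustrative) of moving between the three representations with pandas:

import pandas as pd

# Hypothetical triplet definition for physical_health, mirroring the YAML above
levels = {
    "POOR": {"value": 0, "text": "I am feeling poor today"},
    "FAIR": {"value": 1, "text": "I am feeling fair today"},
    "GOOD": {"value": 2, "text": "I am feeling good today"},
    "VERY_GOOD": {"value": 3, "text": "I am feeling very good today"},
    "EXCELLENT": {"value": 4, "text": "I am feeling excellent today"},
}

df = pd.DataFrame({"physical_health": ["GOOD", "VERY_GOOD", "POOR"]})

# Numeric representation, e.g. for computing composite measures
df["physical_health_value"] = df["physical_health"].map(
    {k: v["value"] for k, v in levels.items()}
)

# Exact item wording, e.g. for codebooks or reports
df["physical_health_text"] = df["physical_health"].map(
    {k: v["text"] for k, v in levels.items()}
)

# The label representation stays the human-readable identifier for filtering
very_good = df[df["physical_health"] == "VERY_GOOD"]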

Consideration 2: In a perfect world, I’d want to save / archive / publish ordinal data with their labels rather than numeric values, because they unambiguously reflect which level they represent even if the CSV becomes divorced from the schema, whereas numeric codes are opaque. How would we represent such a column with label->value codings in the current spec? Something like this?

{
  "fields": [
    {
      "name": "physical_health",
      "type": "string",
      "enum": ["Poor", "Fair", "Good", "Very good", "Excellent", "Don't know", "Refused", "Not applicable"],
      "enumOrdered": true,
      "enumLabels": {
        "Poor": "1",
        "Fair": "2",
        "Good": "3",
        "Very good": "4",
        "Excellent": "5",
        "Don't know": ".a",
        "Refused": ".b",
        "Not applicable": ".c"
      }
    }
  ],
  "missingValues": ["Don't know", "Refused", "Not applicable"]
}

I suppose you can tell it’s a label->value map because the core item type is string, so you know to expect mappings to values? ...that feels a little implicit, but I guess it works. (This is one of the reasons I like having the ability to explicitly define keys on enum levels like I described above -- the definition of enumLabels can stay the same, and then you can just indicate if the underlying data is stored as labels, values, or text).

That said, in practice I find I more often produce CSVs with the numeric values instead of labels (as in the current enumLabels spec example) for maximum compatibility in the current software landscape – for example, some of my collaborators use versions of MPlus that cannot handle string labels.

...Anyway, sorry for the long winded comment, but just wanted to chime in as a perspective from the education / social sciences world & let you know I appreciate the movement in this direction, and share some examples of how I would use this extension in my context. Cheers!

pschumm commented 1 year ago

Thanks very much @peterdesmet for the excellent edits—I have made all of them.

pschumm commented 1 year ago

Hi all, I'm new to this party but wanted to add a +1 to this direction y'all are going. I recently stumbled on frictionless and have been considering / wanting to use the standard in some of the social science & education research data collection efforts I'm a part of, but was seeing the lack of explicit support for value labels as a big barrier. So I'm excited to see the momentum in this thread. Thank you all for your work on this!

Great to hear from you @khusmann, and let me say at the outset that I think the concerns you raise are exactly on point. FWIW, in my work I use a combination of two strategies to address these: (a) demonstrate ways to avoid such problems by working differently; and (b) make sure that users have simple options to keep doing things exactly the way they've done before (if they don't want to change). I'll try to allude to these in a few specific comments below.

  1. One of the reasons I avoided including suggested numeric codes in this pattern is that they can be both arbitrary and software specific. For example, I often find it convenient to use codes beginning with 0 for Yes/No items (i.e., treating "No" as 0 since 0 evaluates to False) or for those that have a naturally corresponding category; e.g., in the categories "Never," "Sometimes," "Frequently" and "All the time," the first category ("Never") can be thought of as naturally corresponding to 0. Making judicious use of 0 in such cases can make certain manipulations cleaner and easier. However, in my experience Stata, SAS and SPSS users frequently use codes starting with 1 regardless of the semantic nature of the categories; e.g., Stata users who use the encode command without a pre-constructed value label will automatically get a value label starting with 1. As for being software specific, Stata and SAS users can use extended missing values (.a, .b, .c, etc.) in their codes, while SPSS users are stuck using integers within a designated range (e.g., -97, -98, and -99). Note that even if suggested numeric codes are not part of the official Table Schema, they can always be included as additional custom metadata if someone wants to do this (but see Comment (3) below for an alternative).

  2. Your point on scoring (i.e., summing a set of Likert items whose response categories have been mapped to integers to get an overall score) is a good one, though I would argue that summing items based on the assumption that they are coded a certain way can lead to errors that are then difficult, if not impossible, to catch. It doesn't take much code to be explicit and thereby reduce the likelihood of such errors, e.g.:

    df['cesd'] = (
        df[cesd_items]
        .replace({'Rarely or none of the time': 0,
                  'Some or a little of the time': 1,
                  'Occasionally or a moderate amount of the time': 2,
                  'Most or all of the time': 3})
        .sum(axis=1)
    )

    Of course, how expressive you can be depends on the language you are working in, and for that reason even though I do my analyses in Stata, I tend to do my data manipulation in Python. Still, I would argue that being forced to pay more attention to the actual response categories can cut down on the risk of errors when scoring and analyzing your data.

  3. Finally, rather than add suggested numeric codes to a tabular schema as custom metadata, I tend to store them in an ancillary file, e.g.:

    {
      "value_labels": {
        "spopen": {
          "0": "Never",
          "1": "Hardly ever or rarely",
          "2": "Some of the time",
          "3": "Often",
          ".a": "Refused",
          ".b": "Don't know",
          ".c": "Not applicable"
        }
      },
      "field_to_value_label": {
        "spopen": "spopen",
        "sprely": "spopen",
        "spdemand": "spopen",
      }
    }

    This way, I don't fill up the schema with a lot of information that is only useful to some users, and I can maintain separate files containing codes for Stata/SAS (including extended missing values) and SPSS (strictly integers). Also, I can permit people to modify this ancillary file as desired for specific purposes, without fear of compromising the integrity of the data and their metadata. And when creating a data package for distribution, I use the schema, together with this ancillary file (if it exists), to generate a script (e.g., a do-file for Stata) that may be used by the target software to read and format the data according to the schema, and to create and apply the value labels as specified by the ancillary file. I then distribute this script as part of the data package. This way, a Stata user, say, can make full use of the data without having to learn anything about Frictionless or having to install any special packages. At the same time, you get the benefits of Frictionless, e.g., metadata to make the data FAIR, ability to explore, validate and transform the data using Frictionless tools, data that are usable by the broadest range of users regardless of their software, etc.
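
    As a rough sketch (assuming the ancillary file above is saved as value_labels.json; names are illustrative, not part of any proposal), the label-definition part of such a generated do-file could be produced with something like:

    import json

    # Read the ancillary value-label file and emit Stata label commands
    with open("value_labels.json") as f:
        meta = json.load(f)

    lines = []
    for name, mapping in meta["value_labels"].items():
        pairs = " ".join(f'{code} "{label}"' for code, label in mapping.items())
        lines.append(f"label define {name} {pairs}")
    for field, name in meta["field_to_value_label"].items():
        lines.append(f"label values {field} {name}")

    print("\n".join(lines))
    # label define spopen 0 "Never" 1 "Hardly ever or rarely" ...
    # label values spopen spopen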

khusmann commented 1 year ago

@pschumm Thanks for your thoughtful reply! It sounds like we have very similar approaches & philosophies. In the spirit of sharing different strategies, I want to highlight the similarities and differences of our workflows and needs, with an eye towards future extensions that might subsume our differences.

As you say, for a lot of categorical and ordinal data, the underlying numeric codes do not have special significance. In these cases my approach is pretty much identical to yours, except I encode the levels in my CSV as short “labels”, instead of the level text. Then, in my processing scripts I can convert to values as you describe, but don’t have to use the long item text:

cesd_values = (
    df[cesd_items]
    .replace({'RARELY': 0,
              'SOMETIMES': 1,
              'OCCASIONALLY': 2,
              'MOST_OR_ALWAYS': 3})
)

I like working with these “labels” representing levels rather than level text, because it gives me short identifiers which are easier to read & type in scripts than the entire level text. (It gets really useful in filtering, grouping, selecting & otherwise slicing & dicing the data.) And it also avoids the “summing items based on the assumption that they are coded a certain way” problem, as you describe.

If I ever need the exact item text, I do a similar transformation:

cesd_text = (
    df[cesd_items]
    .replace({'RARELY': 'Rarely or none of the time',
              'SOMETIMES': 'Some or a little of the time',
              'OCCASIONALLY': 'Occasionally or a moderate amount of the time',
              'MOST_OR_ALWAYS': 'Most or all of the time'})
)

Like you said, in a lot of cases, the label -> value mappings are not meaningful. In these cases, I agree, the best practice should be to avoid putting implementation-specific “suggested” values into the schema. (The label->text information, on the other hand, I consider a part of the item definition, and so I include this map with my custom schema props).

In some cases though, I do think label -> value mappings are meaningful, and so I like including a label -> value map in my schema in addition to my standard label -> text map. For example, consider a group of items that can be aggregated to produce scores on scales that have been normed or standardized across a population. In this case, their underlying values are not merely a suggestion, but I would argue become an intrinsic attribute of that item’s scale, in the same way that the level text is an intrinsic attribute of the item’s display. It’s useful having this info in the global schema, because I want all scripts / implementations reading the schema to be sure to translate values using this map in their scoring calculations, and I want these values to be noted in the codebooks I generate with the schema. (Like you, I maintain scripts that convert to SAS / Stata / etc representations and handle the implementation-specific parts.)
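
As a very rough sketch (the field name and the property names levelValues and levelText are hypothetical custom properties, not part of this pattern), such a field definition might look like:

{
  "name": "cesd_1",
  "type": "string",
  "constraints": {
    "enum": ["RARELY", "SOMETIMES", "OCCASIONALLY", "MOST_OR_ALWAYS"]
  },
  "enumOrdered": true,
  "levelValues": {
    "RARELY": 0,
    "SOMETIMES": 1,
    "OCCASIONALLY": 2,
    "MOST_OR_ALWAYS": 3
  },
  "levelText": {
    "RARELY": "Rarely or none of the time",
    "SOMETIMES": "Some or a little of the time",
    "OCCASIONALLY": "Occasionally or a moderate amount of the time",
    "MOST_OR_ALWAYS": "Most or all of the time"
  }
}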

I think the potential meta-pattern here is that, in general, enum levels may have multiple attributes that are relevant for a complete description and use of the data. The abstraction progression I’m seeing is something like this:

1) Factor definitions give enum levels a single attribute: a label.
2) Value-labels give enum levels two attributes: a value and a label.
3) In my work I’m giving enum levels up to three attributes: value, label, and text.
4) It’s possible other types of data could benefit from the flexibility to define more (custom) attributes on enum levels that are relevant to the item’s definition (e.g. an extended description of the level? other representations of the option's text, like with html or rtf formatting? duration of time that particular option was displayed in a task that sequentially displays options at different intervals? alternate scales or units? etc etc)

The possible uses of extended enum level-attributes is something I’m still chewing on, and want to think it through a lot more before proposing anything…I’d be curious to hear more of your thoughts / reactions, if you’re up for it! In the meantime though, the present addition of value-labels in this PR is going to go a long way to represent the data I’m putting together. Thanks again for your work and for this discussion! :)

pschumm commented 12 months ago

Thanks @khusmann for sharing this additional information. I think one key issue here involves distinguishing between which metadata belong in a Table Schema and which belong elsewhere. And fortunately, the ability to use custom schema properties without creating incompatibilities with the standard tools gives users (and vendors) a lot of flexibility to make their own choices.

I would just add two more comments:

  1. I agree with your statement that in some cases the "underlying values are not merely a suggestion, but [...] an intrinsic attribute of that item’s scale" but I believe this may be leading us to separate conclusions. Specifically, I believe they are an attribute of the scale, not the item itself. For example, different scales that use the same item may score it differently. One could imagine implementing (and publishing) a specific scale as a Frictionless Transform that takes as input the textual data and generates a score. In any case, for me this supports the argument that numeric codes don't belong in a Table Schema unless they are merely indicating how the stored data are encoded (i.e., how the response categories are numerically encoded) in the corresponding file.
  2. As you may be aware, there has been considerable discussion about how to handle additional sets of category labels as you describe (e.g., see here). One additional example of this that I haven't seen mentioned involves internationalization (i.e., use of separate sets of labels for different languages). Options that have been suggested for handling multiple sets of category labels include the following:

    • Custom property in schema (as noted above)
    • Use of the rdfType property
    • Foreign key to a separate tabular resource containing one or more additional sets of labels
    • Use of a Frictionless Transform that switches from one set of labels to another

In sum, the purpose of the specific pattern proposed here is simply to include in the standard Table Schema the minimal information necessary to work effectively with categorical data (primarily from an analytic viewpoint), excluding anything that is software-specific. And in this case, a lot is software specific, given the fact that the different analytic packages have such different features and functionality. What would then be nice, I think, would be to create a space where those of us implementing and/or working with specific software can share and discuss the features of those software implementations.

khusmann commented 12 months ago

Thanks @pschumm for your additional comments. It sounds like we’re actually very much on the same page regarding big-picture direction here. As you say, it looks like the minor differences in perspectives relate to how to parsimoniously (but still flexibly & inclusively) define the separation between what is software specific and what should be natively understood / archived by the schema definition.

What would then be nice, I think, would be to create a space where those of us implementing and/or working with specific software can share and discuss the features of those software implementations.

I wholeheartedly agree! We’re wading into territory beyond this specific PR, and I think rather than responding to your points above in this thread, it’d be nice to continue this conversation in a space more tailored for it.

I’m new to the frictionless scene, so do you know of spaces that already exist that would be a good fit for continuing this conversation in? If not, would you potentially be interested in co-organizing something with me? Issue threads are good, but easy to lose momentum in… It’d be nice to have a meeting every once in a while to discuss bigger picture ideas (e.g. patterns for handling “scales” across different software implementations) to help strategise / structure collaboration efforts as we build features in these directions on our own projects that may have application to the wider community.

…and speaking of collaboration in this direction, I’m happy to put together a PR implementing this extension in the frictionless-r package, if nobody else has started on this yet!

peterdesmet commented 12 months ago

…and speaking of collaboration in this direction, I’m happy to put together a PR implementing https://github.com/frictionlessdata/frictionless-r/issues/148, if nobody else has started on this yet!

I'm the maintainer of that R package and that sounds excellent @khusmann! We currently have a number of changes lined up for a version 1.1 of the package, which I hope to release before the end of the year. I think implementing the functionality proposed in this PR would be good for a version 1.2, but could also be considered for 1.1 depending on timing.

khusmann commented 12 months ago

I'm starting to implement this extension in frictionless-r and got some clarification questions on the spec:

1) It looks like it is possible to have an enum constraint OR an enumLabels property.

What does it mean when an enumLabels property exists without an enum constraint?

What does it mean when a key exists in enumLabels that is not present in the enum constraint, and vice versa?

2) I'm a little confused by this line:

The absence of an enumOrdered property MUST NOT be taken to imply enumOrdered: false.

If there is an enum constraint or enumLabels is defined, I'm constructing a categorical/factor variable, right? Categorical vars are either unordered or ordered, so if enumOrdered is not defined and I should not interpret this as enumOrdered: false, does this mean I should make them ordered by default (enumOrdered: true)?

pschumm commented 12 months ago

What does it mean when an enumLabels property exists without an enum constraint?

In some situations you may want to label specific values but not impose an enum constraint. For example, consider a continuous laboratory value where specific reasons for missingness (e.g., poor sample quality, sample not obtained, etc.) are coded in the data with designated values (e.g., .a, .b, .c or -97, -98, -99). In this case, you may wish to label these specific missing value codes.
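
As a sketch (the field name and reason codes here are illustrative), such a field might look like:

{
  "fields": [
    {
      "name": "lab_result",
      "type": "number",
      "enumLabels": {
        ".a": "Poor sample quality",
        ".b": "Sample not obtained",
        ".c": "Assay failed"
      }
    }
  ],
  "missingValues": ["", ".a", ".b", ".c"]
}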

What does it mean when a key exists in enumLabels that is not present in the enum constraint, and vice versa?

One example where some values in the enum constraint may not be labeled is where you have a discrete, numeric scale with only the endpoints labeled (e.g., 0 "No pain", 1, 2, [...], 10 "Worst pain imaginable"). An example of the reverse (i.e., labeled values that do not appear in the enum constraint) might be the labels 0 "Male" and 1 "Female", where your dataset represents a study of women only and therefore has an enum constraint of ["Female"]. One might argue that the latter is a bit contrived, but I see no reason to exclude it.
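
A sketch of the first case (names are illustrative), labeling only the endpoints of the enum:

{
  "name": "pain_intensity",
  "type": "integer",
  "constraints": { "enum": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] },
  "enumOrdered": true,
  "enumLabels": {
    "0": "No pain",
    "10": "Worst pain imaginable"
  }
}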

The absence of an enumOrdered property MUST NOT be taken to imply enumOrdered: false.

If there is an enum constraint or enumLabels is defined, I'm constructing a categorical/factor variable, right?

IMO an enum constraint should imply that the field is categorical. However, the presence of enumLabels alone should not, based on the example above; note that it should still have a value label constructed if the software supports it.

Categorical vars are either unordered or ordered, so if enumOrdered is not defined and I should not interpret this as enumOrdered: false, does this mean I should make them ordered by default (enumOrdered: true)?

A good question. I must confess, when I was writing this I was thinking about pandas Categorical which uses a nullable boolean (with default None) for its ordered option (for a detailed discussion of this choice, see this issue). But I'm not sure we need that complexity here. Indicating that a categorical is ordered invokes additional behavior (e.g., computing min and max, sorting in logical instead of lexical order), whereas there are no special behaviors invoked for categoricals not designated as ordered. So I think a plain boolean would probably work here, with false as the default. But if we take that as the default, then it can't be interpreted as a statement about the level of measurement. That is what I meant by the statement above.
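
For illustration (a minimal sketch, not part of the pattern), this is the behavioral difference that "ordered" buys you in pandas:

import pandas as pd

health = pd.Categorical(
    ["Good", "Poor", "Excellent"],
    categories=["Poor", "Fair", "Good", "Very good", "Excellent"],
    ordered=True,
)

print(health.min())                     # "Poor" (min/max only work when ordered)
print(pd.Series(health).sort_values())  # sorts in logical (category) order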

pschumm commented 11 months ago

I’m new to the frictionless scene, so do you know of spaces that already exist that would be a good fit for continuing this conversation in?

I don't, but I'd defer to the community leaders here to ensure that whatever we do is both helpful and consistent with existing initiatives and workflows. I think that the defining feature of what we're doing here is that it is focused on the use of Frictionless data packages/resources across a broad range of analytic software (e.g., Stata, R, SAS, SPSS, Pandas, Julia). Thus, while one could imagine a section on categoricals or value labels in the documentation for separate plugins for each of these software packages, that wouldn't be very efficient nor would it permit cross fertilization.

My go-to here would be to start with a dedicated GitHub repository containing files in reStructuredText or Markdown format, rendered via Sphinx and exposed via GitHub Pages. This is both easy to edit and easy to consume. The content would be ideas, tips, etc. for using Frictionless data packages/resources as part of statistical analyses in fields like biomedical research, social science research, etc. (these are intended as examples only—not meant to be exclusive). Once there is a critical mass of content, it can always be moved elsewhere and/or reorganized.

But as I said, I'm glad to defer to the wisdom of others on this.

khusmann commented 11 months ago

IMO an enum constraint should imply that the field is categorical.

I agree. To summarize:

if we take that [enumOrdered: false] as the default, then it can't be interpreted as a statement about the level of measurement. That is what I meant by the statement above.

Agreed. Perhaps another way of saying this is "If there is an enum constraint, the default value of enumOrdered SHOULD be false. If there is no enum constraint, the enumOrdered property SHOULD NOT be defined".

However, the presence of enumLabels alone should not [imply that the field is categorical]

Hmm, in that case, when enumLabels is used on a non-enum type (e.g. an integer or numeric), having "enum" in the name may be a little misleading. In this case, they're no longer labels for enum values; they're labels for integer or numeric values.

Connecting this back to something you said earlier:

for me this supports the argument that numeric codes don't belong in a Table Schema unless they are merely indicating how the stored data are encoded (i.e., how the response categories are numerically encoded) in the corresponding file.

I can get behind this way of thinking. Along with your examples it helps clarify how numeric codes (as used by SPSS/Stata et al) are tangling at least two separable concepts: "value labels", and "encoding".

Clear examples of "encoding" would be designated values for missingness (e.g., .a, .b, .c or -97, -98, -99), or a categorical measure (say, 0: MALE, 1: FEMALE), where the numeric levels don’t mean anything. In these cases, the value being stored doesn't have any significance to the schema; it's entirely implementation-specific. We only mention them in the schema for the purpose of translating the stored file; similar to CSV dialect.

By contrast, the two labels anchoring a pain scale from 0 - 10 are not an "encoding", they're "value labels", that is, extra metadata attached to already-meaningful values.

Given that these are two separate concerns, what if we split the enumLabels property into a storageEncoding property and a valueLabels property? The storageEncoding property would map values from their implementation-dependent storage format to an implementation-independent form that would be used in the rest of the field & schema definitions. The valueLabels property would be just that, extra metadata attached to already-meaningful values, allowable in the context of integer and number types (including integers and numbers with enum constraints).

For example, here's a 1-5 pain scale with two value-label anchors and Stata-encoded missing reasons:

{
  "fields": [
    {
      "name": "physical_pain",
      "type": "integer",
      "enum": [1, 2, 3, 4, 5],
      "enumOrdered": true,
      "valueLabels": {
        "1": "No pain",
        "5": "Super painful"
      },
      "storageEncoding": {
        ".a": "Don't know",
        ".b": "Refused",
        ".c": "Not applicable"
      }
    }
  ],
  "missingValues": ["Don't know", "Refused", "Not applicable"]
}

And a binary gender scale, stored with SPSS-like encodings (note how 0: Male and 1: Female are explicitly noted as an encoding and so do not pollute valueLabels or the enum definition):

{
  "fields": [
    {
      "name": "binary_gender",
      "type": "string",
      "enum": ["Male", "Female"],
      "enumOrdered": false,
      "storageEncoding": {
        "0": "Male",
        "1": "Female",
        "-97": "Don't know",
        "-98": "Refused",
        "-99": "Not applicable"
      }
    }
  ],
  "missingValues": ["Don't know", "Refused", "Not applicable"]
}

And an integer (non-enum) scale with two value labels:

{
  "fields": [
    {
      "name": "pick_a_number",
      "type": "integer",
      "valueLabels": {
        "7": "A magic number",
        "18": "Voting age"
      },
      "storageEncoding": {
        ".a": "Don't know",
        ".b": "Refused",
        ".c": "Not applicable"
      }
    }
  ],
  "missingValues": ["Don't know", "Refused", "Not applicable"]
}

The advantage with an approach like this is that by decomposing the two roles that enumLabels plays into two separate properties, it allows us to better isolate specific implementation format from metadata / schema information: The information contained in valueLabels is universally meaningful metadata I'd want to include in a generated codebook for the dataset, and has different levels of native implementation support. storageEncoding, by contrast, would be something I'd only optionally include in a codebook if the user wanted to understand the implementation-specific nature of the file storage format, but is still necessary to include in the schema in order for all implementations to know how to properly load the file.

The disadvantage would be, of course, a little more complexity.

[Full disclosure: one of the projects I'm working on (and hope to release soon) is an "interactive codebook" viewer for frictionless packages that renders histograms, etc. based on type / schema information… So that's part of my motivation for bringing this up]

Thus, while one could imagine a section on categoricals or value labels in the documentation for separate plugins for each of these software packages, that wouldn't be very efficient nor would it permit cross fertilization.

I agree.

I'd defer to the community leaders here to ensure that whatever we do is both helpful and consistent with existing initiatives and workflows. I think that the defining feature of what we're doing here is that it is focused on the use of Frictionless data packages/resources across a broad range of analytic software (e.g., Stata, R, SAS, SPSS, Pandas, Julia).

Same here, I'm interested in doing whatever is most helpful, and I defer to whatever the community leaders think is best. And I agree with your characterization of what we're doing!

pschumm commented 11 months ago

Given that these are two separate concerns, what if we split the enumLabels property into a storageEncoding property and a valueLabels property? The storageEncoding property would map values from their implementation-dependent storage format to an implementation-independent form that would be used in the rest of the field & schema definitions. The valueLabels property would be just that, extra metadata attached to already-meaningful values, allowable in the context of integer and number types (including integers and numbers with enum constraints).

I definitely see your argument, but I think it's just a matter of balancing the conceptual advantages against the additional complexity. For example, one might argue that the enum property really shouldn't be a constraint but rather a type (as it typically is). However, if you made it a type, then you'd need to coerce to something sensible every time you were working in a language that doesn't have enums. Making it a constraint is intuitive, and still permits those of us who want to map the corresponding field to a type (e.g., categorical or similar) to do so.

One quick comment about nomenclature: The name "storageEncoding" is almost synonymous with file encoding (e.g., UTF-8), so I don't think that would work. The chance for misinterpretation or confusion is too high.

If you look at the original version of this PR, you'll see that my own thinking on this has evolved a bit thanks to comments by @peterdesmet (I started with something a bit more complex). And I've come to rather like the result. So the only thing I know to do is to offer my own best endorsement/explanation of the current proposal, with the understanding that I'm glad to defer to the Frictionless design/development team if they want to make modifications.

The property enumLabels provides a set of labels for an enumerated set of values that software may choose to handle, but is not required to. The set of values may be equivalent to the set of values specified in the enum property, but that is not necessary; one is a constraint, while the other is informative metadata. Including this in the Frictionless Table Schema accommodates at least two important use cases:

  1. Datasets written to a CSV file in "encoded" form by commonly used software such as Stata, SAS, SPSS, REDCap, etc.

  2. Existing, often discipline-specific conventions used when writing CSV files, such as files in genetics and genomics using 1 to indicate "Male" and 2 to indicate "Female". I'm sure there are better examples (that one is so antiquated).

More generally, enumLabels permits labeling of any set of enumerated values that may occur in the data. And again, software can decide what to do with them (or not).
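
For example, a rough sketch of use case (1), using the physical_health categories discussed earlier (the CSV stores integer codes, which enumLabels maps to category text):

{
  "name": "physical_health",
  "type": "integer",
  "constraints": { "enum": [1, 2, 3, 4, 5] },
  "enumOrdered": true,
  "enumLabels": {
    "1": "Poor",
    "2": "Fair",
    "3": "Good",
    "4": "Very good",
    "5": "Excellent"
  }
}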

Finally, it is understood that for certain software and/or specific purposes, there may be a desire for additional metadata in conjunction with discrete/categorical fields (e.g., additional value label mappings). Since these are by definition special purpose, they can be freely included in additional properties in the schema, in supplementary files in the data package (e.g., JSON/YAML file or another tabular resource linked via a foreign key), or in an rdfType.

That's the best I can do based on my current thinking, but as I said, glad to defer to the Frictionless team if they want to make changes.

khusmann commented 11 months ago

I think it's just a matter of balancing the conceptual advantages against the additional complexity.

Agreed. Another dimension we're balancing is backwards compatibility, which I think is another one of the advantages of the current proposal as it stands.

For example, one might argue that the enum property really shouldn't be a constraint but rather a type (as it typically is). However, if you made it a type, then you'd need to coerce to something sensible every time you were working in a language that doesn't have enums.

Exactly. This is a part of the larger conversation I'm interested in continuing (somewhere outside of this PR). The present software landscape has representations that support some concepts, but not others, and then additionally have features that conflate separable concepts. It creates quite a tangle. I'd really love to continue working on this with folks like yourself from different substantive & implementation backgrounds to build a map of these concepts (what are the separable, higher-order types / constructs?) and ways they fit together (what is preserved / lost when translating between implementations / representations?). The goal here, of course, is to maximize "cross-pollination" of data across implementations & representations, as you put it.

In addition to enums, I think the way missingness and its associated metadata is represented (in general) is also a big piece of the puzzle. I notice you originally had a field-specific property for "missingValues" in your original proposal, which reinforces to me that we're thinking along similar lines (thanks for pointing me there, by the way, I didn't realize so much changed from the earlier revisions). I have some ideas regarding representations of missingness across implementations, but yeah, all this is a conversation for a different context, and I defer to community leaders for the best way to facilitate it.

In the meantime, sorry I've taken up so much space in this PR, I wasn't meaning to delay this from going through and to open a can of worms! I really appreciate your patience with me @pschumm as I've been getting up to speed. I hope I can repay some of your time by helping to put together PRs that implement this extension in different libraries.

One quick comment about nomenclature: The name "storageEncoding" is almost synonymous with file encoding (e.g., UTF-8), so I don't think that would work. The chance for misinterpretation or confusion is too high.

Totally agree. That name was just what I came up with at the spur of the moment. Maybe valueRepresentation instead? valuePremap? levelCoding? Actually, levelCoding doesn't seem half bad. :)

And I've come to rather like the result.

Me too! I realize this is a careful compromise across a lot of dimensions, and that we should not let conceptual perfection become the enemy of the practical "good enough".

Thanks again for the engaging discussion and for all your work spearheading this proposal @pschumm! Like I said at the beginning, better enum support in frictionless is key for its adoption in a lot of the groups I'm a part of. I'll get back to work on a potential implementation of this extension in frictionless-r and report back if I run into any issues or edge cases.

khusmann commented 11 months ago

PS: I don't have edit access so I'm not sure where else to put this minor edit, but I think this example is missing a constraints definition and should read constraints: { "enum": [1,2,3,4,5] }, right? (Also a comma after the enumOrdered prop, and both need fixing in the subsequent example.)

pschumm commented 11 months ago

PS: I think this example is missing a constraints definition and should read constraints: { "enum": [1,2,3,4,5] }, right? (Also a comma after the enumOrdered prop, and both need fixing in the subsequent example.)

Indeed—thanks! I've just fixed these.

The goal here, of course, is to maximize "cross-pollination" of data across implementations & representations, as you put it.

Exactly. Most packages now have importers or exporters to/from other formats (e.g., SAS can export a Stata file, and Stata can import files from SAS or SPSS), but those don't address the need to create open, software-agnostic data resources that are broadly and easily accessible (i.e., frictionless).

In addition to enums, I think the way missingness and its associated metadata is represented (in general) is also a big piece of the puzzle. I notice you originally had a field-specific property for "missingValues" in your original proposal, which reinforces to me that we're thinking along similar lines.

I agree, which is why I've been trying to respond. And I agree that the way missingness is handled is also an important part of facilitating use of Frictionless among various fields and with specific software. I decided to drop it from this pattern, partly in response to discussions here and elsewhere, to better maintain separation of concerns and to keep things moving along.

In the meantime, sorry I've taken up so much space in this PR, I wasn't meaning to delay this from going through and to open a can of worms!

No need for apologies. I'm committed to helping to make use of Frictionless seamless for health and social scientists (including the data repositories that they use), and for statistical analysis across the widest possible range of software. It's terrific to meet others with similar objectives who are willing to work together.

roll commented 10 months ago

Thanks a lot @pschumm!

Dear all, are we ready to merge? It looks really good to me.