Closed: pschumm closed this 4 months ago
Would V2 be an opportunity to implement this pattern as a top-level field type rather than as an enum constraint on other types? E.g., instead of:
```json
{
  "fields": [
    {
      "name": "physical_health",
      "type": "string",
      "constraints": {
        "enum": [
          "Poor",
          "Fair",
          "Good",
          "Very good",
          "Excellent"
        ]
      },
      "enumOrdered": true
    }
  ],
  "missingValues": ["Don't know", "Refused", "Not applicable"]
}
```
something like:
```json
{
  "fields": [
    {
      "name": "physical_health",
      "type": "enum",
      "values": [
        "Poor",
        "Fair",
        "Good",
        "Very good",
        "Excellent"
      ],
      "ordered": true
    }
  ],
  "missingValues": ["Don't know", "Refused", "Not applicable"]
}
```
As a top-level type, this makes the definition much simpler / cleaner / easier to parse & implement in type systems, because you can immediately detect that it is a categorical / ordinal enum type rather than needing to drill into constraints. Also, we can drop the "enum" prefix on enumLabels and enumOrdered, because it's clear the attributes apply in an "enum" field scope. (And these props don't make sense on other primitive types!)
Given that we already have types like "year" implemented as top-level types rather than constraints on primitive types, I think there's an argument for categorical & ordinal enums to receive similar top-level standing given their extensive use in the biobehavioral, medical and social sciences as @pschumm referenced...
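To illustrate the parsing point, here's a minimal, hypothetical sketch (the helper and field dicts are illustrative only, not part of any spec or implementation) of how detection differs between the two shapes:

```python
def is_categorical(field: dict) -> bool:
    """Detect a categorical field under either shape of the proposal."""
    # Proposed top-level type: a single attribute check suffices.
    if field.get("type") == "enum":
        return True
    # Current constraint-based pattern: drill into the nested object.
    return "enum" in field.get("constraints", {})

proposed = {"name": "physical_health", "type": "enum",
            "values": ["Poor", "Fair", "Good", "Very good", "Excellent"]}
current = {"name": "physical_health", "type": "string",
           "constraints": {"enum": ["Poor", "Fair", "Good"]}}
```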
The addition of a new field type like this would not violate @peterdesmet's proposed rules for spec changes (#858 comment).
Thoughts?
@khusmann It's quite a substantial change, but it would indeed be nice to give some more love to `enum`, making `values`, `labels` and `ordered` properties at the same level as `bareNumber` and `groupChar` are for numbers.

What we lose is the ability to declare a non-string type, like `date`. Let's say `project_start` is a non-ISO date and can only have 2 values (an `enum`; min/max is no replacement):
```json
{
  "name": "project_start",
  "type": "enum",
  "values": [
    "01/03/2023",
    "01/04/2023"
  ],
  "format": "%d/%m/%Y"
}
```
An implementation would have to guess somehow that `project_start` is to be interpreted as a `date` field. The alternatives are:

1. Keep `type` as `date`, but lift `enum` up from `constraints` as `values`.
2. Keep `type` as `enum`, but only allow string values.

Thoughts?
@peterdesmet excellent points.

Conceptually, I'm imagining the `enum` type to be specifically for representing categorical / ordinal variables, that is, a field with a distinct set of levels that may or may not be ordered. So I would argue we should only allow string `values` in `enum` types, where they act as labels to represent these abstract levels.

By contrast, I see an `enum` constraint as a validation rule on an existing type. So in your `project_start` example, I'd represent this the traditional way -- it's a `date` field, but with a validation rule:
```json
{
  "name": "project_start",
  "type": "date",
  "constraints": {
    "enum": ["01/03/2023", "01/04/2023"]
  },
  "format": "%d/%m/%Y"
}
```
With a definition like this, I'd expect implementations to read `project_start` as a `date` type, not as a categorical / factor type. It's a "date with validation constraints", not a categorical / ordinal variable.

Perhaps the use of the `enum` keyword in both these contexts is confusing, and we should use a different name for the type? (e.g. `"type": "categorical"`)
The key argument here is that categorical / ordinal fields are a conceptual type distinct from string, number, date, etc. They are not just constrained string or numeric fields... they are fundamentally a different type of field with different properties, which I think affords them their own first-class field definition, rather than having their existence be implied from a validation constraint.
I'm in favour of a `"type": "categorical"` as a concept separate from `constraints.enum`. It's non-breaking and will make implementation more contained and straightforward.
I very much like this idea, and note that it will emphasize the correspondence between the proposed `categorical` type in Frictionless and the dtype `category` in Pandas, factors in R, or a `CategoricalVector` in Julia. It will also simplify the necessary code, a point @khusmann made above.
For encoded data, I presume the corresponding specification would be:
```json
{
  "fields": [
    {
      "name": "physical_health",
      "type": "categorical",
      "values": [1, 2, 3, 4, 5],
      "ordered": true,
      "labels": {
        "1": "Poor",
        "2": "Fair",
        "3": "Good",
        "4": "Very good",
        "5": "Excellent"
      }
    }
  ],
  "missingValues": ["Don't know", "Refused", "Not applicable"]
}
```
I suppose there is some ambiguity between `"values": [1,2,3,4,5]` and `"values": ["1","2","3","4","5"]`; the quotes around the keys in the `labels` property are necessary merely due to the JSON spec. But I think this is a very minor issue that can be dealt with in a note to implementors (my instinct would be to permit either).
Thanks for bringing up the encoded categoricals @pschumm, I think this is worth more discussion.
I think the `values` array in the categorical type should always be specified as logical values, that is, string labels that represent the abstract levels of the categorical. This would also have the nice side effect of always defining the levels of the Pandas / R / Julia categorical type when it is imported.

By contrast, as we've discussed before in previous threads on categoricals, numeric encodings of categoricals are physical values. When gender is encoded 1: Female, 2: Male, the logical values of the categorical field are `["Male", "Female"]`, even though its physical values are `["1", "2"]`.

Therefore, for the proposed categorical type, I would argue encodings should be expressed in a field more akin to `trueValues` and `falseValues` on boolean types, that is, mappings from logical values to physical values. Something like:
```json
{
  "fields": [
    {
      "name": "physical_health",
      "type": "categorical",
      "values": [
        "Poor",
        "Fair",
        "Good",
        "Very good",
        "Excellent"
      ],
      "ordered": true,
      "codes": {
        "Poor": "1",
        "Fair": "2",
        "Good": "3",
        "Very good": "4",
        "Excellent": "5"
      }
    }
  ]
}
```
This way, the `values` field always holds logical values (and we can always use those as the names of levels when importing into Pandas / R / Julia / etc.).
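As a sketch of how an implementation might consume this (illustrative only, not spec text): invert the logical-to-physical `codes` mapping to decode raw cells, leaving `values` to define the level order.

```python
field = {
    "name": "physical_health",
    "type": "categorical",
    "values": ["Poor", "Fair", "Good", "Very good", "Excellent"],
    "ordered": True,
    "codes": {"Poor": "1", "Fair": "2", "Good": "3",
              "Very good": "4", "Excellent": "5"},
}

# Invert the logical -> physical mapping to decode raw cells.
decode = {phys: logical for logical, phys in field["codes"].items()}

raw = ["3", "1", "5"]
logical = [decode[cell] for cell in raw]
# `values` then serves directly as the ordered level set when
# constructing a factor / pd.Categorical / CategoricalVector.
```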
That said, I see two issues with this approach, which I outline below with potential solutions:
1) Partially labeled scales. Some scales do not have labels for all their levels. For example, suppose physical_health
only had "Poor" and "Excellent" anchors, and the rest of the levels were unnamed. (e.g. the question was "On a scale from 1 to 5, 1 being Poor, and 5 being Excellent, how do you rate your health?")
Conceptually, how do we identify the levels of this kind of field? Relatedly, how do we envision this variable being imported into Pandas / R / Julia?
One approach could be to make it a categorical with levels ["Poor", "2", "3", "4", "Excellent"], which would be represented by the following field definition:
```json
{
  "fields": [
    {
      "name": "physical_health",
      "type": "categorical",
      "values": [
        "Poor",
        "2",
        "3",
        "4",
        "Excellent"
      ],
      "ordered": true,
      "codes": {
        "Poor": "1",
        "Excellent": "5"
      }
    }
  ]
}
```
(Where physical values without codes simply pass through untransformed)
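That pass-through behavior can be sketched as follows (an illustrative reading, not a normative one): uncoded physical values decode to themselves.

```python
field = {
    "name": "physical_health",
    "type": "categorical",
    "values": ["Poor", "2", "3", "4", "Excellent"],
    "codes": {"Poor": "1", "Excellent": "5"},
}

decode = {phys: logical for logical, phys in field["codes"].items()}

# Cells without a declared code simply pass through untransformed.
raw = ["1", "2", "4", "5"]
logical = [decode.get(cell, cell) for cell in raw]
```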
2) Labeled missingness. In the current spec, `missingValues` are always defined as physical values. So I think @pschumm's earlier example (using the `labels` approach) would actually look something like:
```json
{
  "fields": [
    {
      "name": "physical_health",
      "type": "categorical",
      "values": [1, 2, 3, 4, 5],
      "ordered": true,
      "labels": {
        "1": "Poor",
        "2": "Fair",
        "3": "Good",
        "4": "Very good",
        "5": "Excellent",
        "-97": "Don't know",
        "-98": "Refused",
        "-99": "Not applicable"
      }
    }
  ],
  "missingValues": ["-97", "-98", "-99"]
}
```
(Where `missingValues` now holds physical values (codes) instead of logical values (labels).) (edit: changed missing values to all negative for clarity)
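In other words, a reader would check `missingValues` (physical codes) before any label lookup. A rough sketch, assuming the labels approach above (the helper is hypothetical):

```python
labels = {"1": "Poor", "2": "Fair", "3": "Good", "4": "Very good",
          "5": "Excellent", "-97": "Don't know", "-98": "Refused",
          "-99": "Not applicable"}
missing_values = ["-97", "-98", "-99"]

def read_cell(cell: str):
    # missingValues hold physical values and are applied first,
    # before the cell is interpreted against the field's labels.
    if cell in missing_values:
        return None
    return labels.get(cell, cell)

row = ["3", "-98", "5"]
```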
Alternatively, we could support logical missing values in the same manner as the encoded categorical approach I'm proposing above, by adding a `missingCodes` field:
```json
{
  "fields": [
    {
      "name": "physical_health",
      "type": "categorical",
      "values": [
        "Poor",
        "Fair",
        "Good",
        "Very good",
        "Excellent"
      ],
      "ordered": true,
      "codes": {
        "Poor": "1",
        "Fair": "2",
        "Good": "3",
        "Very good": "4",
        "Excellent": "5"
      }
    }
  ],
  "missingValues": ["Don't know", "Refused", "Not applicable"],
  "missingCodes": {
    "Don't know": "-97",
    "Refused": "-98",
    "Not applicable": "-99"
  }
}
```
This would enable `missingValues` to be specified with logical values, mirroring the behavior of the categorical type. I think this is actually really nice and consistent, because missing values are indeed a categorical type!

(This would also work on a per-field `missingValues` basis, and be useful for all field types, not just categorical fields.)
I realize adding a `missingCodes` field would be another big change, but I think it is intertwined with the spec for categorical types for the reasons I mention above. That said, it is still within @peterdesmet's proposed rules for V2 spec changes:

1) It would not invalidate previous datapackage.json files
2) A datapackage with the new `missingCodes` field would be invalid for software that did not support the new field yet, but this is OK
Thoughts?
@khusmann thanks for pushing this forward. Some thoughts below.

TL;DR: I think I prefer @pschumm's original approach: I find it more straightforward and think it can be simplified even further. I do agree with the correction you made, where missing values should have the physical values (`"missingValues": ["-97", "-98", "-99"]`).
I think `values` should hold the physical values. This is already the case for trueValues and falseValues (e.g. listing the physical `"True"` for logical `true`) and missingValues (e.g. listing the physical `"-99"` for logical `null`), and it would be good to align with that.

It is still possible to derive levels directly from `values`: values declared in `missingValues` are interpreted as `null`, and the remaining values act as the levels. Compare the `levels` parameter in R's factor():

> The default is the unique set of values taken by as.character(x), sorted into increasing order of x.
```r
# Data are strings, values are not defined
data <- c("Male", "Man", "Male", "Female", "Lady", "Undefined")
factor(data)
#> [1] Male      Man       Male      Female    Lady      Undefined
#> Levels: Female Lady Male Man Undefined

# Data are strings, values are defined (but don't map entirely)
values <- c("Male", "Man", "Female", "Lady", "Nonbinary", "Declined")
factor(data, levels = values)
#> [1] Male   Man    Male   Female Lady   <NA>
#> Levels: Male Man Female Lady Nonbinary Declined

# Data are integers, values are not defined
data <- c(1, 2, 1, 3, 4, -99)
factor(data)
#> [1] 1   2   1   3   4   -99
#> Levels: -99 1 2 3 4

# Data are integers, values are defined (but don't map entirely)
values <- c(1, 2, 3, 4, 5, 6)
factor(data, levels = values)
#> [1] 1    2    1    3    4    <NA>
#> Levels: 1 2 3 4 5 6
```

Created on 2024-02-16 with reprex v2.1.0
It is also possible to map multiple values to the same label (e.g. `Male`/`Man`, `Female`/`Lady`). This can be achieved with the `labels` property. factor() in R supports this with the `labels` parameter, where:

> Duplicated values in labels can be used to map different values of x to the same factor level.
```r
data <- c(1, 2, 1, 3, 4, -99)
values <- c(1, 2, 3, 4, 5, 6)
labels <- c("Male", "Male", "Female", "Female", "Nonbinary", "Declined")
factor(data, levels = values, labels = labels)
#> [1] Male   Male   Male   Female Female <NA>
#> Levels: Male Female Nonbinary Declined
```

Created on 2024-02-16 with reprex v2.1.0
We could opt to simplify the current `value: label` proposal for `labels` to an array in the same order and with as many elements as `values`. That would also avoid the `9` vs `"9"` issue. Not sure if there is functionality we would lose; I think it is clear enough.
If missing values are expressed in `values` as `null`, then there is likely no need for `missingValues` either: those values are interpreted as `null` before any further processing steps. Validation of unexpected values can remain the job of `constraints.enum` rather than `values`. Personally I prefer that `values` is optional and doesn't need to encompass all present values.

Resulting syntax:
```json
{
  "fields": [
    {
      "name": "gender",
      "type": "categorical",
      "values": [1, 2, 3, 4, 5, 6],
      "labels": ["Male", "Male", "Female", "Female", "Nonbinary", "Declined"],
      "ordered": false,
      "missingValues": ["NA"]
    }
  ]
}
```
I agree with @peterdesmet that values should contain the physical values present in the data—that is, as he notes, more consistent with other elements of the standard and more intuitive. It also makes the schema easier to read by a human being if the data contain a mix of fields, some represented in the file by their logical values and others represented by codes.
I also like the proposed simplification of changing the `labels` property to an array of the same length as `values`. The only problem I see is that it will be more difficult to read in cases where the number of values/labels is large (e.g., prescription medications in a drug database). I believe this is the exception rather than the rule, but even in the case of a modest number of values/labels (e.g., ~6 or more), it could make manual edits to a schema more error-prone. I could go either way here.
The one thing I find counterintuitive would be to permit `values` to have fewer values than are present in the data, once all values declared in `missingValues` have been accounted for. IMO that would make the schema more difficult to read and interpret, and perhaps more importantly, would permit potentially serious errors to pass silently during validation. More values than are present in the data, sure, that makes sense. But fewer values strikes me as being too implicit. What would the harm be in requiring that all of the observed values be present in the case of a categorical variable?
Finally, I would just note that I agree `missingValues` should always contain physical values (as they do now). The example I gave above was intentional; for example, this is how REDCap exports data by default (i.e., categorical variables get exported with their numeric codes, except for defined missing values, which are represented by their labels). So we should accommodate that case even if we wouldn't choose to write data that way. Thus, in the example here, software such as Python or R that cannot represent multiple types of missing values would ignore (i.e., treat as `null`) the values in `missingValues`, while software such as Stata, SAS or SPSS that can represent multiple types of missing values could automatically incorporate the values in `missingValues` as extended missing values (Stata or SAS) or negative integers (SPSS).
Finally, let me say how much I appreciate you guys engaging so deeply here. I feel strongly that once we arrive at a final resolution, this will have an enormous impact on the utility of Frictionless in the disciplines within which I work (and probably others). And I hope if we're ever physically together we can still enjoy sharing a pint together (with no talk of categoricals or value labels!).
I find my proposal to have `values` and `labels` be two arrays of the same length a bit clunky and hard to read (especially for many values). I think we can combine them into one property:
```json
{
  "fields": [
    {
      "name": "gender",
      "type": "categorical",
      "categories": [
        {"value": 1, "label": "Male"},
        {"value": 2, "label": "Male"},
        {"value": 3, "label": "Female"},
        {"value": 4, "label": "Female"},
        {"value": 5, "label": "Nonbinary"},
        {"value": 6}
      ],
      "ordered": false,
      "missingValues": ["NA"]
    }
  ]
}
```
- `categories` as a name aligns well with `"type": "categorical"`.
- `categories` (levels/values) is still an array, so it is possible to order them (`ordered`).
- `value` is the physical value. In contrast with the first proposal for labels/enumLabels, values don't need to be wrapped in double quotes, since they are not keys.
- `label` is directly associated with `value`, vastly improving readability.
- `label` should be optional (since not all data providers will want to provide it), but now it is optional at a value level. Implementations should just use the value if a label is not provided.

> But fewer values strikes me as being too implicit. What would the harm be in requiring that all of the observed values be present in the case of a categorical variable?
My suggestion to allow fewer values was based on `factor()` being able to deal with those. We already have `constraints.enum` to validate unexpected values, but it might be good to include that functionality in `categories` as well. I'm a bit on the fence whether it is a good design decision to have both `constraints.enum` and `categories` as methods of defining that, or if we should reserve validation for `constraints.enum` only.
> And I hope if we're ever physically together we can still enjoy sharing a pint together (with no talk of categoricals or value labels!).
Yes (even if we do talk about it 😄 ) 🍻
Love it @peterdesmet! Personally, I would prefer the restriction that we list all of the categories and don't have to secondarily include an `enum` property to do validation, but I could live with a group decision on that.
I like where we're going with this! Especially @peterdesmet's proposed `categories` prop!

I agree, putting the levels in a map with `value` props greatly improves readability and is a lot closer in function to `trueValues` than what I was proposing earlier.

I also agree with @pschumm that we should list all the values / categories rather than requiring a second `enum` constraint. Per @peterdesmet's point, although R's `factor` allows the data to have values not specified in `levels`, the "fixed" version `fct` in `forcats` does not. I definitely prefer the conservative / strict / explicit approach here.
(Side note: for `boolean` types, do we consider it a validation error if a value comes up that is not contained in `trueValues`/`falseValues`? I cannot find mention in the spec…)
Where I still have reservations with @peterdesmet 's latest approach, however, is that from the field definition it is not immediately clear what the logical levels of the categorical should be. In the example, it at first looks like there are 6 logical levels, and it requires sorting through the labels to find that there are actually only 4, because two get collapsed.
I'd argue that R's `factor` usage of `labels` as a way to collapse levels is a transformation of the data, rather than a description of it. (And note that `forcats`' stricter implementation `fct` also does not allow collapsing via labels, for this reason, to encourage explicit use of `fct_collapse` instead.)

Therefore, I think we should require that `labels` be unique, so that we always have a 1-1 correspondence between items in the `categories` array and the logical levels of the resulting categorical:
```json
{
  "fields": [
    {
      "name": "gender",
      "type": "categorical",
      "categories": [
        {"value": 1, "label": "Male"},
        {"value": 2, "label": "Man"},
        {"value": 3, "label": "Female"},
        {"value": 4, "label": "Lady"},
        {"value": 5, "label": "Nonbinary"}
      ],
      "ordered": false,
      "missingValues": ["NA", "6"]
    }
  ]
}
```
Yes, Male/Man and Female/Lady can be grouped, but they are still qualitatively different responses and therefore distinct logical levels. If a user wants to collapse those levels, they can do so in a subsequent transformation step.
I'll also note that another advantage of @peterdesmet's list-of-objects approach is that logical level objects can be extended via user-defined properties – for example, the text of the item in the survey question:
```json
{
  "fields": [
    {
      "name": "gender",
      "type": "categorical",
      "description": "Select the statement you most agree with",
      "categories": [
        {"value": 1, "label": "Male", "text": "I identify as male"},
        {"value": 2, "label": "Man", "text": "I identify as a man"},
        {"value": 3, "label": "Female", "text": "I identify as female"},
        {"value": 4, "label": "Lady", "text": "I identify as a lady"},
        {"value": 5, "label": "Nonbinary", "text": "I identify as nonbinary"}
      ],
      "ordered": false,
      "missingValues": ["NA", "6"]
    }
  ]
}
```
You'll notice I also put level 6 (Declined) in `missingValues`. I think it should go here instead of `categories` because it should not be considered one of the logical values of the categorical… it is a missing value instead. Yes, we lose the label, but given this current direction, I think we can safely make "missing labels" a separate proposal / discussion. (We will want "missing labels" to be available for all field types, not just categorical fields!)
And as @peterdesmet said, when no `label` is given, implementations can just use `value`. For example:
```json
{
  "fields": [
    {
      "name": "physical_health",
      "type": "categorical",
      "values": [
        {"value": 1, "label": "Poor"},
        {"value": 2},
        {"value": 3},
        {"value": 4},
        {"value": 5, "label": "Excellent"}
      ],
      "ordered": true
    }
  ]
}
```
This would be imported in R as `factor(c("Poor", "2", "3", "4", "Excellent"))`.
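The fallback rule can be sketched in a couple of lines (illustrative, not normative):

```python
categories = [
    {"value": 1, "label": "Poor"},
    {"value": 2},
    {"value": 3},
    {"value": 4},
    {"value": 5, "label": "Excellent"},
]

# When no label is given, fall back to the stringified value,
# matching factor(c("Poor", "2", "3", "4", "Excellent")) in R.
levels = [c.get("label", str(c["value"])) for c in categories]
```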
> And I hope if we're ever physically together we can still enjoy sharing a pint together (with no talk of categoricals or value labels!).
I hope we can make that happen one of these days! I really appreciate everyone's engagement on this as well :)
> The example I gave above was intentional; for example, this is how REDCap exports data by default (i.e., categorical variables get exported with their numeric codes, except for defined missing values, which are represented by their labels).
Ah, good to know! Then for the example I just gave we'd have:
```json
{
  "fields": [
    {
      "name": "physical_health",
      "type": "categorical",
      "values": [
        {"value": 1, "label": "Poor"},
        {"value": 2},
        {"value": 3},
        {"value": 4},
        {"value": 5, "label": "Excellent"}
      ],
      "ordered": true,
      "missingValues": ["Don't know", "Refused", "Not applicable"]
    }
  ]
}
```
For a REDCap export. That looks quite nice.
👍 With the minor correction that `values` should be `categories`.
All of this looks good to me. I agree with @khusmann that collapsing is a transformation; while Stata permits you to label two different integer values with the same label, those are still treated as separate analytically and appear separately in output (just with the same label).
Note that in the discussion above, both @khusmann and @peterdesmet are using field-specific `missingValues` (which at present are not part of the spec). This reinforces my original contention that the issue of field-specific `missingValues` is closely related to efficient description of categorical variables (in fact, I had originally included it in the pattern but then dropped it to simplify things, and because it had already been proposed as a separate pattern). The proposal here would still work without field-specific `missingValues`, but not quite as well. So I'd like to put in a plug for tackling #861 too.
> We already have `constraints.enum` to validate unexpected values, but it might be good to include that functionality in `categories` as well. I'm a bit on the fence if it is a good design decision to have both `constraints.enum` and `categories` as methods of defining that or if we should reserve validation for `constraints.enum` only.
I don't know if this addresses your comment above, but other field types already invoke validation (e.g., a field with type `integer` has to be an integer, even without specifying any further `constraints`). IMO the value of what we're doing here is defining a categorical variable as a first-class type (not just a `string` with constraints), so for me at least, it doesn't seem inconsistent for it to invoke validation. Perhaps I'm missing something here.
Regarding `field.missingValues`: there is a PR for v2 now: https://github.com/frictionlessdata/datapackage/pull/24
> This reinforces my original contention that the issue of field-specific missingValues is closely related to efficient description of categorical variables. The proposal here would still work without field-specific missingValues, but not quite as well.
Agreed! I am also strongly in favor of field-specific `missingValues` for these reasons.
> IMO the value of what we're doing here is defining a categorical variable as a first class type (not just a string with constraints), so for me at least, it doesn't seem inconsistent for it to invoke validation.
Also agree. Well said.
One more thought – do we want to offer a shortcut for string levels? So:
```json
{
  "fields": [
    {
      "name": "gender",
      "type": "categorical",
      "categories": [
        "Male",
        "Man",
        "Female",
        "Lady",
        "Nonbinary"
      ],
      "ordered": false,
      "missingValues": ["NA", "Declined"]
    }
  ]
}
```
would be syntactic sugar for:
```json
{
  "fields": [
    {
      "name": "gender",
      "type": "categorical",
      "categories": [
        { "value": "Male" },
        { "value": "Man" },
        { "value": "Female" },
        { "value": "Lady" },
        { "value": "Nonbinary" }
      ],
      "ordered": false,
      "missingValues": ["NA", "Declined"]
    }
  ]
}
```
In summary, this would make the complete type signature of the proposed field as follows:
```typescript
type CategoricalField = {
  name: string,
  title?: string,
  description?: string,
  example?: string,
  format?: "default",
  type: "categorical",
  categories: ({ value: string | number, label?: string } | string)[],
  ordered?: boolean,
  constraints?: {
    required?: boolean,
    unique?: boolean
  },
  missingValues?: string[]
}
```
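A hypothetical normalizer for the `categories` union (the canonical object form and helper name are mine, not spec text) might desugar the string shortcut like so:

```python
def normalize_categories(categories):
    """Expand the bare-string shortcut into {value, label?} objects."""
    normalized = []
    for c in categories:
        if isinstance(c, str):
            c = {"value": c}  # syntactic sugar: a bare string level
        entry = {"value": c["value"]}
        if "label" in c:
            entry["label"] = c["label"]
        normalized.append(entry)
    return normalized
```

With this, the sugared and desugared examples above normalize to the same structure, so downstream code only ever sees the object form.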
Thanks for suggesting this @khusmann; I was so focused on the other details that it didn't even occur to me. Indeed, as I think I've mentioned before, this is my preferred way to distribute data (i.e., labels rather than integer codes) since it makes them useable with the broadest range of software. So this simplified specification would be very nice (not to mention very readable). I definitely favor including this option.
My 2 cents:

- I'm in favour of validating via `categories` and not requiring `constraints.enum` to do that. As in: all values in the data should be present in `categories` or it is invalid.
- Correction: I guess the shortcut is a case of a union type (array of strings vs array of objects). Other than my readability concerns, it does allow backward compatibility for missingValues.
In general, union types offer a lot of flexibility to keep things backward compatible and are often elegant (e.g. not having to add a new `roles` property over `role`). I think we need a higher-level discussion on whether we want to allow or discourage these (#873) before we can move this further.
> In general, union types offer a lot of flexibility to keep things backward compatible and are often elegant (e.g. not having to add a new roles property over role). I think we need a higher-level discussion on whether we want to allow or discourage these (#873) before we can move this further.
Just added comments to #873 re: union types. There, I argue that the kind of "union type syntactic sugar" I'm proposing here should be considered on a case-by-case basis. I would argue that the proposed syntactic sugar here is worth using because:

1. It's a special case, so won't confuse other parts of the spec.
2. It actually makes things more consistent broadly across the spec, because it would match the signature of `missingValues` (assuming we accept #880 as well).
3. As @pschumm mentioned, categorical fields with meaningful string physical values are extremely common, probably more so than encoded categoricals. So I think providing an easy shortcut to this definition makes a lot of sense.
With this proposal (I'd call it inline-categories), the list of categories must be given on each field again and again. If categories have many values and/or are used in multiple fields, it may make sense to allow referencing:
```json
{
  "name": "supportStatement1",
  "type": "categorical",
  "categories": "agreementLevel"
}, {
  "name": "supportStatement2",
  "type": "categorical",
  "categories": "agreementLevel"
}
```
and elsewhere
```json
{
  "categoryTypes": {
    "agreementLevel": [
      { "value": 1, "label": "Strongly Disagree" },
      { "value": 2 },
      { "value": 3 },
      { "value": 4 },
      { "value": 5, "label": "Strongly Agree" }
    ]
  }
}
```
The value of `categories` could also be a URI referencing an external large list of allowed values.

See Codes and Codelists in the Avram schema language for the same idea (codes == categories).
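A sketch of how a reader might resolve such a reference (`categoryTypes` and the helper function are hypothetical, per the comment above):

```python
package = {
    "categoryTypes": {  # hypothetical package-level property
        "agreementLevel": [
            {"value": 1, "label": "Strongly Disagree"},
            {"value": 2}, {"value": 3}, {"value": 4},
            {"value": 5, "label": "Strongly Agree"},
        ]
    }
}

def resolve_categories(field: dict, package: dict):
    cats = field["categories"]
    if isinstance(cats, str):  # a named (or, later, URI) reference
        return package["categoryTypes"][cats]
    return cats  # inline list

field = {"name": "supportStatement1", "type": "categorical",
         "categories": "agreementLevel"}
```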
@nichtich Agreed -- I think that's exactly the direction we want to go, and "inline-categories" gives us the first step.
If categories have many values and/or if they are used in mutliple fields, it may make sense to allow referencing
Recall that #888 covers similar ground; we might want to consider both ideas together.
This pattern includes two additions to the Table Schema to facilitate working with categorical data across a broad range of commonly used analytic software. It is fully backward compatible, and would substantially increase the usability of Frictionless data packages in biomedical, epidemiological, and social research. It has been discussed and revised extensively here.