Closed djvanderlaan closed 3 weeks ago
I like this proposal. I have, however, two suggestions. Both are inspired by Swagger's $ref notation
A reference to an external codeList may be made using JSON Reference syntax (RFC3986) and JSON Pointer syntax (RFC6901)
Example:
...
"path": "http://path/to/your/frictionless.json#/codelist-edu_level"
...
Jan's example assumes a "code" and a "label" in the first and second column in the file, respectively. This will work in many cases and it is easy and clean to specify this. However, we become more flexible by allowing an optional codeListSpec
attribute that has codeField
and labelField
keys.
The codeField
and labelField
attributes take either a column name (string) or a column number (int).
Example:
...
}, {
"name": "edu_level",
"type": "integer",
"codeList": "codelist-edu_level",
"codeListSpec": {"codeField": "MyCode",
"labelField": "TheLabel"}
}
...
The codeField
and labelField
keys also use JSON Pointer notation to specify the codes and labels in a hierarchical json or xml data source. This source may typically be the response of an HTTP request to a REST API.
Example:
"path": "http://some-server.com/api/v1/data/#common/prefix",
...
...
}, {
"name": "edu_level",
"type": "integer",
"codeList": "codelist-edu_level",
"codeListSpec": {"codeField": "/bar/code",
"labelField": "/foo/label"}
}
...
What about other data sources, like the more complicated ICD-10? In cases like this, a (custom) mimetype/mediatype (e.g. application/vnd.sdmx.generic+xml;version=2.1
)
might be the only thing we can do in terms of the specification. This indicates that it's up to the client tool how to interpret that data source.
Thanks for jumping into the conversation here! Coded categorical data is something I'm very interested in for frictionless as well.
I think the functionality you're after here might be more generally modeled / solved by indicating primaryKey
/ foreignKey
relationships between tables. This would have the added benefit of having these relationships be exportable to formats like SQL.
For example, the example you include could be modeled like this:
{
"name": "highest_education",
"resources": [
{
"name": "edulevel",
"format": "csv",
"mediatype": "text/csv",
"path": "edulevel.csv",
"encoding": "utf-8",
"schema": {
"fields": [
{
"name": "id",
"type": "integer"
}, {
"name": "place_of_residence",
"type": "string",
}, {
"name": "edu_level",
"type": "integer",
}
],
"primaryKey": ["id"],
"foreignKeys": [
{
"fields": "place_of_residence",
"reference": {
"resource": "codelist-regions",
"fields": "code"
}
}, {
"fields": "edu_level",
"reference": {
"resource": "codelist-edu_level",
"fields": "code"
}
}
]
}
}, {
"name": "codelist-regions",
"schema": {
"fields": [
{
"name": "code",
"type": "string"
}, {
"name": "name",
"type": "string"
}, {
"name": "parent",
"type": "string"
}
],
"primaryKey": ["code"],
"foreignKeys": [
"fields": "parent",
"reference": {
"resource": "",
"fields": "code"
}
]
},
"path": "codelist-regions.csv",
}, {
"name": "codelist-edu_level",
"schema": {
"fields": [
{
"name": "code",
"type": "integer"
}, {
"name": "name",
"type": "string"
}
],
"primaryKey": ["code"]
},
"data": [
{"code": 1, "name": "Low education"},
{"code": 2, "name": "Medium education"},
{"code": 3, "name": "High education"}
]
}
]
}
I think this solves almost all the requirements you listed above:
✔️ Possibility to indicate that a given field should use values/codes from a given list. ❌ It should be possible to store this list in the datapackage meta data (datapackage.json) itself ✔️ Have the codes in a file as large lists of codes make the metadata too bloated and this makes maintenance also more difficult. This file could be part of the datapackage or could be hosted externally. ✔️ It should preferably be possible to define hierarchies in the codes. (Notice the self reference in the region codes!)
It additionally has the benefits of allowing for more flexible naming of fields, being SQL compatible, as mentioned above, and leveraging all of the existing validation built around primary & foreign keys.
For the item that it doesn't address (The simple case of using values / codes inline when they are not long & do not have other complex relationships), I think the current categorical proposal #875 is still a good solution:
{
"name": "highest_education",
"resources": [
{
"name": "edulevel",
"format": "csv",
"mediatype": "text/csv",
"path": "edulevel.csv",
"encoding": "utf-8",
"schema": {
"fields": [
{
"name": "id",
"type": "integer"
}, {
"name": "place_of_residence",
"type": "string",
}, {
"name": "edu_level",
"type": "categorical",
"categories": [
{"value": 1, "label": "Low education"},
{"value": 2, "label": "Medium education"},
{"value": 3, "label": "High education"}
]
}
],
}
}
]
}
The advantage here of using a specific categorical type, is for its usability in a wide range of statistical software, e.g. R, SAS, SPSS, Stata, etc. (Whereas more complex & interrelated code lists are better modeled in the context of a database, with the primaryKey
/ foreignKey
references)
Are there any situations I'm missing here that a combination of categorical field types and resource references would not solve?
(Tagging @pschumm and @peterdesmet for their thoughts as well)
Thanks for the comments. We have also thought about using the foreignKeys
property. The main issue I have with using the foreignKeys
property, is that it is the property of the field and not the data set that a field should have codes from a code list/the field is a categorical variable. Therefore, I feel that this should be a property of the field descriptor.
Also in the current implementation/description it not really described what the foreign key should be used for. As far as I understand it just indicates another dataset that shares a key with the current data set. It could be a code list, but also something else completely. When working with code lists from R or python you will probably want to convert to factor, but then the code needs to now that it is a code list. This could be solved by adding something like a relation
property indicating the relation between the data sets.
Using both foreignKey
and categories
for basically the same thing also makes thing more complicated for tools and users. The sized of code lists are on a continuum: we have lists with a few codes to lists with thousands of codes. And it is not really possible to determine when and if a user wants to use a variable as a categorical variable or as is. I have run plenty of regressions with e.g. municipality as categorical variable (approx 350 categories); but also with educational level as numeric variable (we use numeric codes; that are ordered; which is fine when using tree-methods). The choice of putting the codes in foreignKeys
or in categories
then seems a bit arbitrary.
Thanks for your clarifications. I think my biggest concern regarding the codeList
approach is that it effectively duplicates a bunch of the functionality of primaryKey
and foreignKeys
, rather than extending them.
In the current implementation/description it not really described what the foreign key should be used for. It could be a code list, but also something else completely.
Right – and I think that's a good thing in this case, because when you're referencing codes in this way, there's a lot of different ways an implementation might want to follow these relationships. For example, for a dropdown selection widget for codes, you might want to list code abbreviations, but when you're populating a larger table, you might want to grab their full descriptions, for example. Which field an implementation will want to use to represent the level depends on the application, the current usage context, and the available properties of the levels…
Using both foreignKey and categories for basically the same thing also makes thing more complicated for tools and users.
I disagree on this point – I think there's a strong precedent for having both flat categorical value types and richer categorical level entities via table relationships. For example, DuckDB has categorical types as well as the ability to define primaryKey
/ foreignKey
relationships. Categorical types are used when your levels are simple values, and relationships are used when you're wanting to represent your levels as more complex & general entities (that is, objects with multiple properties).
Perhaps we could get the desired functionality here by extending the foreignKeys
definitions so they could receive a default human-readable label when importing into a flat value type (e.g. an R factor)…? maybe we could add a defaultLabelField
or something to that effect? Actually, what about this instead:
{
"name": "highest_education",
"resources": [
{
"name": "edulevel",
"format": "csv",
"mediatype": "text/csv",
"path": "edulevel.csv",
"encoding": "utf-8",
"schema": {
"fields": [
{
"name": "id",
"type": "integer"
},
{
"name": "place_of_residence",
"type": "string"
},
{
"name": "edu_level",
"type": "categorical",
"categories": [
{ "value": 1, "label": "LOW_EDU" },
{ "value": 2, "label": "MED_EDU" },
{ "value": 3, "label": "HIGH_EDU" }
]
}
],
"foreignKeys": {
"fields": "edu_level",
"reference": {
"resource": "codelist-edu_level",
"fields": "code"
}
}
}
},
{
"name": "codelist-edu_level",
"schema": {
"fields": [
{
"name": "code",
"type": "integer"
},
{
"name": "name",
"type": "string"
},
{
"name": "description",
"type": "string"
},
{
"name": "field_color",
"type": "categorical",
"categories": ["red", "green", "blue"]
}
],
"primaryKey": ["code"]
},
"data": [
{
"code": 1,
"name": "Low education",
"description": "Primary education",
"field_color": "red"
},
{
"code": 2,
"name": "Medium education",
"description": "Secondary education",
"field_color": "green"
},
{
"code": 3,
"name": "High education",
"description": "Tertiary education",
"field_color": "blue"
}
]
}
]
}
By using the categorical field type AND foreignKeys
together, the levels can act as entities with multiple properties for each level (and the ability to build hierarchies by following foreign key references across level entities), but when importing the field into statistical software like R as a flat value we have labels defined for the factor levels. That's actually pretty nice.
It's still rough for fields with 100s of levels, or when many variables that share the same categorical scales -- for those cases what if we just let the categories
field reference an external json with the array of value-label combos, like @peterdesmet mentioned on the last community call? Would that cover all of the use cases discussed so far?
Having spent a fair amount of time myself on #875 (and its precursor pattern) and believing that the extended discussion that led to it improved it considerably, I was initially not very keen to see this proposed as an alternative. However, I read this proposal and thought about it carefully, and I must admit that it has grown on me. I too use large code lists, as well as those with hierarchical structure (e.g., ICD9/10 codes, Multum drug codes, etc.), so I can appreciate those use cases. And I agree with @djvanderlaan that these are conceptually field-level properties as opposed to properties of the data resource, so using the foreignKeys
construct to represent them seems inappropriate.
That said, IIUC, this proposal addresses just two features that #875 does not provide:
Item (1) can be addressed using a JSON reference (as noted above by @fomcl and previously by @peterdesmet). Item (2) is a different matter, and I rather like the syntax proposed by @djvanderlaan above. I can see how that could be used to accomplish everything that #875 does.
But the rationale for #875 is a very specific one; namely, to facilitate (if not encourage) the use of categorical variables in the analytic software packages that support them. Item (2) strikes me as quite distinct from this. Thus, while I can appreciate the elegance in a more general solution that can accomplish everything that #875 does as well as Item (2) above, I do not believe that it is justified in this case. Specifically, IMO the syntax proposed in #875 (and especially the syntactic sugar described by @khusmann at the bottom of that discussion) would be considerably simpler and more intuitive for the most common use cases of categorical variables.
In sum, I agree with @khusmann above that addressing Item (2) with a separate strategy is warranted in this case, and I would support @djvanderlaan's proposal above for that purpose.
Thanks for the reply.
Being able to store code lists externally (in a separate file or using a url) is not just beneficial for large code lists. It also makes it easier for different datasets to use the same codelists. Many organisations have coordinated code lists for various subjects. Using a data resource allows one to link to such lists.
I can see the amount of effort that went into #875. It is perhaps also interesting to tell a bit about the history of the current proposal: We initially had an implementation that is similar to the one proposed in issue #875 (see https://github.com/frictionlessdata/specs/issues/156#issuecomment-67377176) . In order to allow for large code lists and reuse of code lists we wanted to allow that field to also refer to files. After discussion we concluded that it then made more sense to refer to a data resource instead of a file directly. And since a data resource allows for inline data anyway, we could then always refer to a data resource.
Edit: I removed a bit here as I didn't completely managed to formulate that the way I wanted. The gist is that I do not see a fundamental difference between a variable with a codelist and a categorical variable. From a user perspective the codelist proposal doesn't seem to be more complex (e.g. see demo: https://gist.github.com/djvanderlaan/f898bd8b4416dfe6157a7c45c616eecb )
I do like the 'syntactic sugar' proposed in the other proposal. This is difficult to do with the codelist proposal
{
"name": "gender",
"type": "string",
"categories": ["Female", "Male", "Other"]
}
Would it be possible to merge both proposals? The categories field can be
categories
field then a union of too many types? But the rationale for #875 is a very specific one; namely, to facilitate (if not encourage) the use of categorical variables in the analytic software packages that support them. Item (2) strikes me as quite distinct from this.
Agreed. #875 allows frictionless to be a drop in replacement for the proprietary formats currently dominating a bunch of scientific fields. I think it should stand as it is, and we should try to figure out how to work these features in as a somewhat separate concern.
It also makes it easier for different datasets to use the same codelists.
And similarly allows the same dataset to use the same codelist many times -- that's something I could really make use of in my data, along with the ability to store more metadata about categorical levels.
Would it be possible to merge both proposals? The categories field can be
An array of objects An array of strings (the syntactic sugar) The name of a dataresource
I like this direction. I would prefer though if we use an object instead of just the data resource name, so we can explicitly assign which fields the labels and values should come from. Something like this:
{
"name": "edu_level",
"type": "categorical",
"categories": {
"resource": "codelist-edu_level",
"valueField": "code",
"labelField": "name"
}
}
I still wish we could connect this to foreign keys somehow, because it's an existing practice in data warehousing for automatically recognizing and traversing properties of hierarchical categorical structures – see zillion for a good example. But I agree, it's nice to have it specified in the field itself rather than be a table-level prop. I suppose we can leave it up to implementations to recognize this as a foreign key situation.
Very late to this party. While I understand that it can be useful to represent a code list as a Data Resource, I consider code lists more similar to Table Schemas. A schema
can:
datapackage.json
Similarly, categories
(as proposed in #48) could be:
value
and label
Personally, I'm not in fan of "categories
could be a Data Resource". It's quite bespoke and thus requires more software implementation work. It would also require an alternative approach (than #48) to handling missing values that are included in the codeList.
An advantage of using a data resource instead of a separate json file, is that a data resource also has functionality for storing additional meta data for the code list. For example, in my file I could could have one variable that uses NACE to code companies. In the data resource for the categories of that field, I can then indicate using the title, description, author, license and source fields which specific version of NACE I am using, the license (which might be different from that of the data set itself) and also refer to the original author of that NACE classification. Without that, I would just have a list of codes and labels and would not even know I was looking at NACE codes (unless you put all that information in the description field of the field which then becomes quite overloaded).
Also, https://github.com/frictionlessdata/datapackage/pull/48#issuecomment-2062118008 mentions that it would be useful to be able to have one json file with a collection of definitions for categories. We already have an object to store a list of 'resources', namely a data package. If we can refer to a specific data resource in a datapackage, we can could store all categories definitions (inline) in one data package.
We work a lot with surveydata and administrative data. In both cases files often contain fields where the values in the field should come from a limited list of possible values. These values also have a specific meaning. Some examples:
Properties of these codes:
We are aware of the suggestion in issue #875 for supporting categories which is the same issue/problem. However, there are a few 'wishes' that are not covered by the suggestion in that issue and we believe the suggestion below is also easier to implement.
What we would like/need:
datapackage.json
) itself or have the codes in a file as large lists of codes make the meta data too bloated and this makes maintenance also more difficult. This file could be part of the datapackage or could be hosted externally.What we suggest:
codeList
to the FieldDescriptor. This MUST be string with the name of a DataResource in the DataPackage (if there is a syntax for referencing to a DataResource in an external DataPackage, this would also be valid).This has a number of advantages:
Code List Resource We don't yet have a concrete suggestion as to what should be in the dataset containing the code list and what format this dataset should have. We currently have an implementation that assumes that the first column in the dataset contains the codes and the second the labels of the codes. This is, however, minimal functionality. Some thoughts:
code
(orid
) andname
; with optional columnsdescription
,locale
andparent
(indicating missing values seems to be missing; I can check with SDMX experts how this is handled; probably using custom annotations).Example Possible example with both a codelist in a file and inline data:
@fomcl