frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense
498 stars 113 forks source link

Introduce a codeList property to the field descriptor #888

Closed djvanderlaan closed 3 weeks ago

djvanderlaan commented 8 months ago

We work a lot with surveydata and administrative data. In both cases files often contain fields where the values in the field should come from a limited list of possible values. These values also have a specific meaning. Some examples:

Properties of these codes:

We are aware of the suggestion in issue #875 for supporting categories which is the same issue/problem. However, there are a few 'wishes' that are not covered by the suggestion in that issue and we believe the suggestion below is also easier to implement.

What we would like/need:

What we suggest:

This has a number of advantages:

  1. DataResources allow for inline data in the data property; for files in the DataPackage using the path property with a relative path and for external data using the path property with a URL.
  2. Furthermore, we can reuse properties like name, descriptor source and license and schema to describe the code list.
  3. We can use all of the properties and tooling that are available already for DataResources. This makes implementation less work. Indicating a codelist for a field can already be useful without specific tooling for codelists. The user can see that the field uses a codelist and can manually read the corresponding data resource using currently existing tooling.

Code List Resource We don't yet have a concrete suggestion as to what should be in the dataset containing the code list and what format this dataset should have. We currently have an implementation that assumes that the first column in the dataset contains the codes and the second the labels of the codes. This is, however, minimal functionality. Some thoughts:

Example Possible example with both a codelist in a file and inline data:

{ 
  "name": "highest_education",
  "resources": [
    { 
      "name": "edulevel",
      "format": "csv",
      "mediatype": "text/csv",
      "path": "edulevel.csv",
      "encoding": "utf-8",
      "schema": { 
        "fields": [
          { 
            "name": "id",
            "type": "integer"
          }, {
            "name": "place_of_residence",
            "type": "string",
            "codeList": "codelist-regions"
          }, { 
            "name": "edu_level",
            "type": "integer",
            "codeList": "codelist-edu_level"
          }
        ]
      }
    }, {
      "name": "codelist-regions",
      "schema": { 
        "fields": [
          { 
            "name": "code",
            "type": "string"
          }, { 
            "name": "name",
            "type": "string"
          }, {
            "name": "parent",
            "type": "string"
          } 
        ]
      },
      "path": "codelist-regions.csv"
    }, {
      "name": "codelist-edu_level",
      "schema": { 
        "fields": [
          { 
            "name": "code",
            "type": "integer"
          }, { 
            "name": "name",
            "type": "string"
          }
        ]
      },
      "data": [
        {"code": 1, "name": "Low education"},
        {"code": 2, "name": "Medium education"},
        {"code": 3, "name": "High education"}
      ]
    }
  ]
}

@fomcl

fomcl commented 8 months ago

I like this proposal. I have, however, two suggestions. Both are inspired by Swagger's $ref notation

External codeLists

A reference to an external codeList may be made using JSON Reference syntax (RFC3986) and JSON Pointer syntax (RFC6901)

Example:

...
"path": "http://path/to/your/frictionless.json#/codelist-edu_level"
...

Non-basic codeLists

Jan's example assumes a "code" and a "label" in the first and second column in the file, respectively. This will work in many cases and it is easy and clean to specify this. However, we become more flexible by allowing an optional codeListSpec attribute that has codeField and labelField keys.

Tabular data

The codeField and labelField attributes take either a column name (string) or a column number (int).

Example:

...
          }, { 
            "name": "edu_level",
            "type": "integer",
            "codeList": "codelist-edu_level",
            "codeListSpec":  {"codeField":  "MyCode", 
                                           "labelField": "TheLabel"}
          }
...

Non-tabular data (json, xml)

The codeField and labelField keys also use JSON Pointer notation to specify the codes and labels in a hierarchical json or xml data source. This source may typically be the response of an HTTP request to a REST API.

Example:

"path": "http://some-server.com/api/v1/data/#common/prefix",
...
...
          }, { 
            "name": "edu_level",
            "type": "integer",
            "codeList": "codelist-edu_level",
            "codeListSpec":  {"codeField":  "/bar/code", 
                                          "labelField": "/foo/label"}
          }

...

Complex data

What about other data sources, like the more complicated ICD-10? In cases like this, a (custom) mimetype/mediatype (e.g. application/vnd.sdmx.generic+xml;version=2.1) might be the only thing we can do in terms of the specification. This indicates that it's up to the client tool how to interpret that data source.

khusmann commented 8 months ago

Thanks for jumping into the conversation here! Coded categorical data is something I'm very interested in for frictionless as well.

I think the functionality you're after here might be more generally modeled / solved by indicating primaryKey / foreignKey relationships between tables. This would have the added benefit of having these relationships be exportable to formats like SQL.

For example, the example you include could be modeled like this:

{ 
  "name": "highest_education",
  "resources": [
    { 
      "name": "edulevel",
      "format": "csv",
      "mediatype": "text/csv",
      "path": "edulevel.csv",
      "encoding": "utf-8",
      "schema": { 
        "fields": [
          { 
            "name": "id",
            "type": "integer"
          }, {
            "name": "place_of_residence",
            "type": "string",
          }, { 
            "name": "edu_level",
            "type": "integer",
          }
        ],
        "primaryKey": ["id"],
        "foreignKeys": [
          {
            "fields": "place_of_residence",
            "reference": {
              "resource": "codelist-regions",
              "fields": "code"
            }
          }, {
            "fields": "edu_level",
            "reference": {
              "resource": "codelist-edu_level",
              "fields": "code"
            }
          }
        ]
      }
    }, {
      "name": "codelist-regions",
      "schema": { 
        "fields": [
          { 
            "name": "code",
            "type": "string"
          }, { 
            "name": "name",
            "type": "string"
          }, {
            "name": "parent",
            "type": "string"
          } 
        ],
        "primaryKey": ["code"],
        "foreignKeys": [
            "fields": "parent",
            "reference": {
              "resource": "",
              "fields": "code"
            }
        ]
      },
      "path": "codelist-regions.csv",
    }, {
      "name": "codelist-edu_level",
      "schema": { 
        "fields": [
          { 
            "name": "code",
            "type": "integer"
          }, { 
            "name": "name",
            "type": "string"
          }
        ],
       "primaryKey": ["code"]
      },
      "data": [
        {"code": 1, "name": "Low education"},
        {"code": 2, "name": "Medium education"},
        {"code": 3, "name": "High education"}
      ]
    }
  ]
}

I think this solves almost all the requirements you listed above:

✔️ Possibility to indicate that a given field should use values/codes from a given list. ❌ It should be possible to store this list in the datapackage meta data (datapackage.json) itself ✔️ Have the codes in a file as large lists of codes make the metadata too bloated and this makes maintenance also more difficult. This file could be part of the datapackage or could be hosted externally. ✔️ It should preferably be possible to define hierarchies in the codes. (Notice the self reference in the region codes!)

It additionally has the benefits of allowing for more flexible naming of fields, being SQL compatible, as mentioned above, and leveraging all of the existing validation built around primary & foreign keys.

For the item that it doesn't address (The simple case of using values / codes inline when they are not long & do not have other complex relationships), I think the current categorical proposal #875 is still a good solution:

{ 
  "name": "highest_education",
  "resources": [
    { 
      "name": "edulevel",
      "format": "csv",
      "mediatype": "text/csv",
      "path": "edulevel.csv",
      "encoding": "utf-8",
      "schema": { 
        "fields": [
          { 
            "name": "id",
            "type": "integer"
          }, {
            "name": "place_of_residence",
            "type": "string",
          }, { 
            "name": "edu_level",
            "type": "categorical",
            "categories": [
              {"value": 1, "label": "Low education"},
              {"value": 2, "label": "Medium education"},
              {"value": 3, "label": "High education"}
            ]
          }
        ],
      }
    }
  ]
}

The advantage here of using a specific categorical type, is for its usability in a wide range of statistical software, e.g. R, SAS, SPSS, Stata, etc. (Whereas more complex & interrelated code lists are better modeled in the context of a database, with the primaryKey / foreignKey references)

Are there any situations I'm missing here that a combination of categorical field types and resource references would not solve?

khusmann commented 8 months ago

(Tagging @pschumm and @peterdesmet for their thoughts as well)

djvanderlaan commented 8 months ago

Thanks for the comments. We have also thought about using the foreignKeys property. The main issue I have with using the foreignKeys property, is that it is the property of the field and not the data set that a field should have codes from a code list/the field is a categorical variable. Therefore, I feel that this should be a property of the field descriptor.

Also in the current implementation/description it not really described what the foreign key should be used for. As far as I understand it just indicates another dataset that shares a key with the current data set. It could be a code list, but also something else completely. When working with code lists from R or python you will probably want to convert to factor, but then the code needs to now that it is a code list. This could be solved by adding something like a relation property indicating the relation between the data sets.

Using both foreignKey and categories for basically the same thing also makes thing more complicated for tools and users. The sized of code lists are on a continuum: we have lists with a few codes to lists with thousands of codes. And it is not really possible to determine when and if a user wants to use a variable as a categorical variable or as is. I have run plenty of regressions with e.g. municipality as categorical variable (approx 350 categories); but also with educational level as numeric variable (we use numeric codes; that are ordered; which is fine when using tree-methods). The choice of putting the codes in foreignKeys or in categories then seems a bit arbitrary.

khusmann commented 8 months ago

Thanks for your clarifications. I think my biggest concern regarding the codeList approach is that it effectively duplicates a bunch of the functionality of primaryKey and foreignKeys, rather than extending them.

In the current implementation/description it not really described what the foreign key should be used for. It could be a code list, but also something else completely.

Right – and I think that's a good thing in this case, because when you're referencing codes in this way, there's a lot of different ways an implementation might want to follow these relationships. For example, for a dropdown selection widget for codes, you might want to list code abbreviations, but when you're populating a larger table, you might want to grab their full descriptions, for example. Which field an implementation will want to use to represent the level depends on the application, the current usage context, and the available properties of the levels…

Using both foreignKey and categories for basically the same thing also makes thing more complicated for tools and users.

I disagree on this point – I think there's a strong precedent for having both flat categorical value types and richer categorical level entities via table relationships. For example, DuckDB has categorical types as well as the ability to define primaryKey / foreignKey relationships. Categorical types are used when your levels are simple values, and relationships are used when you're wanting to represent your levels as more complex & general entities (that is, objects with multiple properties).

Perhaps we could get the desired functionality here by extending the foreignKeys definitions so they could receive a default human-readable label when importing into a flat value type (e.g. an R factor)…? maybe we could add a defaultLabelField or something to that effect? Actually, what about this instead:

{
  "name": "highest_education",
  "resources": [
    {
      "name": "edulevel",
      "format": "csv",
      "mediatype": "text/csv",
      "path": "edulevel.csv",
      "encoding": "utf-8",
      "schema": {
        "fields": [
          {
            "name": "id",
            "type": "integer"
          },
          {
            "name": "place_of_residence",
            "type": "string"
          },
          {
            "name": "edu_level",
            "type": "categorical",
            "categories": [
              { "value": 1, "label": "LOW_EDU" },
              { "value": 2, "label": "MED_EDU" },
              { "value": 3, "label": "HIGH_EDU" }
            ]
          }
        ],
        "foreignKeys": {
          "fields": "edu_level",
          "reference": {
            "resource": "codelist-edu_level",
            "fields": "code"
          }
        }
      }
    },
    {
      "name": "codelist-edu_level",
      "schema": {
        "fields": [
          {
            "name": "code",
            "type": "integer"
          },
          {
            "name": "name",
            "type": "string"
          },
          {
            "name": "description",
            "type": "string"
          },
          {
            "name": "field_color",
            "type": "categorical",
            "categories": ["red", "green", "blue"]
          }
        ],
        "primaryKey": ["code"]
      },
      "data": [
        {
          "code": 1,
          "name": "Low education",
          "description": "Primary education",
          "field_color": "red"
        },
        {
          "code": 2,
          "name": "Medium education",
          "description": "Secondary education",
          "field_color": "green"
        },
        {
          "code": 3,
          "name": "High education",
          "description": "Tertiary education",
          "field_color": "blue"
        }
      ]
    }
  ]
}

By using the categorical field type AND foreignKeys together, the levels can act as entities with multiple properties for each level (and the ability to build hierarchies by following foreign key references across level entities), but when importing the field into statistical software like R as a flat value we have labels defined for the factor levels. That's actually pretty nice.

It's still rough for fields with 100s of levels, or when many variables that share the same categorical scales -- for those cases what if we just let the categories field reference an external json with the array of value-label combos, like @peterdesmet mentioned on the last community call? Would that cover all of the use cases discussed so far?

pschumm commented 8 months ago

Having spent a fair amount of time myself on #875 (and its precursor pattern) and believing that the extended discussion that led to it improved it considerably, I was initially not very keen to see this proposed as an alternative. However, I read this proposal and thought about it carefully, and I must admit that it has grown on me. I too use large code lists, as well as those with hierarchical structure (e.g., ICD9/10 codes, Multum drug codes, etc.), so I can appreciate those use cases. And I agree with @djvanderlaan that these are conceptually field-level properties as opposed to properties of the data resource, so using the foreignKeys construct to represent them seems inappropriate.

That said, IIUC, this proposal addresses just two features that #875 does not provide:

  1. Handling very large code lists (e.g., hundreds of values or more) that would be extremely awkward to handle using the syntax proposed in #875; and
  2. Representing additional information beyond the set of codes itself, such as a hierarchical structure among them (or, for that matter, any arbitrary information about the codes).

Item (1) can be addressed using a JSON reference (as noted above by @fomcl and previously by @peterdesmet). Item (2) is a different matter, and I rather like the syntax proposed by @djvanderlaan above. I can see how that could be used to accomplish everything that #875 does.

But the rationale for #875 is a very specific one; namely, to facilitate (if not encourage) the use of categorical variables in the analytic software packages that support them. Item (2) strikes me as quite distinct from this. Thus, while I can appreciate the elegance in a more general solution that can accomplish everything that #875 does as well as Item (2) above, I do not believe that it is justified in this case. Specifically, IMO the syntax proposed in #875 (and especially the syntactic sugar described by @khusmann at the bottom of that discussion) would be considerably simpler and more intuitive for the most common use cases of categorical variables.

In sum, I agree with @khusmann above that addressing Item (2) with a separate strategy is warranted in this case, and I would support @djvanderlaan's proposal above for that purpose.

djvanderlaan commented 8 months ago

Thanks for the reply.

Being able to store code lists externally (in a separate file or using a url) is not just beneficial for large code lists. It also makes it easier for different datasets to use the same codelists. Many organisations have coordinated code lists for various subjects. Using a data resource allows one to link to such lists.

I can see the amount of effort that went into #875. It is perhaps also interesting to tell a bit about the history of the current proposal: We initially had an implementation that is similar to the one proposed in issue #875 (see https://github.com/frictionlessdata/specs/issues/156#issuecomment-67377176) . In order to allow for large code lists and reuse of code lists we wanted to allow that field to also refer to files. After discussion we concluded that it then made more sense to refer to a data resource instead of a file directly. And since a data resource allows for inline data anyway, we could then always refer to a data resource.

Edit: I removed a bit here as I didn't completely managed to formulate that the way I wanted. The gist is that I do not see a fundamental difference between a variable with a codelist and a categorical variable. From a user perspective the codelist proposal doesn't seem to be more complex (e.g. see demo: https://gist.github.com/djvanderlaan/f898bd8b4416dfe6157a7c45c616eecb )

I do like the 'syntactic sugar' proposed in the other proposal. This is difficult to do with the codelist proposal

{
  "name": "gender",
  "type": "string",
  "categories": ["Female", "Male", "Other"]
} 

Would it be possible to merge both proposals? The categories field can be

  1. An array of objects
  2. An array of strings (the syntactic sugar)
  3. The name of a dataresource (I am not completely sure if (1) is really needed when you can have inline data in (3)). Or is the categories field then a union of too many types?
khusmann commented 8 months ago

But the rationale for #875 is a very specific one; namely, to facilitate (if not encourage) the use of categorical variables in the analytic software packages that support them. Item (2) strikes me as quite distinct from this.

Agreed. #875 allows frictionless to be a drop in replacement for the proprietary formats currently dominating a bunch of scientific fields. I think it should stand as it is, and we should try to figure out how to work these features in as a somewhat separate concern.

It also makes it easier for different datasets to use the same codelists.

And similarly allows the same dataset to use the same codelist many times -- that's something I could really make use of in my data, along with the ability to store more metadata about categorical levels.

Would it be possible to merge both proposals? The categories field can be

An array of objects An array of strings (the syntactic sugar) The name of a dataresource

I like this direction. I would prefer though if we use an object instead of just the data resource name, so we can explicitly assign which fields the labels and values should come from. Something like this:

{
  "name": "edu_level",
  "type": "categorical",
  "categories": {
    "resource": "codelist-edu_level",
    "valueField": "code",
    "labelField": "name"
  }
}

I still wish we could connect this to foreign keys somehow, because it's an existing practice in data warehousing for automatically recognizing and traversing properties of hierarchical categorical structures – see zillion for a good example. But I agree, it's nice to have it specified in the field itself rather than be a table-level prop. I suppose we can leave it up to implementations to recognize this as a foreign key situation.

peterdesmet commented 7 months ago

Very late to this party. While I understand that it can be useful to represent a code list as a Data Resource, I consider code lists more similar to Table Schemas. A schema can:

  1. Be included verbosely in datapackage.json
  2. Be referenced with a path or URL

Similarly, categories (as proposed in #48) could be:

  1. Included verbosely as an array of objects
  2. Included verbosely as an array of strings
  3. Be referenced with a path or URL (I've just suggested this in #48):
    • This file could be local or remote
    • This file would need to be JSON
    • This file would need to have the fields value and label
    • This file could have other properties, like hierarchy
    • This file could still be described as a Data Resource if you want to

Personally, I'm not in fan of "categories could be a Data Resource". It's quite bespoke and thus requires more software implementation work. It would also require an alternative approach (than #48) to handling missing values that are included in the codeList.

djvanderlaan commented 6 months ago

An advantage of using a data resource instead of a separate json file, is that a data resource also has functionality for storing additional meta data for the code list. For example, in my file I could could have one variable that uses NACE to code companies. In the data resource for the categories of that field, I can then indicate using the title, description, author, license and source fields which specific version of NACE I am using, the license (which might be different from that of the data set itself) and also refer to the original author of that NACE classification. Without that, I would just have a list of codes and labels and would not even know I was looking at NACE codes (unless you put all that information in the description field of the field which then becomes quite overloaded).

Also, https://github.com/frictionlessdata/datapackage/pull/48#issuecomment-2062118008 mentions that it would be useful to be able to have one json file with a collection of definitions for categories. We already have an object to store a list of 'resources', namely a data package. If we can refer to a specific data resource in a datapackage, we can could store all categories definitions (inline) in one data package.