frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense

Add pattern - Table Schema: Relationship between Fields #859

Closed: loco-philippe closed this 3 months ago

loco-philippe commented 6 months ago

This pull request adds a pattern to express relationships between Fields and defines an associated validation tool.

This PR follows issue #803, as proposed by @roll.

Several options are proposed for the associated descriptor.

Thank you very much for your review and comments.

Note: I also added a table of contents.

roll commented 6 months ago

Thanks @loco-philippe!

I'm going to request a review from the Working Group on Monday.

loco-philippe commented 6 months ago

Thank you @roll !

Is there enough information, or should I add something?

pschumm commented 5 months ago

First, let me say that I recognize how much work and thought has gone into this proposal. I read through it carefully, and have several thoughts:

1) Of all three options, I believe the most appropriate would be a new property in the table descriptor since, as the author notes, a relationship between two or more fields is not a property of an individual field per se.

2) I don't find the examples given to be very compelling. For example, I wouldn't typically store quarter if I had month, since quarter can be easily calculated from month. And if you do want to include it in the data, that relationship (i.e., what you are referring to as derived) can be easily validated (and clearly documented) by specifying a row constraint. Similarly, I'm not sure that your "nicknames example" truly demonstrates a coupled relationship, as the relationship between names and nicknames is many-to-one (i.e., two names may share the same nickname). Finally, your example of a crossed relationship not only represents a conceptual relationship between year and semester, but also a property of the specific dataset (i.e., that it includes an observation for every possible year × semester combination).

3) More generally, IIUC, what we are really talking about here is validating many-to-one (i.e., derived), one-to-one (i.e., coupled), and many-to-many (with all possible combinations present only once, or fully crossed) relationships between fields. And these can currently be accomplished with a row constraint (which not only verifies the cardinality of the derived relationship but also its precise functional form), by moving one field to a separate table and using the existing primaryKey and foreignKeys properties, or by providing a custom validation check.
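For illustration, a hedged sketch of the row-constraint approach for the month/quarter case, using the frictionless Python framework (the file name is hypothetical):

# Sketch: checking that quarter is derived from month with the built-in
# row-constraint check (frictionless v5); "dates.csv" is a hypothetical file.
from frictionless import Checklist, Resource, checks

checklist = Checklist(checks=[
    # verifies the exact functional form, not just the cardinality
    checks.row_constraint(formula="quarter == (month - 1) // 3 + 1"),
])
report = Resource("dates.csv").validate(checklist)
print(report.valid)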

Personally, I don't believe that these warrant an addition to the official table schema at the moment (I might be persuaded by different examples or use cases). That said, I do see the value in facilitating validation of these types of relationships, and would suggest as an alternative adding such things to the Python framework as built-in validation checks.

loco-philippe commented 5 months ago

In reply to @pschumm:

First, let me say that I recognize how much work and thought has gone into this proposal. I read through it carefully, and have several thoughts:

Thank you also for this relevant and constructive feedback!

  1. Of all three options, I believe the most appropriate would be a new property in the table descriptor since, as the author notes, a relationship between two or more fields is not a property of an individual field per se.

I agree with this analysis.

  2. I don't find the examples given to be very compelling. For example, I wouldn't typically store quarter if I had month, since quarter can be easily calculated from month. And if you do want to include it in the data, that relationship (i.e., what you are referring to as derived) can be easily validated (and clearly documented) by specifying a row constraint. Similarly, I'm not sure that your "nicknames example" truly demonstrates a coupled relationship, as the relationship between names and nicknames is many-to-one (i.e., two names may share the same nickname). Finally, your example of a crossed relationship not only represents a conceptual relationship between year and semester, but also a property of the specific dataset (i.e., that it includes an observation for every possible year × semester combination).

Indeed, the examples chosen are too basic and unrealistic (I used another example in the patterns file, but it can also be considered unrealistic). On the other hand, in the patterns file (chapter "principles") I also included a link to a real (simplified) open-data example concerning electric vehicle charging infrastructure (stations).

  3. More generally, IIUC, what we are really talking about here is validating many-to-one (i.e., derived), one-to-one (i.e., coupled), and many-to-many (with all possible combinations present only once, or fully crossed) relationships between fields. And these can currently be accomplished with a row constraint (which not only verifies the cardinality of the derived relationship but also its precise functional form), by moving one field to a separate table and using the existing primaryKey and foreignKeys properties, or by providing a custom validation check.

Personally, I don't believe that these warrant an addition to the official table schema at the moment (I might be persuaded by different examples or use cases).

The French IRVE open data (used for the simplified example shown above) is interesting to look at:

That said, I do see the value in facilitating validation of these types of relationships, and would suggest as an alternative adding such things to the Python framework as built-in validation checks.

Regarding the subject as a whole, in my opinion there are several points to consider:

pschumm commented 5 months ago

I have no comments on the choice of a solution. On the other hand, I am interested in the solution you propose and I would like to do some tests (is the v4 Python framework sufficient? Do you have other recommendations or examples that I can use?)

If you can, I would focus on v5 since it's been out for a while now. My suggestion would be to write a plugin to provide a new "relationship" check that takes three arguments: name of field 1, name of field 2, and type of relationship (i.e., derived, coupled or crossed). It might be helpful to look at the row-constraint check as an example.

Note that you will most likely want to specify all of validate_start() (to verify that both fields are present in the resource), validate_row() (to build up your two lists of values as the data are streamed), and validate_end() (to do the appropriate check based on the specified relationship).
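A minimal sketch of what such a check could look like against the frictionless v5 Check API follows; the class name, argument names, and the use of CheckError are illustrative assumptions, not an established part of the framework:

# Illustrative sketch only: a custom "relationship" check for frictionless v5.
import attrs
from frictionless import Check, errors

@attrs.define(kw_only=True, repr=False)
class relationship(Check):
    type = "relationship"
    Errors = [errors.CheckError]

    field1: str = ""
    field2: str = ""
    link: str = "derived"  # "derived", "coupled" or "crossed"

    def validate_start(self):
        # verify that both fields are present in the resource
        self.__pairs = set()
        for name in (self.field1, self.field2):
            if name not in self.resource.schema.field_names:
                yield errors.CheckError(note=f"relationship check: field {name!r} is missing")

    def validate_row(self, row):
        # accumulate the distinct value pairs as the data are streamed
        self.__pairs.add((row[self.field1], row[self.field2]))
        yield from []

    def validate_end(self):
        # check the cardinality implied by the declared relationship
        n_pairs = len(self.__pairs)
        n1 = len({v1 for v1, _ in self.__pairs})
        n2 = len({v2 for _, v2 in self.__pairs})
        ok = {
            "derived": n_pairs == n1,        # many-to-one
            "coupled": n_pairs == n1 == n2,  # one-to-one
            "crossed": n_pairs == n1 * n2,   # fully crossed
        }[self.link]
        if not ok:
            yield errors.CheckError(note=f"fields {self.field1!r} and {self.field2!r} are not {self.link}")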

Once your new check is available, you may then specify it in your data package via an inquiry. This makes it easy to validate the data, regardless of whether you're working with the CLI or the Python API. And as I said above, I think it would then be more appropriate to add this check to the list of built-in checks, rather than to add it to the official table schema.
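Usage might then look like the following; the descriptor keys follow the v5 checklist/inquiry shape, the paths are hypothetical, and descriptor-based use assumes the check has been registered through a plugin:

# Hypothetical usage of the sketched check, directly and via an inquiry.
from frictionless import Checklist, Inquiry, Resource

# direct validation with a checklist
report = Resource("data/dates.csv").validate(
    Checklist(checks=[relationship(field1="month", field2="quarter", link="derived")])
)

# or declaratively, via an inquiry descriptor (requires the check type to be
# resolvable, i.e. registered as a plugin)
inquiry = Inquiry.from_descriptor({
    "tasks": [{
        "path": "data/dates.csv",
        "checklist": {"checks": [
            {"type": "relationship", "field1": "month", "field2": "quarter", "link": "derived"},
        ]},
    }],
})
report = inquiry.validate()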

If you need help doing this, let me know (but I'm afraid I'm tied up with another project at the moment). And I defer to @roll if he wants to suggest an alternative strategy.

loco-philippe commented 5 months ago

Thank you @pschumm for these detailed answers!

I also won't have much time to work on this implementation in January, but I think I can work on it in February.

On the other hand, @roll, a question of methodology: how do we create a data schema from a data model if the notion of a relationship does not exist, and how do we then validate that a dataset respects the constraints expressed in the data model? It would be difficult to explain that Table Schema processes the data included in the Fields but does not concern itself with the structure of the dataset.

(of course, from my point of view, the answer is to include the notion of relationship in a data schema and use a corresponding validation tool!)

loco-philippe commented 5 months ago

Here is a first summary of the discussions:

@roll, next steps?

peterdesmet commented 5 months ago

This is an interesting proposal. @loco-philippe I am wondering if it can solve the use case for conditional requirements, brought up in #169 and https://github.com/tdwg/camtrap-dp/issues/32. E.g., for data:

observationType,scientificName
animal,Vulpes vulpes
blank,

Express that scientificName is required when observationType == animal.
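One way to approximate this requirement today is a row-constraint check; here is a sketch with the frictionless Python framework, assuming missing cells are read as None (the file name is hypothetical):

# Sketch: "scientificName is required when observationType == animal"
from frictionless import Checklist, Resource, checks

checklist = Checklist(checks=[
    # rows pass unless observationType is 'animal' and scientificName is missing
    checks.row_constraint(
        formula="observationType != 'animal' or scientificName != None"
    ),
])
report = Resource("observations.csv").validate(checklist)
print(report.valid)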

loco-philippe commented 5 months ago

This is an interesting proposal.

Thank you @peterdesmet !

@loco-philippe I am wondering if it can solve the use case for conditional requirements, brought up here #169 and tdwg/camtrap-dp#32? E.g. for data:

observationType,scientificName
animal,Vulpes vulpes
blank,

Express that scientificName is required when observationType == animal.

This topic is interesting and I have several ideas for dealing with it. They are formalized in this Notebook (working document).

I'm going to work on it for a while longer, keeping in mind:

and, at the same time, the way pandas treats this type of subject and the links with the 'relationships' proposal.

-> I welcome any ideas or comments on the subject!

peterdesmet commented 5 months ago

I welcome any ideas or comments on the subject!

FYI, the ifField suggestion in your notebook reminds me of a syntax we once invented at @inbo for data validation, which also had the option to define conditional constraints (defined at field level, not resource level): https://github.com/inbo/whip/blob/master/docs/syntax.md#if. It worked quite well, but we abandoned it in favour of Table Schema. I think you're on the right track investigating the link between relationships and conditional constraints.

loco-philippe commented 4 months ago

@pschumm @roll

As noted above, I was able to spend some time in February understanding Table Schema and implementing an example solution for the relationships descriptor. The solution presented in this Notebook works and shows that we can implement this property without too much difficulty (better integration is undoubtedly to be expected).

I also suggest, as indicated in the exchange, modifying the pattern to keep only the third option (a new Table descriptor) and deleting the two other options proposed for this descriptor (a new Field descriptor and a new constraint property). I will then also add a link to this example implementation (after taking your comments into account).

@roll, how do you see the continuation of this pattern?

loco-philippe commented 4 months ago

@peterdesmet

I took the time to look at how Table Schema works, and I think I can implement a solution for the conditional requirements question (before the end of February).

loco-philippe commented 4 months ago

@peterdesmet

Finally, the implementation is simpler than expected!

This second Notebook presents the solution (which works).

Do you have more complete examples that I can test to validate the proposed solution?

Note: how should this proposal be taken into account (an addition to the pattern document, comments added to open issues)?

peterdesmet commented 4 months ago

Do you have more complete examples that I can test to validate the proposed solution?

Yes, here's a more complex use case. It has two fields: measurementType and measurementValue.

The (slightly adapted) rules are:

  1. If measurementType = cloudiness then measurementValue:
     • type = string
     • constraints.enum = ["clear", "mostly clear", "partly cloudy", "mostly cloudy", "cloudy", "unknown"]
  2. If measurementType = temperature then measurementValue:
     • type = number
     • constraints.min = 0
     • constraints.max = 20
  3. If measurementType = wind force then measurementValue:
     • type = integer
     • constraints.enum = [0, 1, 2, 3, 4, 5]

Not sure all of this should be expressible in Table Schema, but it is a real use case. 😄

loco-philippe commented 4 months ago

Yes, it works (exactly as you express the constraints)!

I added this example in the Jupyter notebook (example 2).

pschumm commented 4 months ago

The title of this PR includes "Add pattern..."; however, @roll linked this to #862 as closing that issue, which involves promoting uniqueness constraints to the spec. So are we talking about a pattern or a change to the spec here? I just want to make sure we're clear on this when discussing.

While these various constraints are all important, I think we should try to articulate what makes a specific constraint appropriate for inclusion in the spec versus something that should be implemented using a checklist or inquiry. Otherwise, any potential constraint would be eligible for inclusion in the spec, and I don't think we want that if we are to avoid unnecessary complexity in the spec.

I'm afraid I can't offer a specific proposal at the moment, but I'm keen to hear others' thoughts on this.

loco-philippe commented 4 months ago

I admit that I did not understand the dependency between the "unique constraint" pattern and the "relationships" pattern.

Example:

In this example, the product and plants fields are derived:

product  plants     quantity  price
apple    fruit      1 kg      1
apple    fruit      10 kg     10
carrot   vegetable  1 kg      1.5
carrot   vegetable  10 kg     20

The relationship is validated for the previous table but not for the next one:

product  plants     quantity  price
apple    fruit      1 kg      1
apple    fruit      10 kg     10
carrot   vegetable  1 kg      1.5
carrot   fruit      10 kg     20

How can a unique constraint be used for this validation?
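For reference, the custom check sketched earlier in this thread would express this validation directly (file names are hypothetical):

# Sketch: validating the product/plants relationship with the hypothetical
# relationship check defined in an earlier comment.
from frictionless import Checklist, Resource

checklist = Checklist(checks=[
    relationship(field1="product", field2="plants", link="derived"),
])
print(Resource("orders-valid.csv").validate(checklist).valid)    # True
print(Resource("orders-invalid.csv").validate(checklist).valid)  # False: carrot maps to two plants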

khusmann commented 4 months ago

I appreciate all the thought that's gone into this proposal – representing relationships between fields is something that, in general, has a lot of interesting potential directions.

That said, I want to echo @pschumm's concern about this approach being considered for inclusion into the spec. In general, I think there's quite a bit of complexity here that might be better expressed in other ways (e.g. checklists or inquiries). I would prefer to let these ideas simmer a bit longer before promotion to spec-level.

In addition to checklists and inquiries, here are other routes we might want to consider:

  1. I think fields with "conditional types" (e.g. Peter's example) would be better represented by a "tagged union" field type. So rather than evaluating a sequence of "if-else" clauses, we'd represent it with a single match on a field discriminator. The advantage of match over the "if-else" pattern is that it encourages exhaustiveness checks as part of the validation (like the match keyword in Python, Rust, etc.).

Tagged unions have the benefit of being a well-established, well-understood abstraction already implemented across a myriad of data parsing and validation libraries: for example, Python's pydantic and TypeScript's zod.

Implementing this behavior as a tagged union field type would also have the advantage of making it a separate, self-contained proposal, decoupled from the more complex idea of "field relationships".

Here's an example of how a tagged union field type might look for @peterdesmet's example (using the categorical syntax proposed in #875):

{
  "fields": [
    {
      "name": "measurementType",
      "type": "categorical",
      "categories": ["cloudiness", "temperature", "wind force"]
    },
    {
      "name": "measurementValue",
      "type": "union",
      "tag": "measurementType",
      "match": {
        "cloudiness": {
          "type": "categorical",
          "categories": ["clear", "mostly clear", "partly cloudy", "mostly cloudy", "cloudy", "unknown"]
        },
        "temperature": {
          "type": "number",
          "constraints": {
            "min": 0,
            "max": 20
          }
        },
        "wind force": {
          "type": "categorical",
          "categories": [0, 1, 2, 3, 4, 5]
        }
      }
    }
  ]
}

Note that the field-level validation on this type would ensure that all the levels of the measurementType categorical field are represented as keys of the match property in the measurementValue field. For example, if temperature weren't defined as a key in the match property, this would trigger a validation error, because temperature is one of the levels of the measurementType field. As mentioned earlier, this is a common feature of tagged union types.
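For comparison, a rough pydantic v2 sketch of the same discriminated (tagged) union, dispatching on measurementType; all names are illustrative:

# Sketch: the measurementType/measurementValue example as a pydantic v2
# discriminated union; validation dispatches on the measurementType tag.
from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field, TypeAdapter

class Cloudiness(BaseModel):
    measurementType: Literal["cloudiness"]
    measurementValue: Literal["clear", "mostly clear", "partly cloudy",
                              "mostly cloudy", "cloudy", "unknown"]

class Temperature(BaseModel):
    measurementType: Literal["temperature"]
    measurementValue: float = Field(ge=0, le=20)

class WindForce(BaseModel):
    measurementType: Literal["wind force"]
    measurementValue: Literal[0, 1, 2, 3, 4, 5]

Measurement = Annotated[
    Union[Cloudiness, Temperature, WindForce],
    Field(discriminator="measurementType"),
]

# a valid row; a wrong value or an unknown tag raises a ValidationError
TypeAdapter(Measurement).validate_python(
    {"measurementType": "temperature", "measurementValue": 12.5}
)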

  2. Regarding unique constraints on tables with relationships between columns, I see this as a side effect of the data producer not normalizing their tables. If the tables were normalized, the validation would be completely taken care of by the existing primaryKey and foreignKeys schema patterns (and would have the additional advantage of being supported by a multitude of existing database implementations).

For example, the aforementioned apples table would be better captured by a datapackage with two resources instead of one; one resource to capture the product information, and the other to capture the order line items:

Resource "product-info":

product  plants
apple    fruit
carrot   vegetable

Resource "orders":

product  quantity  price
apple    1 kg      1
apple    10 kg     10
carrot   1 kg      1.5
carrot   10 kg     20

Now, we can use the following datapackage.json for full validation, with primaryKey and foreignKeys describing the relationship between tables:

{
  "name": "order-database",
  "resources": [
    {
      "name": "product-info",
      "schema": {
        "fields": [
          {
            "name": "product",
            "type": "string"
          },
          {
            "name": "plants",
            "type": "categorical",
            "categories": ["fruit", "vegetable", "flower", "herb"]
          }
        ],
        "primaryKey": "product"
      }
    },
    {
      "name": "orders",
      "schema": {
        "fields": [
          {
            "name": "product",
            "type": "string"
          },
          {
            "name": "quantity",
            "type": "number",
            "bareNumber": false
          },
          {
            "name": "price",
            "type": "number"
          }
        ],
        "foreignKeys": [
          {
            "fields": "product",
            "reference": {
              "resource": "product-info",
              "fields": "product"
            }
          }
        ]
      }
    }
  ]
}

With this datapackage, if the data consumer wishes to obtain a table with columns product, plants, quantity, price, it's simply a join of the two tables.
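As a sketch, that join could be materialized with the frictionless pandas plugin (resource names follow the datapackage.json above; to_pandas is assumed to map the primary key to the index, hence the reset_index):

# Sketch: rebuilding the denormalized view from the two normalized resources.
from frictionless import Package

package = Package("datapackage.json")
products = package.get_resource("product-info").to_pandas().reset_index()
orders = package.get_resource("orders").to_pandas()

# join on the foreign key to recover product, plants, quantity, price
full = orders.merge(products, on="product")
full = full[["product", "plants", "quantity", "price"]]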

I realize this requires the data producer to normalize their resource tables, and the producer may not want to for whatever reason. And that's fine! In those cases, I think checklists or inquiries should be used, as mentioned by @pschumm, because they offer full flexibility. But I think adding validation support for relationships within unnormalized tables directly into the spec goes a bit too far – it introduces quite a bit of complexity into the spec for something that is already solved with good database design (or ameliorated with checklists / inquiries).

loco-philippe commented 4 months ago

Thank you @khusmann for these comments.

Here are my remarks:

peterdesmet commented 4 months ago

@khusmann Oh, I like the tagged union approach to conditional validation! Very easy to read. Would you mind submitting this as a separate issue?

loco-philippe commented 4 months ago

I also suggest, as indicated in the exchange, modifying the pattern to keep only the third option (a new Table descriptor) and deleting the two other options proposed for this descriptor (a new Field descriptor and a new constraint property). I will then also add a link to this example implementation (after taking your comments into account).

As indicated in the comment above, I have updated the pattern document (only the third option is kept, with a link to the custom check).

@roll, next steps?

  • 1 - validation of the PR (are other exchanges to be expected before the PR is validated?)
  • 2 - decision to take the "relationship" table descriptor into account (what is the decision process, e.g. a vote?)
  • 3 - choice of validation tool (to be seen after steps 1 and 2 and after a test phase)

I didn't get any feedback on this point (I had in mind a process similar to the one I used for the pandas specifications).

Other questions:

roll commented 3 months ago

@loco-philippe Thanks a lot for the tremendous work and to everyone who participated in the discussion!

As per the Working Group discussion, from Data Package (v2) onwards, patterns (recipes) become more autonomous, so we are merging this as it stands, based on your vision; you will control the subsequent evolution of the approach based on user feedback and implementation experience.

roll commented 3 months ago

Also published here - https://datapackage.org/recipes/relationship-between-fields/

loco-philippe commented 3 months ago

@roll, @peterdesmet

Thanks roll,

I think the proposed solution (recipes) is good. It allows a proposal to be made available while waiting for user feedback. But it also seems necessary to me, on the one hand, to consider the keyword 'relationship' as assigned or reserved and, on the other hand, to have an area for sharing custom_checks. I propose creating an issue for each of these two points.

Furthermore, the discussion in this PR also focused on conditional constraints (see the request from @peterdesmet). It would also be interesting to formalize that proposal in the form of another recipe (I will add a comment to this effect in issue #169, identified by Peter). What do you think?

roll commented 3 months ago

Thanks @loco-philippe,

Thanks! It will be great! Also, it's a good point about a recipe namespace for properties. I think we need to discuss it at the next Working Group meeting (cc @sapetti9).