frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense

Add pattern - Table Schema: Relationship between Fields #859

Closed: loco-philippe closed this 3 months ago

loco-philippe commented 6 months ago

This pull request adds a pattern to express relationships between Fields and defines an associated validation tool.

This PR follows issue #803, as proposed by @roll.

Several options are proposed for the associated descriptor.

Thank you very much for your review and comments.

Note: I also added a table of contents.

roll commented 6 months ago

Thanks @loco-philippe!

I'm going to request a review from the Working Group on Monday.

loco-philippe commented 6 months ago

Thank you @roll !

Is there enough information, or should I add something?

pschumm commented 5 months ago

First, let me say that I recognize how much work and thought has gone into this proposal. I read through it carefully, and have several thoughts:

1) Of all three options, I believe the most appropriate would be a new property in the table descriptor since, as the author notes, a relationship between two or more fields is not a property of an individual field per se.

2) I don't find the examples given to be very compelling. For example, I wouldn't typically store quarter if I had month, since quarter can be easily calculated from month. And if you do want to include it in the data, that relationship (i.e., what you are referring to as derived) can be easily validated (and clearly documented) by specifying a row constraint. Similarly, I'm not sure that your "nicknames example" truly demonstrates a coupled relationship, as the relationship between names and nicknames is many-to-one (i.e., two names may share the same nickname). Finally, your example of a crossed relationship not only represents a conceptual relationship between year and semester, but also a property of the specific dataset (i.e., that it includes an observation for every possible year × semester combination).

3) More generally, IIUC, what we are really talking about here is validating many-to-one (i.e., derived), one-to-one (i.e., coupled), and many-to-many (with all possible combinations present only once, or fully crossed) relationships between fields. And these can currently be accomplished with a row constraint (which not only verifies the cardinality of the derived relationship but also its precise functional form), by moving one field to a separate table and using the existing primaryKey and foreignKeys properties, or by providing a custom validation check.
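For illustration, a hedged sketch of the row-constraint approach for the month/quarter case, using the frictionless Python framework (the file name is hypothetical):

# Sketch: checking that quarter is derived from month with the built-in
# row-constraint check (frictionless v5); "dates.csv" is a hypothetical file.
from frictionless import Checklist, Resource, checks

checklist = Checklist(checks=[
    # verifies the exact functional form, not just the cardinality
    checks.row_constraint(formula="quarter == (month - 1) // 3 + 1"),
])
report = Resource("dates.csv").validate(checklist)
print(report.valid)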

Personally, I don't believe that these warrant an addition to the official table schema at the moment (I might be persuaded by different examples or use cases). That said, I do see the value in facilitating validation of these types of relationships, and would suggest as an alternative adding such things to the Python framework as built-in validation checks.

loco-philippe commented 5 months ago

In reply to @pschumm:

First, let me say that I recognize how much work and thought has gone into this proposal. I read through it carefully, and have several thoughts:

Thank you also for this relevant and constructive feedback!

  1. Of all three options, I believe the most appropriate would be a new property in the table descriptor since, as the author notes, a relationship between two or more fields is not a property of an individual field per se.

I agree with this analysis.

  2. I don't find the examples given to be very compelling. For example, I wouldn't typically store quarter if I had month, since quarter can be easily calculated from month. And if you do want to include it in the data, that relationship (i.e., what you are referring to as derived) can be easily validated (and clearly documented) by specifying a row constraint. Similarly, I'm not sure that your "nicknames example" truly demonstrates a coupled relationship, as the relationship between names and nicknames is many-to-one (i.e., two names may share the same nickname). Finally, your example of a crossed relationship not only represents a conceptual relationship between year and semester, but also a property of the specific dataset (i.e., that it includes an observation for every possible year × semester combination).

Indeed, the examples chosen are too basic and unrealistic (I used another example in the patterns file, but it can also be considered unrealistic). On the other hand, in the patterns file (chapter "principles") I also included a link to a real (simplified) open-data example concerning electric vehicle charging infrastructure (stations).

  3. More generally, IIUC, what we are really talking about here is validating many-to-one (i.e., derived), one-to-one (i.e., coupled), and many-to-many (with all possible combinations present only once, or fully crossed) relationships between fields. And these can currently be accomplished with a row constraint (which not only verifies the cardinality of the derived relationship but also its precise functional form), by moving one field to a separate table and using the existing primaryKey and foreignKeys properties, or by providing a custom validation check.

Personally, I don't believe that these warrant an addition to the official table schema at the moment (I might be persuaded by different examples or use cases).

The French IRVE open data (used for the simplified example shown above) is interesting to look at:

That said, I do see the value in facilitating validation of these types of relationships, and would suggest as an alternative adding such things to the Python framework as built-in validation checks.

Regarding the subject as a whole, in my opinion there are several points to consider:

pschumm commented 5 months ago

I have no comments on the choice of a solution. On the other hand, I am interested in the solution you propose and I would like to do some tests (is the v4 Python framework sufficient? Do you have other recommendations or examples that I can use?)

If you can, I would focus on v5 since it's been out for a while now. My suggestion would be to write a plugin to provide a new "relationship" check that takes three arguments: name of field 1, name of field 2, and type of relationship (i.e., derived, coupled or crossed). It might be helpful to look at the row-constraint check as an example.

Note that you will most likely want to specify all of validate_start() (to verify that both fields are present in the resource), validate_row() (to build up your two lists of values as the data are streamed), and validate_end() (to do the appropriate check based on the specified relationship).
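A minimal sketch of what such a check could look like against the frictionless v5 Check API follows; the class name, argument names, and the use of CheckError are illustrative assumptions, not an established part of the framework:

# Illustrative sketch only: a custom "relationship" check for frictionless v5.
import attrs
from frictionless import Check, errors

@attrs.define(kw_only=True, repr=False)
class relationship(Check):
    type = "relationship"
    Errors = [errors.CheckError]

    field1: str = ""
    field2: str = ""
    link: str = "derived"  # "derived", "coupled" or "crossed"

    def validate_start(self):
        # verify that both fields are present in the resource
        self.__pairs = set()
        for name in (self.field1, self.field2):
            if name not in self.resource.schema.field_names:
                yield errors.CheckError(note=f"relationship check: field {name!r} is missing")

    def validate_row(self, row):
        # accumulate the distinct value pairs as the data are streamed
        self.__pairs.add((row[self.field1], row[self.field2]))
        yield from []

    def validate_end(self):
        # check the cardinality implied by the declared relationship
        n_pairs = len(self.__pairs)
        n1 = len({v1 for v1, _ in self.__pairs})
        n2 = len({v2 for _, v2 in self.__pairs})
        ok = {
            "derived": n_pairs == n1,        # many-to-one
            "coupled": n_pairs == n1 == n2,  # one-to-one
            "crossed": n_pairs == n1 * n2,   # fully crossed
        }[self.link]
        if not ok:
            yield errors.CheckError(note=f"fields {self.field1!r} and {self.field2!r} are not {self.link}")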

Once your new check is available, you may then specify it in your data package via an inquiry. This makes it easy to validate the data, regardless of whether you're working with the CLI or the Python API. And as I said above, I think it would then be more appropriate to add this check to the list of built-in checks, rather than to add it to the official table schema.
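Usage might then look like the following; the descriptor keys follow the v5 checklist/inquiry shape, the paths are hypothetical, and descriptor-based use assumes the check has been registered through a plugin:

# Hypothetical usage of the sketched check, directly and via an inquiry.
from frictionless import Checklist, Inquiry, Resource

# direct validation with a checklist
report = Resource("data/dates.csv").validate(
    Checklist(checks=[relationship(field1="month", field2="quarter", link="derived")])
)

# or declaratively, via an inquiry descriptor (requires the check type to be
# resolvable, i.e. registered as a plugin)
inquiry = Inquiry.from_descriptor({
    "tasks": [{
        "path": "data/dates.csv",
        "checklist": {"checks": [
            {"type": "relationship", "field1": "month", "field2": "quarter", "link": "derived"},
        ]},
    }],
})
report = inquiry.validate()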

If you need help doing this, let me know (but I'm afraid I'm tied up with another project at the moment). And I defer to @roll if he wants to suggest an alternative strategy.

loco-philippe commented 5 months ago

Thank you @pschumm for these detailed answers!

I also won't have much time to work on this implementation in January, but I think I can work on it in February.

On the other hand, @roll, a question of methodology: how do we create a data schema from a data model if the notion of a relationship does not exist, and how do we then validate that a dataset respects the constraints expressed in the data model? It would be difficult to explain that Table Schema processes the data included in the Fields but does not concern itself with the structure of the dataset.

(of course, from my point of view, the answer is to include the notion of relationship in a data schema and use a corresponding validation tool!)

loco-philippe commented 5 months ago

Here is a first summary of the discussions:

@roll, next steps?

peterdesmet commented 5 months ago

This is an interesting proposal. @loco-philippe I am wondering if it can solve the use case for conditional requirements, brought up in #169 and https://github.com/tdwg/camtrap-dp/issues/32. E.g., for data:

observationType,scientificName
animal,Vulpes vulpes
blank,

Express that scientificName is required when observationType == animal.
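One way to approximate this requirement today is a row-constraint check; here is a sketch with the frictionless Python framework, assuming missing cells are read as None (the file name is hypothetical):

# Sketch: "scientificName is required when observationType == animal"
from frictionless import Checklist, Resource, checks

checklist = Checklist(checks=[
    # rows pass unless observationType is 'animal' and scientificName is missing
    checks.row_constraint(
        formula="observationType != 'animal' or scientificName != None"
    ),
])
report = Resource("observations.csv").validate(checklist)
print(report.valid)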

loco-philippe commented 5 months ago

This is an interesting proposal.

Thank you @peterdesmet !

@loco-philippe I am wondering if it can solve the use case for conditional requirements, brought up here #169 and tdwg/camtrap-dp#32? E.g. for data:

observationType,scientificName
animal,Vulpes vulpes
blank,

Express that scientificName is required when observationType == animal.

This topic is interesting and I have several ideas for dealing with it. They are formalized in this Notebook (working document).

I'm going to work on it for a while longer, keeping in mind:

and, at the same time, the way pandas treats this type of subject and the links with the 'relationships' proposal.

-> I welcome any ideas or comments on the subject!

peterdesmet commented 5 months ago

I welcome any ideas or comments on the subject!

FYI, the ifField suggestion in your notebook reminds me of a syntax we once invented at @inbo for data validation, which also had the option to define conditional constraints (defined at field level, not resource level): https://github.com/inbo/whip/blob/master/docs/syntax.md#if. It worked quite well, but we abandoned it in favour of Table Schema. I think you're on the right track investigating the link between relationships and conditional constraints.

loco-philippe commented 4 months ago

@pschumm @roll

As noted above, I was able to spend some time in February understanding Table Schema and implementing an example solution for the relationships descriptor. The solution presented in this Notebook works and shows that we can implement this property without too much difficulty (better integration is undoubtedly to be expected).

I also suggest, as indicated in the exchange, modifying the pattern to keep only the third option (a new Table descriptor) and deleting the two other options proposed for this descriptor (a new Field descriptor and a new constraint property). I will then also add a link to this example implementation (after taking your comments into account).

@roll, how do you see the continuation of this pattern?

loco-philippe commented 4 months ago

@peterdesmet

I took the time to look at how Table Schema works, and I think I can implement a solution for the conditional requirements question (before the end of February).

loco-philippe commented 4 months ago

@peterdesmet

Finally, the implementation is simpler than expected!

This second Notebook presents the solution (which works).

Do you have more complete examples that I can test to validate the proposed solution?

Note: how should this proposal be taken into account (an addition to the pattern document, comments added to open issues)?

peterdesmet commented 4 months ago

Do you have more complete examples that I can test to validate the proposed solution?

Yes, here's a more complex use case. It has two fields: measurementType and measurementValue.

The (slightly adapted) rules are:

  1. If measurementType = cloudiness then measurementValue:
     • type = string
     • constraints.enum = ["clear", "mostly clear", "partly cloudy", "mostly cloudy", "cloudy", "unknown"]
  2. If measurementType = temperature then measurementValue:
     • type = number
     • constraints.min = 0
     • constraints.max = 20
  3. If measurementType = wind force then measurementValue:
     • type = integer
     • constraints.enum = [0, 1, 2, 3, 4, 5]

Not sure all of this should be expressible in Table Schema, but it is a real use case. 😄

loco-philippe commented 4 months ago

Yes, it works (exactly as you express the constraints)!

I added this example in the Jupyter notebook (example 2).

pschumm commented 4 months ago

The title of this PR includes "Add pattern..."; however, @roll linked this to #862 as closing that issue, which involves promoting uniqueness constraints to the spec. So are we talking about a pattern or a change to the spec here? I just want to make sure we're clear on this when discussing.

While these various constraints are all important, I think we should try to articulate what makes a specific constraint appropriate for inclusion in the spec versus something that should be implemented using a checklist or inquiry. Otherwise, any potential constraint would be eligible for inclusion in the spec, and I don't think we want that if we are to avoid unnecessary complexity in the spec.

I'm afraid I can't offer a specific proposal at the moment, but I'm keen to hear others' thoughts on this.

loco-philippe commented 4 months ago

I admit that I did not understand the dependency between the "unique constraint" pattern and the "relationships" pattern.

Example:

In this example, the product and plants fields are derived:

product  plants     quantity  price
apple    fruit      1 kg      1
apple    fruit      10 kg     10
carrot   vegetable  1 kg      1.5
carrot   vegetable  10 kg     20

The relationship is validated for the previous table but not for the next one:

product  plants     quantity  price
apple    fruit      1 kg      1
apple    fruit      10 kg     10
carrot   vegetable  1 kg      1.5
carrot   fruit      10 kg     20

How can a unique constraint be used for this validation?
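For reference, the custom check sketched earlier in this thread would express this validation directly (file names are hypothetical):

# Sketch: validating the product/plants relationship with the hypothetical
# relationship check defined in an earlier comment.
from frictionless import Checklist, Resource

checklist = Checklist(checks=[
    relationship(field1="product", field2="plants", link="derived"),
])
print(Resource("orders-valid.csv").validate(checklist).valid)    # True
print(Resource("orders-invalid.csv").validate(checklist).valid)  # False: carrot maps to two plants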

khusmann commented 4 months ago

I appreciate all the thought that's gone into this proposal – representing relationships between fields is something that, in general, has a lot of interesting potential directions.

That said, I want to echo @pschumm's concern about this approach being considered for inclusion into the spec. In general, I think there's quite a bit of complexity here that might be better expressed in other ways (e.g. checklists or inquiries). I would prefer to let these ideas simmer a bit longer before promotion to spec-level.

In addition to checklists and inquiries, here are other routes we might want to consider:

  1. I think fields with "conditional types" (e.g. Peter's example) would be better represented by a "tagged union" field type. So rather than evaluating a sequence of "if-else" clauses, we'd represent it with a single match on a field discriminator. The advantage of match over the "if-else" pattern is that it encourages exhaustiveness checks as part of the validation (like the match keyword in Python, Rust, etc.).

Tagged unions have the benefit of being a well-established, well-understood abstraction already implemented across a myriad of data parsing and validation libraries: for example, Python's pydantic and TypeScript's zod.

Implementing this behavior as a tagged union field type would also have the advantage of making it a separate, self-contained proposal, decoupled from the more complex idea of "field relationships".

Here's an example of how a tagged union field type might look for @peterdesmet's example (using the categorical syntax proposed in #875):

{
  "fields": [
    {
      "name": "measurementType",
      "type": "categorical",
      "categories": ["cloudiness", "temperature", "wind force"]
    },
    {
      "name": "measurementValue",
      "type": "union",
      "tag": "measurementType",
      "match": {
        "cloudiness": {
          "type": "categorical",
          "categories": ["clear", "mostly clear", "partly cloudy", "mostly cloudy", "cloudy", "unknown"]
        },
        "temperature": {
          "type": "number",
          "constraints": {
            "min": 0,
            "max": 20
          }
        },
        "wind force": {
          "type": "categorical",
          "categories": [0, 1, 2, 3, 4, 5]
        }
      }
    }
  ]
}

Note that the field-level validation on this type would ensure that all the levels of the measurementType categorical field are represented as keys of the match property in the measurementValue field. For example, if temperature weren't defined as a key in the match property, this would trigger a validation error, because temperature is one of the levels of the measurementType field. As mentioned earlier, this is a common feature of tagged union types.
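For comparison, a rough pydantic v2 sketch of the same discriminated (tagged) union, dispatching on measurementType; all names are illustrative:

# Sketch: the measurementType/measurementValue example as a pydantic v2
# discriminated union; validation dispatches on the measurementType tag.
from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field, TypeAdapter

class Cloudiness(BaseModel):
    measurementType: Literal["cloudiness"]
    measurementValue: Literal["clear", "mostly clear", "partly cloudy",
                              "mostly cloudy", "cloudy", "unknown"]

class Temperature(BaseModel):
    measurementType: Literal["temperature"]
    measurementValue: float = Field(ge=0, le=20)

class WindForce(BaseModel):
    measurementType: Literal["wind force"]
    measurementValue: Literal[0, 1, 2, 3, 4, 5]

Measurement = Annotated[
    Union[Cloudiness, Temperature, WindForce],
    Field(discriminator="measurementType"),
]

# a valid row; a wrong value or an unknown tag raises a ValidationError
TypeAdapter(Measurement).validate_python(
    {"measurementType": "temperature", "measurementValue": 12.5}
)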

  2. Regarding unique constraints on tables with relationships between columns, I see this as a side effect of the data producer not normalizing their tables. If the tables were normalized, the validation would be completely taken care of by the existing primaryKey and foreignKeys schema patterns (and would have the additional advantage of being supported by a multitude of existing database implementations).

For example, the aforementioned apples table would be better captured by a datapackage with two resources instead of one; one resource to capture the product information, and the other to capture the order line items:

Resource "product-info":

product  plants
apple    fruit
carrot   vegetable

Resource "orders":

product  quantity  price
apple    1 kg      1
apple    10 kg     10
carrot   1 kg      1.5
carrot   10 kg     20

Now, we can use the following datapackage.json for full validation, with primaryKey and foreignKeys describing the relationship between tables:

{
  "name": "order-database",
  "resources": [
    {
      "name": "product-info",
      "schema": {
        "fields": [
          {
            "name": "product",
            "type": "string"
          },
          {
            "name": "plants",
            "type": "categorical",
            "categories": ["fruit", "vegetable", "flower", "herb"]
          }
        ],
        "primaryKey": "product"
      }
    },
    {
      "name": "orders",
      "schema": {
        "fields": [
          {
            "name": "product",
            "type": "string"
          },
          {
            "name": "quantity",
            "type": "number",
            "bareNumber": false
          },
          {
            "name": "price",
            "type": "number"
          }
        ],
        "foreignKeys": [
          {
            "fields": "product",
            "reference": {
              "resource": "product-info",
              "fields": "product"
            }
          }
        ]
      }
    }
  ]
}

With this datapackage, if the data consumer wishes to obtain a table with columns product, plants, quantity, price, it's simply a join of the two tables.
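As a sketch, that join could be materialized with the frictionless pandas plugin (resource names follow the datapackage.json above; to_pandas is assumed to map the primary key to the index, hence the reset_index):

# Sketch: rebuilding the denormalized view from the two normalized resources.
from frictionless import Package

package = Package("datapackage.json")
products = package.get_resource("product-info").to_pandas().reset_index()
orders = package.get_resource("orders").to_pandas()

# join on the foreign key to recover product, plants, quantity, price
full = orders.merge(products, on="product")
full = full[["product", "plants", "quantity", "price"]]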

I realize this requires the data producer to normalize their resource tables, and the producer may not want to for whatever reason. And that's fine! In those cases, I think checklists or inquiries should be used, as mentioned by @pschumm, because they offer full flexibility. But I think adding validation support for relationships within unnormalized tables directly into the spec goes a bit too far – it introduces quite a bit of complexity into the spec for something that is already solved with good database design (or ameliorated with checklists / inquiries).

loco-philippe commented 4 months ago

Thank you @khusmann for these comments.

Here are my remarks:

peterdesmet commented 4 months ago

@khusmann Oh, I like the tagged union approach to conditional validation! Very easy to read. Would you mind submitting this as a separate issue?

loco-philippe commented 4 months ago

I also suggest, as indicated in the exchange, modifying the pattern to keep only the third option (a new Table descriptor) and deleting the two other options proposed for this descriptor (a new Field descriptor and a new constraint property). I will then also add a link to this example implementation (after taking your comments into account).

As indicated in the comment above, I have updated the pattern document (only the third option is kept, with a link to the custom check).

@roll, next steps?

  • 1 - validation of the PR (are other exchanges to be expected before the PR is validated?)
  • 2 - decision to take the "relationship" table descriptor into account (what is the decision process, e.g. a vote?)
  • 3 - choice of validation tool (to be seen after steps 1 and 2 and after a test phase)

I didn't get any feedback on this point (I had in mind a process similar to the one I used for the pandas specifications).

Other questions:

roll commented 3 months ago

@loco-philippe Thanks a lot for the tremendous work and to everyone who participated in the discussion!

As per the Working Group discussion, from Data Package (v2) onwards, patterns (recipes) become more autonomous, so we are merging this as it stands, based on your vision; you will control the subsequent evolution of the approach based on user feedback and implementation experience.

roll commented 3 months ago

Also published here - https://datapackage.org/recipes/relationship-between-fields/

loco-philippe commented 3 months ago

@roll, @peterdesmet

Thanks roll,

I think the proposed solution (recipes) is good. It allows a proposal to be made available while waiting for user feedback. But it also seems necessary to me, on the one hand, to consider the keyword 'relationship' as assigned or reserved and, on the other hand, to have an area for sharing custom_checks. I propose creating an issue for each of these two points.

Furthermore, the discussion in this PR also focused on conditional constraints (see the request from @peterdesmet). It would also be interesting to formalize that proposal in the form of another recipe (I will add a comment to this effect in issue #169, identified by Peter). What do you think?

roll commented 3 months ago

Thanks @loco-philippe,

Thanks! It will be great! Also, it's a good point about a recipe namespace for properties. I think we need to discuss it at the next Working Group meeting (cc @sapetti9).