loco-philippe closed this pull request 3 months ago.
Thanks @loco-philippe!
I'm going to request a review from the Working Group on Monday.
Thank you @roll!
Is there enough information, or should I add something?
First, let me say that I recognize how much work and thought has gone into this proposal. I read through it carefully, and have several thoughts:
1) Of all three options, I believe the most appropriate would be a new property in the table descriptor since, as the author notes, a relationship between two or more fields is not a property of an individual field per se.
2) I don't find the examples given to be very compelling. For example, I wouldn't typically store quarter if I had month, since quarter can be easily calculated from month. And if you do want to include it in the data, that relationship (i.e., what you are referring to as derived) can be easily validated (and clearly documented) by specifying a row constraint. Similarly, I'm not sure that your "nicknames example" truly demonstrates a coupled relationship, as the relationship between names and nicknames is many-to-one (i.e., two names may share the same nickname). Finally, your example of a crossed relationship not only represents a conceptual relationship between year and semester, but also a property of the specific dataset (i.e., that it includes an observation for every possible year × semester combination).
3) More generally, IIUC, what we are really talking about here is validating many-to-one (i.e., derived), one-to-one (i.e., coupled), and many-to-many (with all possible combinations present only once, or fully crossed) relationships between fields. And these can currently be accomplished with a row constraint (which not only verifies the cardinality of the derived relationship but also its precise functional form), by moving one field to a separate table and using the existing `primaryKey` and `foreignKeys` properties, or by providing a custom validation check.
Personally, I don't believe that these warrant an addition to the official table schema at the moment (I might be persuaded by different examples or use cases). That said, I do see the value in facilitating validation of these types of relationships, and would suggest as an alternative adding such things to the Python framework as built-in validation checks.
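The three relationship types named in the comment above (derived = many-to-one, coupled = one-to-one, crossed = all combinations present) boil down to cardinality checks over the distinct value pairs of two columns. Here is a rough, framework-independent sketch; the function name `check_relationship` is illustrative and not part of any spec or library:

```python
# Illustrative cardinality checks for the three relationship types
# discussed above; `kind` is "derived", "coupled" or "crossed".
def check_relationship(col1, col2, kind):
    pairs = set(zip(col1, col2))  # distinct (value1, value2) links
    v1 = {a for a, _ in pairs}
    v2 = {b for _, b in pairs}
    if kind == "derived":    # many-to-one: col2 is a function of col1
        return len(pairs) == len(v1)
    if kind == "coupled":    # one-to-one in both directions
        return len(pairs) == len(v1) == len(v2)
    if kind == "crossed":    # every (v1, v2) combination appears
        return pairs == {(a, b) for a in v1 for b in v2}
    raise ValueError(f"unknown relationship kind: {kind}")

months = [1, 2, 3, 4, 5]
quarters = ["Q1", "Q1", "Q1", "Q2", "Q2"]
print(check_relationship(months, quarters, "derived"))  # True
```

A row constraint can additionally verify the precise functional form (e.g. that quarter is actually ceil(month / 3)); the sketch above only verifies cardinality.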
> First, let me say that I recognize how much work and thought has gone into this proposal. I read through it carefully, and have several thoughts:

Thank you also for this relevant and constructive feedback!
> - Of all three options, I believe the most appropriate would be a new property in the table descriptor since, as the author notes, a relationship between two or more fields is not a property of an individual field per se.
I agree with this analysis.
> - I don't find the examples given to be very compelling. For example, I wouldn't typically store quarter if I had month, since quarter can be easily calculated from month. And if you do want to include it in the data, that relationship (i.e., what you are referring to as derived) can be easily validated (and clearly documented) by specifying a row constraint. Similarly, I'm not sure that your "nicknames example" truly demonstrates a coupled relationship, as the relationship between names and nicknames is many-to-one (i.e., two names may share the same nickname). Finally, your example of a crossed relationship not only represents a conceptual relationship between year and semester, but also a property of the specific dataset (i.e., that it includes an observation for every possible year × semester combination).
Indeed, the chosen examples are too basic and unrealistic (I used another example in the patterns file, but it can also be considered unrealistic). On the other hand, in the same patterns file (chapter "Principles"), I included a link to a real (simplified) open-data example concerning electric vehicle charging infrastructure (stations).
> - More generally, IIUC, what we are really talking about here is validating many-to-one (i.e., derived), one-to-one (i.e., coupled), and many-to-many (with all possible combinations present only once, or fully crossed) relationships between fields. And these can currently be accomplished with a row constraint (which not only verifies the cardinality of the derived relationship but also its precise functional form), by moving one field to a separate table and using the existing `primaryKey` and `foreignKeys` properties, or by providing a custom validation check.
>
> Personally, I don't believe that these warrant an addition to the official table schema at the moment (I might be persuaded by different examples or use cases).
The French IRVE open data (used for the simplified example shown above) is interesting to look at:
> That said, I do see the value in facilitating validation of these types of relationships, and would suggest as an alternative adding such things to the Python framework as built-in validation checks.
Regarding the subject as a whole, in my opinion there are several points to consider:
- **dataset integrity with respect to relationships between fields**: this need is, in my opinion, fundamental, because without integrity the extracted data could be false (see the open-data example indicated above). I understand from your response that you share this point of view, since you offer a way to carry out this type of control.
- **relationship identification methodology**: the relationships between fields are not coincidental; they arise from the structure of the information represented in the tabular dataset. In complex datasets, this structure is described by a data model which identifies the relationships between entities (for example, the open-data methodology of France specifies this approach in two stages: data model, then data schema). There are therefore two possible options for Table Schema:
Personally, I am in favor of the first option, and I would like to be able to convince you of it!
- **error control and identification tool**: I have no comments on the choice of a solution. On the other hand, I am interested in the solution you propose and would like to run some tests (is the v4 Python framework sufficient? Do you have other recommendations or examples I can use?)
> I have no comments on the choice of a solution. On the other hand, I am interested in the solution you propose and would like to run some tests (is the v4 Python framework sufficient? Do you have other recommendations or examples I can use?)
If you can, I would focus on v5 since it's been out for a while now. My suggestion would be to write a plugin to provide a new "relationship" check that takes three arguments: name of field 1, name of field 2, and type of relationship (i.e., derived, coupled or crossed). It might be helpful to look at the row-constraint check as an example.
Note that you will most likely want to specify all of `validate_start()` (to verify that both fields are present in the resource), `validate_row()` (to build up your two lists of values as the data are streamed), and `validate_end()` (to do the appropriate check based on the specified relationship).
Once your new check is available, you may then specify it in your data package via an inquiry. This makes it easy to validate the data, regardless of whether you're working with the CLI or the Python API. And as I said above, I think it would then be more appropriate to add this check to the list of built-in checks, rather than to add it to the official table schema.
If you need help doing this, let me know (but I'm afraid I'm tied up with another project at the moment). And I defer to @roll if he wants to suggest an alternative strategy.
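The three-hook streaming pattern described in the comment above can be sketched in plain Python. To be clear, this is not the actual frictionless API, just a self-contained stand-in that mimics the hook names mentioned (`validate_start`, `validate_row`, `validate_end`); a real plugin would subclass the framework's `Check` class instead:

```python
# Minimal stand-in for a streaming "relationship" check, mimicking the
# validate_start/validate_row/validate_end hooks described above.
# This is NOT the frictionless API, just an illustration of the pattern.
class RelationshipCheck:
    def __init__(self, field1, field2, relationship):
        self.field1, self.field2 = field1, field2
        self.relationship = relationship  # "derived", "coupled" or "crossed"
        self.pairs = []

    def validate_start(self, field_names):
        # Verify that both fields are present in the resource.
        return [f"missing field: {f}"
                for f in (self.field1, self.field2) if f not in field_names]

    def validate_row(self, row):
        # Accumulate the value pairs as the data are streamed.
        self.pairs.append((row[self.field1], row[self.field2]))
        return []  # this check produces no per-row errors

    def validate_end(self):
        # Do the appropriate cardinality check at the end of the stream.
        links = set(self.pairs)
        v1 = {a for a, _ in links}
        v2 = {b for _, b in links}
        ok = {"derived": len(links) == len(v1),
              "coupled": len(links) == len(v1) == len(v2),
              "crossed": links == {(a, b) for a in v1 for b in v2}}
        if not ok.get(self.relationship, False):
            return [f"fields {self.field1!r} and {self.field2!r} "
                    f"do not satisfy a {self.relationship} relationship"]
        return []

check = RelationshipCheck("month", "quarter", "derived")
errors = check.validate_start({"month", "quarter"})
for row in [{"month": 1, "quarter": "Q1"}, {"month": 4, "quarter": "Q2"}]:
    errors += check.validate_row(row)
errors += check.validate_end()
print(errors)  # []
```

Accumulating pairs in memory is the simplest approach; a production check on very large tables might prefer to keep only the distinct links.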
Thank you @pschumm for these detailed answers!
I also won't have much time to work on this implementation in January but I think I can work on it in February.
On the other hand, @roll, in terms of methodology: how do we create a data schema from a data model (if the notion of relationship does not exist), and then validate that a dataset respects the constraints expressed in the data model? (It would be difficult to explain that Table Schema processes the data contained in the Fields but does not concern itself with the structure of the Dataset.)
(of course, from my point of view, the answer is to include the notion of relationship in a data schema and use a corresponding validation tool!)
Here is a first summary of the discussions:
@roll, next steps?
This is an interesting proposal. @loco-philippe I am wondering if it can solve the use case for conditional requirements, brought up here #169 and https://github.com/tdwg/camtrap-dp/issues/32? E.g. for data:
```
observationType,scientificName
animal,Vulpes vulpes
blank,
```
Express that `scientificName` is required when `observationType == animal`.
> This is an interesting proposal.

Thank you @peterdesmet!

> @loco-philippe I am wondering if it can solve the use case for conditional requirements, brought up here #169 and tdwg/camtrap-dp#32? E.g. for data:
>
> ```
> observationType,scientificName
> animal,Vulpes vulpes
> blank,
> ```
>
> Express that `scientificName` is required when `observationType == animal`.
This topic is interesting and I have several ideas for dealing with it. They are formalized in this Notebook (working document).
I'm going to work on it for a while longer, keeping in mind:
and, at the same time: the way pandas treats this type of subject, and the links with the 'relationships' proposal.
-> I welcome any ideas or comments on the subject!
> I welcome any ideas or comments on the subject!
FYI, the `ifField` suggestion in your notebook reminds me of a syntax we once invented at @inbo for data validation, which also had the option to define conditional constraints (defined at field level, not resource level): https://github.com/inbo/whip/blob/master/docs/syntax.md#if. It worked quite well, but we abandoned it in favour of Table Schema. I think you're on the right track investigating the link between relationships and conditional constraints.
@pschumm @roll
As noted above, I was able to spend some time in February understanding Table Schema and implementing an example solution for the `relationships` descriptor.
The solution presented in this Notebook works and shows that we can implement this property without too much difficulty (better integration is undoubtedly to be expected).
I also suggest, as indicated in the exchange, to modify the pattern to keep only the third option (new Table descriptor) and to delete the two other options proposed for this descriptor (new Field descriptor and new constraint property). I will then also add a link to this example implementation (after taking your comments into account).
@roll, how do you see the continuation of this pattern?
@peterdesmet
I took the time to look at how Table Schema works, and I think I can implement a solution for the conditional requirements question (before the end of February).
@peterdesmet
Finally, the implementation is simpler than expected!
This second Notebook presents the solution (which works).
Do you have more complete examples that I can test to validate the proposed solution?
Note: how to take this proposal into account (addition to the Pattern document, addition of comments to open issues)?
> Do you have more complete examples that I can test to validate the proposed solution?
Yes, here's a more complex use case. It has two fields:

- `measurementType`: can have the values `cloudiness`, `temperature` and `wind force`
- `measurementValue`: contains the actual measurement, with a type and values that depend on `measurementType`.

The (slightly adapted) rules are:
- If `measurementType = cloudiness`, then `measurementValue`:
  - `type = string`
  - `constraints.enum = ["clear", "mostly clear", "partly cloudy", "mostly cloudy", "cloudy", "unknown"]`
- If `measurementType = temperature`, then `measurementValue`:
  - `type = number`
  - `constraints.min = 0`
  - `constraints.max = 20`
- If `measurementType = wind force`, then `measurementValue`:
  - `type = integer`
  - `constraints.enum = [0, 1, 2, 3, 4, 5]`
Not sure all of this should be expressible in Table Schema, but it is a real use case. 😄
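As a quick sanity check, the three rules above can be encoded as a small dispatch table in plain Python. This is purely illustrative (the names `RULES` and `validate` are not from any spec or framework), but it shows the conditional logic is straightforward to express:

```python
# Illustrative validators for the measurementType/measurementValue rules above.
CLOUDINESS = {"clear", "mostly clear", "partly cloudy",
              "mostly cloudy", "cloudy", "unknown"}

RULES = {
    "cloudiness": lambda v: isinstance(v, str) and v in CLOUDINESS,
    "temperature": lambda v: isinstance(v, (int, float)) and 0 <= v <= 20,
    "wind force": lambda v: isinstance(v, int) and v in {0, 1, 2, 3, 4, 5},
}

def validate(measurement_type, measurement_value):
    # An unknown measurementType is itself an error (exhaustiveness).
    rule = RULES.get(measurement_type)
    return rule is not None and rule(measurement_value)

print(validate("cloudiness", "mostly cloudy"))  # True
print(validate("temperature", 25))              # False (out of range)
```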
Yes, it works (exactly as you express the constraints)!
I added this example in the Jupyter notebook (example 2).
The title of this PR includes "Add pattern...", however @roll linked this to #862 as closing that issue, which involves promoting uniqueness constraints to the spec. So are we talking about a pattern or a change to the spec here? Just want to make sure we're clear on this when discussing.
While these various constraints are all important, I think we should try to articulate what makes a specific constraint appropriate for inclusion in the spec versus something that should be implemented using a checklist or inquiry. Otherwise, any potential constraint would be eligible for consideration to be included in the spec. And I don't think we want that if we want to avoid unnecessary complexity in the spec.
I'm afraid I can't offer a specific proposal at the moment, but I'm keen to hear others' thoughts on this.
I admit that I did not understand the dependency between the *unique constraint* pattern and the *relationships* pattern.
Example:

In this example the `product` and `plants` fields are derived:
| product | plants | quantity | price |
|---|---|---|---|
| apple | fruit | 1 kg | 1 |
| apple | fruit | 10 kg | 10 |
| carrot | vegetable | 1 kg | 1.5 |
| carrot | vegetable | 10 kg | 20 |
The relationship is validated for the previous table, but not for the next one:
| product | plants | quantity | price |
|---|---|---|---|
| apple | fruit | 1 kg | 1 |
| apple | fruit | 10 kg | 10 |
| carrot | vegetable | 1 kg | 1.5 |
| carrot | fruit | 10 kg | 20 |
How can the `unique` constraint be used for this validation?
I appreciate all the thought that's gone into this proposal – representing relationships between fields is something that, in general, has a lot of interesting potential directions.
That said, I want to echo @pschumm's concern about this approach being considered for inclusion into the spec. In general, I think there's quite a bit of complexity here that might be better expressed in other ways (e.g. checklists or inquiries). I would prefer to let these ideas simmer a bit longer before promotion to spec-level.
In addition to checklists & inquiries, here are other routes we might want to consider:
- A tagged union field type, with `match` on a field discriminator. The advantage of `match` over the "if-else" pattern is that it encourages exhaustiveness checks as part of the validation (e.g. like the `match` keyword in Python, Rust, etc.).

Tagged unions have the benefit of being a well-established, well-understood abstraction already implemented across a myriad of data parsing & validation libraries: for example, Python's pydantic, and TypeScript's zod.
Implementing this behavior as a tagged union field type would also have the advantage of making it a separate, self-contained proposal, decoupled from the more complex idea of "field relationships".
Here's an example of how a tagged union field type might look for @peterdesmet's example (using the proposed categorical syntax in #875):
{
"fields": [
{
"name": "measurementType",
"type": "categorical",
"categories": ["cloudiness", "temperature", "wind force"]
},
{
"name": "measurementValue",
"type": "union",
"tag": "measurementType",
"match": {
"cloudiness": {
"type": "categorical",
"categories": ["clear", "mostly clear", "partly cloudy", "mostly cloudy", "cloudy", "unknown"]
},
"temperature": {
"type": "number",
"constraints": {
"min": 0,
"max": 20
}
},
"wind force": {
"type": "categorical",
"categories": [0, 1, 2, 3, 4, 5]
}
}
}
]
}
Note that the field-level validation on this type would ensure that all the levels of the `measurementType` categorical field were represented as keys of the `match` property in the `measurementValue` field. For example, if `temperature` wasn't defined as a key in the `match` property, this would trigger a validation error, because `temperature` is one of the levels of the `measurementType` field. As mentioned earlier, this is a common feature of tagged union types.
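The exhaustiveness check described above amounts to a set comparison between the tag field's declared categories and the keys of `match`. A rough sketch (the function name `check_exhaustive` is hypothetical, not part of any proposal):

```python
# Rough sketch of the exhaustiveness check a tagged-union field type
# enables: every category of the tag field must appear as a key of `match`.
def check_exhaustive(tag_categories, match_keys):
    missing = set(tag_categories) - set(match_keys)
    return [f"no match entry for category {c!r}" for c in sorted(missing)]

tag_categories = ["cloudiness", "temperature", "wind force"]
match_keys = ["cloudiness", "wind force"]  # "temperature" forgotten
print(check_exhaustive(tag_categories, match_keys))
# ["no match entry for category 'temperature'"]
```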
Regarding `unique` constraints on tables with relationships between columns, I see this as a side effect of the data producer not normalizing their tables. If the tables were normalized, the validation is completely taken care of by the existing `primaryKey` and `foreignKeys` schema patterns (and has the additional advantage of being supported by a multitude of existing database implementations).

For example, the aforementioned apples table would be better captured by a data package with two resources instead of one: one resource to capture the product information, and the other to capture the order line items:
Resource "product-info":
| product | plants |
|---|---|
| apple | fruit |
| carrot | vegetable |
Resource "orders":
| product | quantity | price |
|---|---|---|
| apple | 1 kg | 1 |
| apple | 10 kg | 10 |
| carrot | 1 kg | 1.5 |
| carrot | 10 kg | 20 |
Now, we can use the following `datapackage.json` for full validation, with `primaryKey` and `foreignKeys` describing the relationship between tables:
{
"name": "order-database",
"resources": [
{
"name": "product-info",
"schema": {
"fields": [
{
"name": "product",
"type": "string"
},
{
"name": "plants",
"type": "categorical",
"categories": ["fruit", "vegetable", "flower", "herb"]
}
],
"primaryKey": "product"
}
},
{
"name": "orders",
"schema": {
"fields": [
{
"name": "product",
"type": "string"
},
{
"name": "quantity",
"type": "number",
"bareNumber": false
},
{
"name": "price",
"type": "number"
}
],
"foreignKeys": [
{
"fields": "product",
"reference": {
"resource": "product-info",
"fields": "product"
}
}
]
}
}
]
}
With this data package, if the data consumer wishes to obtain a table with the columns product, plants, quantity, price, it's simply a `join` of the two tables.
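The join mentioned above can be sketched in a few lines of plain Python (a stand-in for a SQL or pandas join; the data is taken from the two resource tables shown earlier):

```python
# Reconstructing the denormalized table by joining the two normalized
# resources on the "product" key.
product_info = {"apple": "fruit", "carrot": "vegetable"}  # product-info resource
orders = [                                                # orders resource
    ("apple", "1 kg", 1),
    ("apple", "10 kg", 10),
    ("carrot", "1 kg", 1.5),
    ("carrot", "10 kg", 20),
]

joined = [
    (product, product_info[product], quantity, price)
    for product, quantity, price in orders
]
print(joined[0])  # ('apple', 'fruit', '1 kg', 1)
```

Because `plants` is looked up from the single `product-info` row for each product, an inconsistent pair like `carrot | fruit` simply cannot occur in the joined output, which is the point of the normalization argument above.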
I realize this requires the data producer to normalize their resource tables, and the producer may not want to for whatever reason. And that's fine! In those cases, I think checklists or inquiries should be used, as mentioned by @pschumm, because they offer full flexibility. But I think adding validation support for relationships within unnormalized tables directly into the spec goes a bit too far – it introduces quite a bit of complexity into the spec for something that is already solved with good database design (or ameliorated with checklists / inquiries).
Thank you @khusmann for these comments.
Here are my remarks:
- **conditional type**:

  A `tagged union` is indeed better suited to the example proposed by @peterdesmet than the use of `if`/`then`. `match` is suitable for categorical fields; `if`/`then` is suitable for other types of controls (e.g. "if field1 > 10 then field2 < 0"). It is similar with `allOf`, `anyOf` and `oneOf`.

  `if`/`then`/`else`/`allOf`/`anyOf`/`oneOf` is compatible with JSON Schema (I seem to have read that it was desirable to maintain consistency with JSON Schema keywords, but I'm not sure).

- **fields relationships**:
- **decoupling**:

  I also agree about decoupling these needs / issues:
  - `relationships` is a "table" level need which cannot be handled at row level
  - `match` is a "field" level need that can be handled at row level
  - `if`/`then`/`else`/`allOf`/`anyOf`/`oneOf` are "table" level requirements which can be processed at row level

- **spec**:
The notion of 'in spec' and 'off spec' is not clear to me. Are there two spec levels: one for keywords and schema syntax, and another for tools? If so, couldn't this be handled initially only at the first level (which would allow processing either with custom-check tools or with internal tools)?
@khusmann Oh, I like the tagged union approach to conditional validation! Very easy to read. Would you mind submitting this as a separate issue?
> I also suggest, as indicated in the exchange, to modify the pattern to keep only the third option (new Table descriptor) and to delete the two other options proposed for this descriptor (new Field descriptor and new constraint property). I will then also add a link to this example implementation (after taking your comments into account).
As indicated in the comment above, I updated the pattern document (keeping only the third option, with a link to the custom check).
@roll, next steps?
1. Validation of the PR (are further exchanges to be expected before the PR is validated?)
2. Decision on taking the `relationship` table descriptor into account (what is the decision process? A vote?)
3. Choice of validation tool (to be decided after steps 1 and 2, and after a test phase)
I didn't get any feedback on this point (I had in mind a process similar to the one I used for the pandas specifications).
Other questions:
@loco-philippe Thanks a lot for the tremendous work and to everyone who participated in the discussion!
As per the Working Group discussion, starting with Data Package (v2), patterns (recipes) become more autonomous, so we are merging this as is, based on your vision; you will control the further evolution of the approach based on user feedback and implementation experience.
Also published here - https://datapackage.org/recipes/relationship-between-fields/
@roll, @peterdesmet

Thanks @roll,
I think the proposed solution (recipes) is good. It makes a proposal available while waiting for user feedback. However, it also seems necessary to me, on the one hand, to consider the keyword 'relationship' as assigned or reserved, and on the other hand, to have an area for sharing custom checks. I propose to create an issue for each of these two points.
Furthermore, the discussion in this PR also focused on conditional constraints (see the request by @peterdesmet). It would also be interesting to formalize that proposal as another recipe (I will add a comment to this effect in issue #169 identified by Peter). What do you think?
Thanks @loco-philippe,
Thanks! That will be great! Also, it's a good point about a recipe namespace for properties. I think we need to discuss it at the next Working Group meeting (cc @sapetti9).
This pull request adds a pattern to express relationships between Fields and to define an associated validation tool.
This PR follows the issue #803 and is proposed by @roll.
Several options are proposed for the associated descriptor.
Thank you very much for your review and comments.
Note: I also added a table of contents.