Unclear handing of null in uniqueKeys and (implicitly) foreignKeys

frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.

The Unlicense

498 stars 113 forks source link

I wrote the original uniqueKeys pattern (https://specs.frictionlessdata.io/patterns/#table-schema-unique-constraints) and only noticed during the documentation review yesterday that a version of uniqueKeys made it into the specs.

I'm fine with us supporting only the SQL standard (i.e. null values are unique) and dropping the need for my originally-proposed uniqueNulls property. However, I believe the documentation (and perhaps the specs) could be improved with regard to how null values are thus handled. The short fix would be to clarify that composite unique keys are excluded from the uniqueness check if they contain at least one null value. But maybe this is worth a longer explanation:

`uniqueKeys`

The documentation currently states:

All the field values that are on the logical level are considered to be null values MUST be excluded from the uniqueness check, as the uniqueKeys property is modeled on the concept of unique constraint in SQL.

I find this misleading, as the SQL standard considers null values unique (or maybe more precisely, distinct), meaning for example that these two rows are unique and thus considered valid for uniqueKey [a, b]:

a	b
1	null
1	null

Which isn't what a reader is likely to assume if we say that null values are "excluded". Or what a data packager may want, hence why I originally proposed the ability to specify uniqueNulls: false (e.g. the behavior used by Python and R).

`foreignKeys`

The documentation currently states:

They are directly modelled on the concept of foreign keys in SQL

If so, per the SQL standard, this would require (although nowhere stated) at least a uniqueKey (if not a primaryKey) on the reference field(s). Do we have any opinion on this?

Furthermore (and regardless of the above), per SQL standards, local field values that are not in the reference fields are permitted if at least one of the local fields is null. These two tables would be considered valid for foreignKey local [a, b] → reference [a, b]:

reference a	b
2	1
3	1

local a	b
1	null

Is this how we intend foreign keys to operate?

a	b
1	SKIPPED
1	OMITTED
1	SKIPPED

SKIPPED

OMITTED

SKIPPED

frictionlessdata / datapackage

Unclear handing of null in uniqueKeys and (implicitly) foreignKeys #941

uniqueKeys

foreignKeys

`uniqueKeys`

`foreignKeys`