Open · ezwelty opened this issue 4 months ago
> Which isn't what a reader is likely to assume if we say that null values are "excluded". Or what a data packager may want, hence why I originally proposed the ability to specify `uniqueNulls: false` (e.g. the behavior used by Python and R).
As a heavy Python / R user and light SQL user, yes, that behavior is surprising to me given the word "excluded". It would also be nice to be able to switch between the behaviors. But for now, given the time constraints, I think we should just focus on matching the SQL behavior, so that Frictionless schemas can accurately represent SQL databases. (We can add more options like `uniqueNulls` later, right?)
Maybe, in addition to changing the language, we could provide some examples for folks like me who are inexperienced with these sorts of SQL details?
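For example, here is a minimal sketch using Python's built-in `sqlite3` module (my own illustration, not anything from the spec or its docs) of the SQL behavior in question, i.e. that a composite unique constraint does not reject rows whose key contains a null:

```python
import sqlite3

# Composite UNIQUE constraint: NULLs are treated as distinct, so two identical
# rows are both accepted as long as part of the key is NULL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b INTEGER, UNIQUE (a, b))")
con.execute("INSERT INTO t VALUES (1, NULL)")
con.execute("INSERT INTO t VALUES (1, NULL)")  # no IntegrityError is raised
print(con.execute("SELECT count(*) FROM t").fetchone()[0])  # -> 2

# A fully non-null duplicate is still rejected:
con.execute("INSERT INTO t VALUES (1, 2)")
try:
    con.execute("INSERT INTO t VALUES (1, 2)")
except sqlite3.IntegrityError as e:
    print(e)  # UNIQUE constraint failed: t.a, t.b
```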
> If so, per the SQL standard, this would require (although nowhere stated) at least a `uniqueKey` (if not a `primaryKey`) on the reference field(s). Do we have any opinion on this?
My impressions given my limited background:
So my instinct would be to go the strict / simple route and require a `primaryKey` – but I defer to folks like you with more SQL / DB design experience.
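For what it's worth, SQLite for one enforces that requirement itself. A quick sketch (again just my illustration; the exact behavior and error wording vary by engine) where the referenced column has no primary key or unique constraint:

```python
import sqlite3

# Foreign keys must reference columns covered by a PRIMARY KEY or UNIQUE
# constraint; otherwise SQLite refuses to modify the child table.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE parent (a INTEGER)")  # no PRIMARY KEY / UNIQUE on a
con.execute("CREATE TABLE child (a INTEGER REFERENCES parent (a))")
try:
    con.execute("INSERT INTO child VALUES (1)")
except sqlite3.OperationalError as e:
    print(e)  # foreign key mismatch - "child" referencing "parent"
```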
> Is this how we intend foreign keys to operate?
That behavior would surprise me! I would expect all non-null foreign keys to be capable of valid joins to their primary keys, so `[1, null]` would not be allowed. Is there an important reason this is the behavior in the SQL standard? Oh – is it because you might need multiple transactions to build the foreign key, so incomplete foreign keys have to be allowed in the meantime?

In any case, if our goal here is to accurately describe an arbitrary SQL DB, then wouldn't we need to follow this behavior? Or, if the behavior is only used in edge cases (multiple transactions to build the foreign key), should we take a stricter route and only allow complete foreign keys? Sorry I'm not more help on the design here, other than as a reference for what your average Python / R user might expect.
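To make sure I understand the behavior being described, here is a sketch with SQLite (my own illustration, relying on its default MATCH behavior, not anything from the spec): a composite foreign key is simply not checked when any of its columns is null.

```python
import sqlite3

# Composite foreign key: rows with a NULL component are accepted even when no
# matching parent row exists; only fully non-null keys are checked.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE parent (a INTEGER, b INTEGER, PRIMARY KEY (a, b))")
con.execute(
    "CREATE TABLE child (a INTEGER, b INTEGER, "
    "FOREIGN KEY (a, b) REFERENCES parent (a, b))"
)
con.execute("INSERT INTO parent VALUES (1, 1)")
con.execute("INSERT INTO child VALUES (1, 1)")     # matches a parent row: accepted
con.execute("INSERT INTO child VALUES (1, NULL)")  # NULL component: not checked, accepted
con.execute("INSERT INTO child VALUES (2, NULL)")  # also accepted, despite no parent (2, ...)
try:
    con.execute("INSERT INTO child VALUES (2, 2)")  # fully non-null and unmatched: rejected
except sqlite3.IntegrityError as e:
    print(e)  # FOREIGN KEY constraint failed
```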
One more thought – these issues potentially get further complicated by tagged missing values. So if we have:
| a | b       |
|---|---------|
| 1 | SKIPPED |
| 1 | OMITTED |
| 1 | SKIPPED |
where `missingValues = ["SKIPPED", "OMITTED"]`, then I can imagine at least 3 policies for determining uniqueness of missing values:
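Purely for illustration, here is a small Python sketch of three policies that come to mind; these are my own guesses at what such policies could look like, not necessarily the three meant above and not language from the spec:

```python
# Illustrative only: these policies are my own guesses at what "uniqueness of
# missing values" could mean, not a list taken from the spec or this thread.
MISSING_VALUES = {"SKIPPED", "OMITTED"}
ROWS = [(1, "SKIPPED"), (1, "OMITTED"), (1, "SKIPPED")]

def duplicate_rows(rows, policy):
    """Return indices of rows that repeat an earlier key under the given policy."""
    seen, dupes = set(), []
    for i, row in enumerate(rows):
        if policy == "distinct" and any(v in MISSING_VALUES for v in row):
            continue  # SQL-like: a key containing a missing value never collides
        if policy == "equal":
            key = tuple(None if v in MISSING_VALUES else v for v in row)  # collapse tags
        else:
            key = row  # "by-tag" (and "distinct" with no missing values): compare as-is
        if key in seen:
            dupes.append(i)
        seen.add(key)
    return dupes

for policy in ("distinct", "by-tag", "equal"):
    print(policy, duplicate_rows(ROWS, policy))
# distinct []      -> no row is ever a duplicate
# by-tag   [2]     -> only the repeated (1, SKIPPED) is a duplicate
# equal    [1, 2]  -> all three rows collapse to (1, None)
```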
I wrote the original `uniqueKeys` pattern (https://specs.frictionlessdata.io/patterns/#table-schema-unique-constraints) and only noticed during the documentation review yesterday that a version of `uniqueKeys` made it into the specs. I'm fine with us supporting only the SQL standard (i.e. null values are unique) and dropping the need for my originally-proposed `uniqueNulls` property. However, I believe the documentation (and perhaps the specs) could be improved with regard to how null values are thus handled. The short fix would be to clarify that composite unique keys are excluded from the uniqueness check if they contain at least one `null` value. But maybe this is worth a longer explanation:

`uniqueKeys`
The documentation currently states:
I find this misleading, as the SQL standard considers `null` values unique (or maybe more precisely, distinct), meaning for example that these two rows are unique and thus considered valid for `uniqueKey` `[a, b]`:

Which isn't what a reader is likely to assume if we say that `null` values are "excluded". Or what a data packager may want, hence why I originally proposed the ability to specify `uniqueNulls: false` (e.g. the behavior used by Python and R).
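(As an aside, and purely my own illustration rather than part of this issue: the Python behavior referred to here can be seen in pandas, which treats missing values as equal when looking for duplicate rows.)

```python
import pandas as pd

# pandas treats missing values as equal when identifying duplicate rows,
# i.e. the "uniqueNulls: false" semantics, unlike SQL where NULLs are distinct.
df = pd.DataFrame({"a": [1, 1], "b": [None, None]})
print(df.duplicated().tolist())  # [False, True]: the second row counts as a duplicate
```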
`foreignKeys`

The documentation currently states:
If so, per the SQL standard, this would require (although nowhere stated) at least a `uniqueKey` (if not a `primaryKey`) on the reference field(s). Do we have any opinion on this?

Furthermore (and regardless of the above), per SQL standards, local field values that are not in the reference fields are permitted if at least one of the local fields is `null`. These two tables would be considered valid for `foreignKey` local `[a, b]` → reference `[a, b]`:

Is this how we intend foreign keys to operate?