Closed gabywh closed 3 years ago
So you mean that a date of birth only containing the year should be allowed? When was this decision made?
Informing @sondaica also...
Such a breaking change of the schema will require changes to all existing implementations. The schema uses format date
and therefore requires a full date as YYYY-MM-DD.
This will make the schema version go to 2.0.0 as the change is not backwards compatible.
How is that according to the regulation? It states date of birth.
It was found that it is somewhat common for dates of birth to not be exactly known; e.g. when just a year is known; or the UNHCR style "summer" or "winter" (in NL - about 70k people).
As a result - it is common for passports not to contain these. Some countries round this to 01-01 and/or 06-01 -- but not all do. A variation with just the year and month known - but not the date is also happening.
The goal of the DOB is binding with the passport.
As the passport ALSO does NOT contain these - the regex needs adjustment.
So then we maybe should use regexp according to ICAO Machine Readable Travel Document recommendation with the use of X for unknown elements. For example YYYY-XX-XX when only YYYY is known.
We also need to remove the `"format": "date"` from the definition of `dob`.
This small change will have rather big impact on all implementations. And I agree with @jschlyter that this is a breaking change.
This one should work for all and is fully backwards compatible:
"dob": {
"title": "Date of birth",
"description": "Date of Birth of the person addressed in the DGC. ISO 8601 date format restricted to range 1900-2099",
"type": "string",
"oneOf": [
{
"format": "date",
"pattern": "^(?:((19|20)\\d{2}-\\d{2}-\\d{2})|((19|20)\\d{2}))$",
"examples": [
"1979-04-14",
"1979"
]
},
{
"pattern": "^((19|20)\\d{2}-XX-XX)$",
"examples": [
"1979-XX-XX"
]
}
]
},
@miguelzapaton that is better, but since the date format is no longer mandatory it is still a breaking change. I see no other option than to move to 2.0.0 (or break the semantic version scheme).
If `format: 'date'` is removed, then a nonsense date like 1979-99-44 would still pass validation.
Imo there's no breaking change in this case, as `dob` is still `type: 'string'` (it was never a real date type) and the change is fully backwards compatible.
oneOf would no longer be needed and it could be defined like this (untested!):
"dob": {
"title": "Date of birth",
"description": "Date of Birth of the person addressed in the DGC. ISO 8601 date format restricted to range 1900-2099",
"type": "string",
"pattern": "^(?:((19|20)\\d{2}-\\d{2}-\\d{2})|((19|20)\\d{2})|((19|20)\\d{2}-XX-XX))$",
"examples": [
"1979-04-14",
"1979",
"1979-XX-XX",
]
},
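A quick sanity check of that combined pattern (a sketch using Python's `re`; note that JSON Schema applies `pattern` as an unanchored search, which is why the `^...$` anchors must be part of the pattern itself):

```python
import re

# Combined dob pattern from the schema snippet above
DOB_PATTERN = r"^(?:((19|20)\d{2}-\d{2}-\d{2})|((19|20)\d{2})|((19|20)\d{2}-XX-XX))$"

def matches(value: str) -> bool:
    # JSON Schema `pattern` semantics: an unanchored search,
    # so the anchors live inside the pattern itself.
    return re.search(DOB_PATTERN, value) is not None

# Values the relaxed schema is meant to accept:
assert matches("1979-04-14")
assert matches("1979")
assert matches("1979-XX-XX")

# Caveat raised in this thread: without `format: date`, a nonsense
# date like 1979-99-44 still matches the \d{2}-\d{2} part.
assert matches("1979-99-44")

# Clearly malformed values are still rejected:
assert not matches("79")
assert not matches("1979-XX")
```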
To avoid wrong dates like 1979-99-44 the first regexp part could still be strengthened.
Here you can find an extreme example:
https://stackoverflow.com/questions/15491894/regex-to-validate-date-format-dd-mm-yyyy
You could even try to find the exact regexp that corresponds to the definition of format: 'date' to make it semantically equal:
https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.section.7.3.1
Miguel
Imo there's no breaking change in this case as dob it is still type: 'string' (it was never a real date type) and the change is fully backwards compatible.
It most certainly was a real date type. And most of the code generated from the schema expects a real date.
This regex will match dates from 1900 to 2099 in the ISO 8601 acceptable formats (YYYY, YYYY-MM, YYYY-MM-DD)
^((19|20)\d{2}|(19|20)\d{2}-(01|02|03|04|05|06|07|08|09|10|11|12)|(19|20)\d{2}-(01|02|03|04|05|06|07|08|09|10|11|12)-(01|02|03|04|05|06|07|08|09|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31))$
I have a large set of test cases should you wish to use it. Let me know :)
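For reference, a sketch of how such an enumerated pattern behaves (Python's `re`; the regex below is my own compacted restatement of the year/month/day alternation above, not the exact text proposed):

```python
import re

# Compact equivalent of the enumerated YYYY / YYYY-MM / YYYY-MM-DD
# pattern: years 1900-2099, months 01-12, days 01-31.
PATTERN = r"^((19|20)\d{2}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?)$"

def ok(value: str) -> bool:
    return re.fullmatch(PATTERN, value) is not None

assert ok("1979")
assert ok("1979-04")
assert ok("1979-04-14")
assert not ok("1979-13")      # month out of range
assert not ok("1979-04-32")   # day out of range
assert not ok("1879")         # year out of range

# Limitation: calendar-impossible dates still pass; a regex alone
# cannot enforce days-per-month.
assert ok("1979-02-31")
```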
Such a breaking change of the schema will require changes to all existing implementations. The schema uses `format: date` and therefore requires a full date as YYYY-MM-DD. This will make the schema version go to 2.0.0 as the change is not backwards compatible.
this is not breaking back compat
Such a breaking change of the schema will require changes to all existing implementations. The schema uses `format: date` and therefore requires a full date as YYYY-MM-DD.
no
This will make the schema version go to 2.0.0 as the change is not backwards compatible.
no
It was found that it is somewhat common for dates of birth to not be exactly known; e.g. when just a year is known; or the UNHCR style "summer" or "winter" (in NL - about 70k people). As a result - it is common for passports not to contain these. Some countries round this to 01-01 and/or 06-01 -- but not all do. A variation with just the year and month known - but not the date is also happening.
I think you mean "day" and not "date" here in "but not the date is also happening"?
The goal of the DOB is binding with the passport. As the passport ALSO does NOT contain these - the regex needs adjustment.
Yes. Also once again tech people losing focus: the JSON schema is not the final arbiter here, despite the temptation for coders to consider it as such.
From the README.md https://github.com/ehn-digital-green-development/ehn-dgc-schema/blob/next/README.md
The schema in conjunction with business rules (which may also be specific to a Member State) shall ensure conformity to the EU legislation, whereby the burden of conformity lies with the business rules and the DGC JSON schema plays a supporting role in allowing the data to be serialized and deserialized in a flexible, yet structured, manner
This regex will match dates from 1900 to 2099 in the ISO 8601 acceptable formats (YYYY, YYYY-MM, YYYY-MM-DD)
^((19|20)\d{2}|(19|20)\d{2}-(01|02|03|04|05|06|07|08|09|10|11|12)|(19|20)\d{2}-(01|02|03|04|05|06|07|08|09|10|11|12)-(01|02|03|04|05|06|07|08|09|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31))$
I have a large set of test cases should you wish to use it. Let me know :)
Stop. This is going completely the wrong way here! Once again: the JSON is not the final arbiter.
I'll re-state Postel's law once more: "generous with what you accept, strict with what you send"
@gabywh agree with all the things above, but not with the aspect that the schema should only play a supporting role. The schema is the schema, it is the minimum structure that is defined, having a fuzzy landscape for the schema is very difficult to handle. after the min. structure is established, the business rules apply.
The schema is a formal specification of the business rules and it is correct that the schema is not the arbiter. However, the schema is the technical control, and changing the allowed syntax for a property to a more permissive format is indeed a breaking change, as validating a `dob` such as `2011` with schema 1.0.0 will fail. Following semver, it's fine to remove `format: date`, but we have to bump the major version.
Code generators use the JSON schema to generate their data model, and when they find `format: date` they interpret that as an ISO 8601 date per RFC 3339 (section 5.6, `full-date`). Trying to parse `2011` or `2021-05` with such code will fail.
We either move to schema v2 or continue to require a full date. I'm fine with either, but the change has to be done correctly.
III) What does "backwards compatibility" mean exactly - I think there are different perspectives:
While staying at the same major version, it must be possible to verify all future payloads with any earlier versions of the schema.
So then we maybe should use regexp according to ICAO Machine Readable Travel Document recommendation with the use of X for unknown elements. For example YYYY-XX-XX when only YYYY is known.
which is not valid ISO 8601 and therefore breaks the EU eHN spec
III) What does "backwards compatibility" mean exactly - I think there are different perspectives:
While staying at the same major version, it must be possible to verify all future payloads with any earlier versions of the schema.
So if it is like this, new minor schema versions can only be stricter with respect to existing properties or property-enhancing (as additionalProperties is true by default)?
@gabywh agree with all the things above, but not with the aspect that the schema should only play a supporting role.
That is its role here, as defined right from the beginning.
The schema is the schema,
umm.. yes agree
it is the minimum structure that is defined,
yes agree
having a fuzzy landscape for the schema is very difficult to handle.
what "fuzzy landscape" ? The schema is defined as it is.
after the min. structure is established, the business rules apply.
yes agree
III) What does "backwards compatibility" mean exactly - I think there are different perspectives:
While staying at the same major version, it must be possible to verify all future payloads with any earlier versions of the schema.
So if it is like this, new minor schema versions can only be stricter with respect to existing properties or property-enhancing (as additionalProperties is true by default)?
semver is clear about when to increment the major version: "MAJOR version when you make incompatible API changes". IOW: do you need to re-write the client? In this case no, we still use ISO 8601 dates. The valid cases that were valid before will still be valid. The relaxation of the spec in the schema means that some cases that were previously invalid will now be considered valid - which is the whole purpose of relaxing the dob regex.
The schema is a formal specification of the business rules and it is correct that the schema is not the arbiter.
In which case it cannot be the formal specification of the business rules, by definition.
However, the schema is the technical control
Which means what? I make the decision in my business rules what I am willing to generate / accept.
and changing the allowed syntax for a property to a more permissive format is indeed a breaking change, as validating a `dob` such as `2011` with schema 1.0.0 will fail. Following semver, it's fine to remove `format: date`, but we have to bump the major version. Code generators use the JSON schema to generate their data model, and when they find `format: date` they interpret that as an ISO 8601 date per RFC 3339 (section 5.6, `full-date`). Trying to parse `2011` or `2021-05` with such code will fail.
Yes, this is the internal contradiction that @miguelzapaton pointed out: it says it accepts ISO 8601, but then restricts it to only full-date, i.e. a subset of ISO 8601. That the tooling will always interpret it as a full date is what @miguelzapaton already pointed out (but unfortunately almost all of that post was edited away; I'll restore it). So if we really want to keep changes to tooling minimal, then we drop `date` - which returns us to my original spec of string + regex only.
original post from @miguelzapaton which unfortunately was almost completely edited away:
Three more thoughts regarding this issue and the related PR:
I) JSON Schema property definitions like the following have an "inner conflict" as they let one (code generators) expect an ISO 8601 date and thwart it in the next line by an incompatible regex:
"type": "string", "format": "date" "pattern": " .. some regex that allows non ISO 8601 compatible values .."
-> either "format": "date" has to be removed or the pattern has to be adapted to be compatible
II) Values like "1979" or "1979-10" can be parsed as dates but they will most likely be treated as "1979-01-01T00:00:00.000Z" or "1979-10-01T00:00:00.000Z"
-> If a passport states only "1979", a DGC with "1979-01-01" seems counterproductive to me
III) What does "backwards compatibility" mean exactly - I think there are different perspectives:
a) A JSON data artefact created with a former JSON Schema revision still validates successfully with a later revision
-> When "format": "date" is removed and the pattern is still compatible there's no breaking change here
b) A piece of software that handles JSON data artefacts created with a former JSON Schema revision will still be able to handle artefacts created with a later revision
-> When "format": "date" is removed and the pattern doesn't enforce a compatible date format then there's probably a breaking change here
c) ??
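Perspectives a) and b) can be made concrete with the two patterns discussed in this thread (a sketch in Python; the "old" pattern is the current full-date schema regex, anchored here for the check, and the "new" one is a relaxed variant permitting year-only and year-month values):

```python
import re

OLD = r"^(19|20)\d{2}-\d{2}-\d{2}$"        # schema 1.0.x: full date only
NEW = r"^(19|20)\d{2}(-\d{2}(-\d{2})?)?$"  # relaxed: YYYY / YYYY-MM / YYYY-MM-DD

def valid(pattern: str, value: str) -> bool:
    # JSON Schema `pattern` semantics: unanchored search
    return re.search(pattern, value) is not None

# Perspective a): old data under the new schema still validates
assert valid(OLD, "1979-04-14")
assert valid(NEW, "1979-04-14")

# Perspective b): new data under the old schema is rejected,
# so verifiers pinned to the old schema break
assert valid(NEW, "1979")
assert not valid(OLD, "1979")
```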
Summary on this one so far:
We need to relax the date spec to allow only year, regardless.
Code generation tooling interprets `date` as `full-date` even if a full date is not present, even though RFC 3339 also defines `date-fullyear`, which would be more suitable in this case.
This looks more like a code generation / tooling issue to me. Assuming you are using ajv for validation then indeed ajv-formats maps date to full-date and has this hard-coded as a regex for full-date in the typescript, so without changing source this can't be modified directly (and I also think this isn't a good approach for other reasons). However, the ajv mechanism can be extended using keyword-with-code-generation-function which may be an option. Yet this is a tool-specific fix. Similarly, the ajv option allowDate also doesn't help us much here.
We can either push this out to tooling and say "tooling issue", or attempt to accommodate the current state of tooling support and try to make lives easier for implementers. I go for the latter, thus dropping `format: date` and going back to my original format of string + regex only.
Only thing left: does this break the API or not? (a la semver definition)
We either move to schema v2 or continue to require a full date. I'm fine with either, but the change has to be done correctly.
Requiring full date is not an option
Schema version I didn't touch at all in this PR, assuming it would come up as a topic.
Requiring full date is not an option
Then I suggest we move the schema to v2.0.0 and do anyOf (string with pattern `^(19|20)\d\d$`, string with `format:date` and pattern `^(19|20)\d\d-\d\d-\d\d`).
Requiring full date is not an option
Then I suggest move the schema to v2.0.0
am still not convinced...
and do anyOf (string with pattern `^(19|20)\d\d$`, string with `format:date` and pattern `^(19|20)\d\d-\d\d-\d\d`).
you would probably want something like `^(19|20)\d\d-\d\d(-\d\d)?` as that last regex (or `{0,1}` instead of `?` if preferred), as we still need to cater for YYYY-MM (or repeat as two patterns in `anyOf()`, and also be consistent with `$` too here)
Ahh now I get what @gabywh and @miguelzapaton are saying: the JSON schema is the foundation for everything else. It must be designed to avoid changes. Then the business rules build on top of it - the business rules are much easier to change, so all of the messy real-world things (like different countries using different representations of missing dates) can go there.
It's not designed to be used by an arbitrary code generator.
The schemas do need cleaning up - as Gaby suggests, removing the confusingly detailed specifications (the regex, format of the date) would remove lots of confusion for the developers here.
... Only thing left: does this break the API or not? (a la semver definition)
@gabywh Great summary!
Maybe the logic of semver is simply not the right one for schema versioning.
A very recommended read:
and do anyOf (string with pattern `^(19|20)\d\d$`, string with `format:date` and pattern `^(19|20)\d\d-\d\d-\d\d`).
simplify to string with pattern `(19|20)\d\d(-\d\d){0,2}`
A very recommended read:
From above link (thanks @miguelzapaton) and very concretely, I would like to propose we move to using: https://snowplowanalytics.com/blog/2014/05/13/introducing-schemaver-for-semantic-versioning-of-schemas/#schemaver as this just fits a lot better (at least, IMHO) for the task at hand. I can update README.md / wiki FAQ where appropriate
I think there's one important difference:
The guys of SNOWPLOW don't seem to target offline working systems.
If this is still a requirement for DGC systems (?) it would also require the "forward compatibility" that I understand @jschlyter described with:
While staying at the same major version, it must be possible to verify all future payloads with any earlier versions of the schema.
So I'm not sure if schemaver does adjust 100% to the needs either as schemaver's main concern is the "backwards-compatibility between the new schema and existing data" and not the "forward-compatibility between an older schema and future data".
So instead of using semver, schemaver, or whatever, you could just write your own "schema versioning policy" in the FAQ, more or less like SNOWPLOW did in the description of schemaver (chapter 2):
I think semver is clear about major / minor / patch for software, and I find it clear here too (i.e. does this change mean I have to re-write my client code due to the change? If not, then it is not a major version number change), but clearly my understanding does not seem to be shared by all.
The big plus for semver is that although it is not a standard in the sense of ISO, it is a widely recognized and accepted standard.
The other extreme would be writing our own, which means we come up with something in isolation and end up having to repeatedly explain it to everyone. This is what semver was expressly designed to avoid. Moreover, this would be an entirely separate piece of work and one we can't afford to be dependent on.
The Snowplow text falls somewhere between these two positions, whilst highlighting nicely the differences / issues with handling schema versioning.
@gabywh if you generate code based on schema v1 it will not be able to read data with `dob` containing only a year, or year+month. Nor can schema v1 be used to validate such data.
However, one could handle this as a misinterpretation of the requirements and just do the change anyway with 1.1.0. At this point it is perhaps more important to move forward with a stable schema than to keep discussing.
Yes, we need the trailing $
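Why the anchors matter: JSON Schema's `pattern` keyword performs an unanchored search, so without `^` and `$` the pattern only needs to match somewhere inside the string (a sketch in Python, whose `re.search` has the same semantics):

```python
import re

UNANCHORED = r"(19|20)\d\d(-\d\d){0,2}"
ANCHORED = r"^(19|20)\d\d(-\d\d){0,2}$"

def pattern_ok(pattern: str, value: str) -> bool:
    # JSON Schema `pattern` semantics: a substring match suffices
    return re.search(pattern, value) is not None

# Without anchors, junk around a valid core still "validates":
assert pattern_ok(UNANCHORED, "born 1979-04-14 maybe")

# With anchors, only the exact forms pass:
assert not pattern_ok(ANCHORED, "born 1979-04-14 maybe")
assert pattern_ok(ANCHORED, "1979-04")
assert pattern_ok(ANCHORED, "1979-04-14")
```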
Folks - no matter what we do - keep in mind that it is not unusual for people's identity cards, passports and other civic (or medical) records to not have a full date of birth, but just the year (in the Netherlands, for about 70k people). Simply because it is not known.
It is perfectly possible for a technical standard to be a bit of an ill fit with reality.
But as we issue these DGC's to real people - we need to accommodate them.
@dirkx we agree, the discussion is whether we need to bump to schema v2 because of this breaking change. I suggest that we turn a blind eye and do v1.1.0.
It is not about turning a blind eye; we need to do this properly rather than hack something together. An argument could even be made that this is a patch fix, so we could go to 1.0.2, since the original regex was simply broken with respect to the incoming data. Thanks for your suggestion in any case; I appreciate your input as something I can take into my own deliberations when making the final decision.
@gabywh if you generate code based on schema v1 it will not be able to read data with `dob` containing only a year, or year+month. Nor can schema v1 be used to validate such data.
Right. So this is about old generated code not handling a newer schema. So you generate the new code. You don't need to change the client side and you also do not need to change any data, since the new generated code will also handle old schema data. That is what gives you the back-compatibility.
Nor can schema v1 be used to validate such data.
This is the vital part. If you've generated a DGC with schema version NEXT (with the relaxed date format), it is not accepted when validating with schema version 1.0.0. Verifiers who are written for schema version 1.0.0 will reject such a DGC unless updated. Hence, all clients doing payload validation must be updated with new code.
If the schema had been backwards compatible, a payload from schema version NEXT would have been accepted by schema version 1.0.0.
Interested folks may have a look at my Proof of Concept for DGC JSON Data Validation with value-extended JSON Schemas:
https://github.com/miguelzapaton/ehn-dgc-schema-json-validator
After two quick validation runs with the currently 507 DGC JSON test files you can also check the effect of the schema change from 1.0.1 to 1.1.0 there:
The log file sizes already give an idea of what happened ... fewer validation errors :)
A diff shows all the details.
It is a standard case that for date of birth either the full date (YYYY-MM-DD) is known, or just an (often estimated) year (YYYY) (e.g. FHIR Patient.birthDate). It can also be the case, but less frequently occurring, that only year and month (YYYY-MM) are known.
The DGC schema must be able to support all cases as valid.
Applying the first part of Postel's law: the DGC schema should accept anything that at least has 4 consecutive decimal digits at the beginning, since we can then interpret this at the minimum as a year. In order to filter out a lot of noise with random values, we can reasonably limit the year range from 1900 to 2099. A simple regex suffices here.
Applying the second part of Postel's law: note as already mentioned elsewhere (https://github.com/ehn-digital-green-development/ehn-dgc-schema/blob/main/README.md, https://github.com/ehn-digital-green-development/ehn-dgc-schema/wiki/FAQ#issuance-of-qr) that it is the primary responsibility of the issuance business rules to generate clean data for submission to the DGC payload.
The current regex for dob is `(19|20)\\d{2}-\\d{2}-\\d{2}` and it now needs to be `^(19|20)\\d{2}`.
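Under JSON Schema's unanchored search semantics, the proposed `^(19|20)\\d{2}` prefix check behaves like this (a sketch in Python):

```python
import re

# Proposed relaxed check: at least a plausible 1900-2099 year at the start
PREFIX = r"^(19|20)\d{2}"

def acceptable(value: str) -> bool:
    # JSON Schema `pattern` semantics: a substring match suffices,
    # so only the leading year is actually constrained.
    return re.search(PREFIX, value) is not None

# Anything starting with a plausible year is accepted...
assert acceptable("1979")
assert acceptable("1979-10")
assert acceptable("1979-10-21")
assert acceptable("1979-XX-XX")
# ...including free-form remainders, which the issuance business
# rules (not the schema) are expected to keep clean:
assert acceptable("1979-foo")

# No plausible year prefix -> rejected
assert not acceptable("XX-1979")
assert not acceptable("79")
```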