ehn-dcc-development / eu-dcc-schema

Schema for the ehn DCC payload
Apache License 2.0

relax dob regex to check year only #73

Closed gabywh closed 3 years ago

gabywh commented 3 years ago

It is a standard case that for date of birth either the full date (YYYY-MM-DD) is known, or just an (often estimated) year (YYYY) (e.g. FHIR Patient.birthDate). It can also be the case, though less frequently occurring, that only the year and month (YYYY-MM) are known.

The DGC schema must be able to support all cases as valid.

Applying the first part of Postel's law: the DGC schema should accept anything that at least has 4 consecutive decimal digits at the beginning, since we can then interpret this at the minimum as a year. In order to filter out a lot of noise with random values, we can reasonably limit the year range from 1900 to 2099. A simple regex suffices here.

Applying the second part of Postel's law: note as already mentioned elsewhere (https://github.com/ehn-digital-green-development/ehn-dgc-schema/blob/main/README.md, https://github.com/ehn-digital-green-development/ehn-dgc-schema/wiki/FAQ#issuance-of-qr) that it is the primary responsibility of the issuance business rules to generate clean data for submission to the DGC payload.

The current regex for dob is "(19|20)\\d{2}-\\d{2}-\\d{2}" and it now needs to be "^(19|20)\\d{2}"
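To see the practical difference, here is a small sketch using Python's `re` (note that a JSON Schema `pattern` is an unanchored substring search, and the JSON-escaped `\\d` is `\d` at the regex level):

```python
import re

# Current full-date pattern and the proposed year-prefix relaxation.
current = re.compile(r"(19|20)\d{2}-\d{2}-\d{2}")
proposed = re.compile(r"^(19|20)\d{2}")

for dob in ["1979-04-14", "1979", "1979-10", "1879"]:
    print(f"{dob!r}: current={bool(current.search(dob))} proposed={bool(proposed.search(dob))}")
```

The full date still matches both; year-only and year-month values match only the relaxed pattern; a year outside 1900-2099 matches neither.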

martin-lindstrom commented 3 years ago

So you mean that a date of birth only containing the year should be allowed? When was this decision made?

Informing @sondaica also...

jschlyter commented 3 years ago

Such a breaking change of the schema will require changes to all existing implementations. The schema uses format date and therefore requires a full date as YYYY-MM-DD.

This will make the schema version go to 2.0.0 as the change is not backwards compatible.

sondaica commented 3 years ago

How is that according to the regulation? It states date of birth.

dirkx commented 3 years ago

It was found that it is somewhat common for dates of birth to not be exactly known; e.g. when just a year is known; or the UNHCR style "summer" or "winter" (in NL - about 70k people).

As a result - it is common for passports not to contain these. Some countries round this to 01-01 and/or 06-01 -- but not all do. A variation with just the year and month known - but not the date is also happening.

The goal of the DOB is binding with the passport.

As the passport ALSO does NOT contain these - the regex needs adjustment.

sondaica commented 3 years ago

So then maybe we should use a regexp according to the ICAO Machine Readable Travel Document recommendation, with X used for unknown elements. For example YYYY-XX-XX when only YYYY is known.

martin-lindstrom commented 3 years ago

We also need to remove the "format": "date" from the definition of dob.

This small change will have rather big impact on all implementations. And I agree with @jschlyter that this is a breaking change.

miguelzapaton commented 3 years ago

This one should work for all and is fully backwards compatible:


"dob": {
      "title": "Date of birth",
      "description": "Date of Birth of the person addressed in the DGC. ISO 8601 date format restricted to range 1900-2099",
      "type": "string",
      "oneOf": [
        {
          "format": "date",
          "pattern": "^(?:((19|20)\\d{2}-\\d{2}-\\d{2})|((19|20)\\d{2}))$",
          "examples": [
            "1979-04-14",
            "1979"
          ]
        },
        {
          "pattern": "^((19|20)\\d{2}-XX-XX)$",
          "examples": [
            "1979-XX-XX"
          ]
        }
      ]
    },
jschlyter commented 3 years ago

@miguelzapaton that is better, but since the date format is no longer mandatory it is still a breaking change. I see no other option than to move to 2.0.0 (or break the semantic version scheme).

miguelzapaton commented 3 years ago

@miguelzapaton that is better, but since the date format is no longer mandatory it is still a breaking change. I see no other option than to move to 2.0.0 (or break the semantic version scheme).

If format: 'date' is removed then a date like 1979-99-44 would be validated correctly.

Imo there's no breaking change in this case as dob is still type: 'string' (it was never a real date type) and the change is fully backwards compatible.

oneOf would no longer be needed and it could be defined like this (untested!):

"dob": {
      "title": "Date of birth",
      "description": "Date of Birth of the person addressed in the DGC. ISO 8601 date format restricted to range 1900-2099",
      "type": "string",
      "pattern": "^(?:((19|20)\\d{2}-\\d{2}-\\d{2})|((19|20)\\d{2})|((19|20)\\d{2}-XX-XX))$",
      "examples": [
        "1979-04-14",
        "1979",
        "1979-XX-XX",
      ]
    },
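As a quick illustration (a sketch using Python's `re`) of what this combined pattern does and does not accept:

```python
import re

# Combined dob pattern from the schema snippet above ("\\d" unescapes to "\d").
dob = re.compile(r"^(?:((19|20)\d{2}-\d{2}-\d{2})|((19|20)\d{2})|((19|20)\d{2}-XX-XX))$")

assert dob.match("1979-04-14")
assert dob.match("1979")
assert dob.match("1979-XX-XX")
assert dob.match("1979-99-44")   # impossible date still passes
assert not dob.match("1979-10")  # YYYY-MM is not covered by this pattern
```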

To avoid wrong dates like 1979-99-44 the first regexp part could still be strengthened.

Here you can find an extreme example:

https://stackoverflow.com/questions/15491894/regex-to-validate-date-format-dd-mm-yyyy

You could even try to find the exact regexp that corresponds to the definition of format: 'date' to make it semantically equal:

https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.section.7.3.1

Miguel

martin-lindstrom commented 3 years ago

Imo there's no breaking change in this case as dob is still type: 'string' (it was never a real date type) and the change is fully backwards compatible.

It most certainly was a real date type. And most of the code generated from the schema expects a real date.

ryanbnl commented 3 years ago

This regex will match dates from 1900 to 2099 in the ISO 8601 acceptable formats (YYYY, YYYY-MM, YYYY-MM-DD)

^((19|20)\d{2}|(19|20)\d{2}-(01|02|03|04|05|06|07|08|09|10|11|12)|(19|20)\d{2}-(01|02|03|04|05|06|07|08|09|10|11|12)-(01|02|03|04|05|06|07|08|09|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31))$
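The enumeration style works but is verbose; numeric range classes give a compact equivalent. A sketch (the classes `0[1-9]`, `1[0-2]`, `[12]\d`, `3[01]` are my shorthand, and calendar validity such as February 30 is still left to business rules):

```python
import re

# YYYY, YYYY-MM, or YYYY-MM-DD; years 1900-2099, months 01-12, days 01-31.
dob = re.compile(r"^(19|20)\d{2}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?$")

assert dob.match("1979")
assert dob.match("1979-04")
assert dob.match("1979-04-14")
assert not dob.match("1979-13")     # month out of range
assert not dob.match("1979-04-32")  # day out of range
assert dob.match("1979-02-30")      # calendar-invalid dates still pass
```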

I have a large set of test cases should you wish to use it. Let me know :)

gabywh commented 3 years ago

Such a breaking change of the schema will require changes to all existing implementations. The schema uses format date and therefore requires a full date as YYYY-MM-DD.

This will make the schema version go to 2.0.0 as the change is not backwards compatible.

this is not breaking back compat

gabywh commented 3 years ago

Such a breaking change of the schema will require changes to all existing implementations. The schema uses format date and therefore requires a full date as YYYY-MM-DD.

no

This will make the schema version go to 2.0.0 as the change is not backwards compatible.

no

It was found that it is somewhat common for dates of birth to not be exactly known; e.g. when just a year is known; or the UNHCR style "summer" or "winter" (in NL - about 70k people). As a result - it is common for passports not to contain these. Some countries round this to 01-01 and/or 06-01 -- but not all do. A variation with just the year and month known - but not the date is also happening.

I think you mean "day" and not "date" here in

but not the date is also happening

?

The goal of the DOB is binding with the passport. As the passport ALSO does NOT contain these - the regex needs adjustment.

Yes. Also, once again, tech people are losing focus: the JSON schema is not the final arbiter here, despite the temptation for coders to consider it as such.

From the README.md https://github.com/ehn-digital-green-development/ehn-dgc-schema/blob/next/README.md

The schema in conjunction with business rules (which may also be specific to a Member State) shall ensure conformity to the EU legislation, whereby the burden of conformity lies with the business rules and the DGC JSON schema plays a supporting role in allowing the data to be serialized and deserialized in a flexible, yet structured, manner

gabywh commented 3 years ago

This regex will match dates from 1900 to 2099 in the ISO 8601 acceptable formats (YYYY, YYYY-MM, YYYY-MM-DD)

^((19|20)\d{2}|(19|20)\d{2}-(01|02|03|04|05|06|07|08|09|10|11|12)|(19|20)\d{2}-(01|02|03|04|05|06|07|08|09|10|11|12)-(01|02|03|04|05|06|07|08|09|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31))$

I have a large set of test cases should you wish to use it. Let me know :)

Stop. This is going completely the wrong way here! Once again: the JSON is not the final arbiter.

gabywh commented 3 years ago

I'll re-state Postel's law once more: "generous with what you accept, strict with what you send"

asitplus-pteufl commented 3 years ago

@gabywh agree with all the things above, but not with the aspect that the schema should only play a supporting role. The schema is the schema, it is the minimum structure that is defined, having a fuzzy landscape for the schema is very difficult to handle. after the min. structure is established, the business rules apply.

jschlyter commented 3 years ago

The schema is a formal specification of the business rules and it is correct that the schema is not the arbiter. However, the schema is the technical control, and changing the allowed syntax for a property to a more permissive format is indeed a breaking change, as validating a `dob` such as `2011` with schema 1.0.0 will fail. Following semver, it's fine to remove `format: date`, but we have to bump the major version.

Code generators use the JSON schema to generate their data model, and when they find format: date they interpret that as an ISO 8601 date per RFC 3339 (section 5.6, full-date). Trying to parse 2011 or 2021-05 with such code will fail.

We either move to schema v2 or continue to require a full date. I'm fine with either, but the change has to be done correctly.
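The code-generation point can be made concrete with a sketch: a full-date parser (here Python's `strptime`, standing in for generated model code) rejects a year-only value outright.

```python
from datetime import datetime

# Generated code typically parses dob as an RFC 3339 full-date (YYYY-MM-DD).
datetime.strptime("1979-04-14", "%Y-%m-%d")  # parses fine

try:
    datetime.strptime("2011", "%Y-%m-%d")
except ValueError:
    print("year-only dob rejected by a full-date parser")
```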

miguelzapaton commented 3 years ago

III) What does "backwards compatibility" mean exactly? I think there are different perspectives:

While staying at the same major version, it must be possible to verify all future payloads with any earlier versions of the schema.

gabywh commented 3 years ago

So then maybe we should use a regexp according to the ICAO Machine Readable Travel Document recommendation, with X used for unknown elements. For example YYYY-XX-XX when only YYYY is known.

which is not valid ISO 8601 and therefore breaks the EU eHN spec

miguelzapaton commented 3 years ago

III) What does "backwards compatibility" mean exactly? I think there are different perspectives:

While staying at the same major version, it must be possible to verify all future payloads with any earlier versions of the schema.

So if it is like this, new minor schema versions can only be stricter with respect to existing properties or property-enhancing (as additionalProperties is true by default)?

gabywh commented 3 years ago

@gabywh agree with all the things above, but not with the aspect that the schema should only play a supporting role.

That is its role here, as defined right from the beginning.

The schema is the schema,

umm.. yes agree

it is the minimum structure that is defined,

yes agree

having a fuzzy landscape for the schema is very difficult to handle.

what "fuzzy landscape" ? The schema is defined as it is.

after the min. structure is established, the business rules apply.

yes agree

gabywh commented 3 years ago

III) What does "backwards compatibility" mean exactly? I think there are different perspectives:

While staying at the same major version, it must be possible to verify all future payloads with any earlier versions of the schema.

So if it is like this, new minor schema versions can only be stricter with respect to existing properties or property-enhancing (as additionalProperties is true by default)?

semver is clear about when to increment the major version: "MAJOR version when you make incompatible API changes". IOW: do you need to re-write the client? In this case no, we still use an ISO 8601 date. The valid cases that were valid before will still be valid. The relaxation of the spec in the schema means that some cases that were previously invalid will now be considered valid, which is the whole purpose of relaxing the dob regex.

gabywh commented 3 years ago

The schema is a formal specification of the business rules and it is correct that the schema is not the arbiter.

In which case it cannot be the formal specification of the business rules, by definition.

However, the schema is the technical control

Which means what? I make the decision in my business rules what I am willing to generate / accept.

and changing the allowed syntax for a property to a more permissive format is indeed a breaking change, as validating a `dob` such as `2011` with schema 1.0.0 will fail. Following semver, it's fine to remove `format: date`, but we have to bump the major version.

Code generators use the JSON schema to generate their data model, and when they find format: date they interpret that as a ISO8601 date per RFC3339 (section 5.6, full-date). Trying to parse 2011 or 2021-05 with such code will fail.

Yes, this is the internal contradiction that @miguelzapaton pointed out: it says it accepts ISO 8601, but then drops it to only full-date, i.e. a subset of ISO 8601. That the tooling will always interpret this as a full date is what @miguelzapaton already pointed out (but unfortunately almost all of that post was edited away; I'll restore it). So if we really want to keep changes to tooling minimal, then we drop `format: date`, which returns us to my original spec of string + regex only.

gabywh commented 3 years ago

original post from @miguelzapaton, which unfortunately was almost completely edited away:


Three more thoughts regarding this issue and the related PR:

I) JSON Schema property definitions like the following have an "inner conflict" as they let one (code generators) expect an ISO 8601 date and thwart it in the next line by an incompatible regex:

"type": "string", "format": "date" "pattern": " .. some regex that allows non ISO 8601 compatible values .."

-> either "format": "date" has to be removed or the pattern has to be adapted to be compatible

II) Values like "1979" or "1979-10" can be parsed as dates but they will most likely be treated as "1979-01-01T00:00:00.000Z" or "1979-10-01T00:00:00.000Z"

-> If a passport states only "1979", a DGC with "1979-01-01" seems counterproductive to me

III) What does "backwards compatibility" mean exactly? I think there are different perspectives:

a) A JSON data artefact created with a former JSON Schema revision still validates successfully with a later revision

-> When "format": "date" is removed and the pattern is still compatible there's no breaking change here

b) A piece of software that handles JSON data artefacts created with a former JSON Schema revision will still be able to handle artefacts created with a later revision

-> When "format": "date" is removed and the pattern doesn't enforce a compatible date format then there's probably a breaking change here

c) ??

gabywh commented 3 years ago

Summary on this one so far:

  1. We need to relax the date spec to allow a year-only value, regardless.

  2. Code generation tooling interprets date as full-date even when a full date is not present, even though RFC 3339 also defines date-fullyear, which would be more suitable in this case.

This looks more like a code generation / tooling issue to me. Assuming you are using ajv for validation, then indeed ajv-formats maps date to full-date and has this hard-coded as a regex in the TypeScript source, so it can't be modified directly without changing the source (and I also think that isn't a good approach for other reasons). However, the ajv mechanism can be extended using keyword-with-code-generation-function, which may be an option. Yet this is a tool-specific fix. Similarly, the ajv option allowDate doesn't help us much here either.

We can either push this out to tooling and call it a "tooling issue", or attempt to accommodate the current state of tooling support and try to make life easier for implementers. I go for the latter, thus dropping format:date and going back to my original format of string + regex only.

Only thing left: does this break the API or not? (a la the semver definition)

gabywh commented 3 years ago

We either move to schema v2 or continue to require a full date. I'm fine with either, but the change has to be done correctly.

Requiring full date is not an option

Schema version I didn't touch at all in this PR, assuming it would come up as a topic.

jschlyter commented 3 years ago

Requiring full date is not an option

Then I suggest move the schema to v2.0.0 and do anyOf (string with pattern ^(19|20)\d\d$, string with format:date and pattern ^(19|20)\d\d-\d\d-\d\d).

gabywh commented 3 years ago

Requiring full date is not an option

Then I suggest move the schema to v2.0.0

am still not convinced...

and do anyOf (string with pattern ^(19|20)\d\d$, string with format:date and pattern ^(19|20)\d\d-\d\d-\d\d).

you would probably want something like ^(19|20)\d\d-\d\d(-\d\d)? as that last regex (or {0,1} instead of ?, if preferred), as we still need to cater for YYYY-MM (or repeat it as two patterns in anyOf(), and also be consistent with the trailing $ here)

ryanbnl commented 3 years ago

Ahh now I get what @gabywh and @miguelzapaton are saying: the JSON schema is the foundation for everything else. It must be designed to avoid changes. Then the business rules build on top of it - the business rules are much easier to change, so all of the messy real-world things (like different countries using different representations of missing dates) can go there.

It's not designed to be used by an arbitrary code generator.

The schemas do need cleaning up - as Gaby suggests, removing the confusingly detailed specifications (the regex, the date format) would remove a lot of confusion for the developers here.

miguelzapaton commented 3 years ago

... Only thing left: does this break the API or not? (a la the semver definition)

@gabywh Great summary!

Maybe the logic of semver is simply not the right one for schema versioning.

A very recommended read:

https://snowplowanalytics.com/blog/2014/05/13/introducing-schemaver-for-semantic-versioning-of-schemas/

gabywh commented 3 years ago

and do anyOf (string with pattern ^(19|20)\d\d$, string with format:date and pattern ^(19|20)\d\d-\d\d-\d\d).

simplify to string with pattern `(19|20)\d\d(-\d\d){0,2}`

gabywh commented 3 years ago

A very recommended read:

https://snowplowanalytics.com/blog/2014/05/13/introducing-schemaver-for-semantic-versioning-of-schemas/

From the above link (thanks @miguelzapaton), and very concretely, I would like to propose we move to using https://snowplowanalytics.com/blog/2014/05/13/introducing-schemaver-for-semantic-versioning-of-schemas/#schemaver as this just fits the task at hand a lot better (at least IMHO). I can update README.md / the wiki FAQ where appropriate.

miguelzapaton commented 3 years ago

I think there's one important difference:

The guys at Snowplow don't seem to target systems that work offline.

If this is still a requirement for DGC systems (?) it would also require the "forward compatibility" that I understand @jschlyter described with:

While staying at the same major version, it must be possible to verify all future payloads with any earlier versions of the schema.

So I'm not sure schemaver fits the needs 100% either, as schemaver's main concern is the "backwards-compatibility between the new schema and existing data" and not the "forward-compatibility between an older schema and future data".

So instead of using semver, schemaver, or whatever, you could just write your own "schema versioning policy" in the FAQ, more or less like Snowplow did in the description of schemaver (chapter 2):

https://snowplowanalytics.com/blog/2014/05/13/introducing-schemaver-for-semantic-versioning-of-schemas/

gabywh commented 3 years ago

I think semver is clear about major / minor / patch for software, and I find it clear here too (i.e. does this change mean I have to re-write my client code? If not, then it is not a major version change), but clearly my understanding is not shared by all.

The big plus for semver is that although it is not a standard in the sense of ISO, it is a widely recognized and accepted standard.

The other extreme would be writing our own, which means we come up with something in isolation and end up having to repeatedly explain it to everyone. This is what semver was expressly designed to avoid. Moreover, this would be an entirely separate piece of work, and one we can't afford to be dependent on.

The Snowplow text falls somewhere between these two positions, whilst highlighting nicely the differences / issues with handling schema versioning.


jschlyter commented 3 years ago

@gabywh if you generate code based on schema v1 it will not be able to read data with dob containing only a year, or year+month. Nor can schema v1 be used to validate such data.

However, one could handle this as a misinterpretation of the requirements and just do the change anyway as 1.1.0. At this point it is perhaps more important to move forward with a stable schema than to keep discussing.

miguelzapaton commented 3 years ago

@gabywh No trailing $ ?

https://regexr.com/5ti7a

https://regexr.com/5tgqa

jschlyter commented 3 years ago

Yes, we need the trailing $
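A minimal sketch of why the anchors matter: a JSON Schema `pattern` performs an unanchored substring search, so without `^` and `$` trailing garbage would slip through.

```python
import re

unanchored = re.compile(r"(19|20)\d\d(-\d\d){0,2}")
anchored = re.compile(r"^(19|20)\d\d(-\d\d){0,2}$")

assert unanchored.search("1979-04-14garbage")   # substring match slips through
assert not anchored.match("1979-04-14garbage")  # anchors reject it
assert anchored.match("1979")
assert anchored.match("1979-04")
assert anchored.match("1979-04-14")
```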

dirkx commented 3 years ago

Folks - no matter what we do - keep in mind that it is not unusual for people's identity cards, passports, and other civic (or medical) records to not have a full date of birth, but just the year (in The Netherlands, for about 70k people), simply because it is not known.

It is perfectly possible for a technical standard to be a bit of an ill fit with reality.

But as we issue these DGCs to real people, we need to accommodate them.

jschlyter commented 3 years ago

@dirkx we agree, the discussion is whether we need to bump to schema v2 because of this breaking change. I suggest that we turn a blind eye and do v1.1.0.

gabywh commented 3 years ago

@dirkx we agree, the discussion is whether we need to bump to schema v2 because of this breaking change. I suggest that we turn a blind eye and do v1.1.0.

It is not turning a blind eye; we need to do this properly rather than hack something together. An argument could even be made that this is a patch fix, so we could go to 1.0.2, since the original regex was simply broken with respect to the incoming data. Thanks for your suggestion in any case; I appreciate your input as something I can take into my own deliberations when making the final decision.

gabywh commented 3 years ago

@gabywh if you generate code based on schema v1 it will not be able to read data with dob containing only a year, or year+month. Nor can schema v1 be used to validate such data.

Right. So this is about old generated code not handling a newer schema. So you generate the new code. You don't need to change the client side, and you also do not need to change any data, since the newly generated code will also handle old schema data. That is what gives you the backwards compatibility.

jschlyter commented 3 years ago

Nor can schema v1 be used to validate such data.

This is the vital part. If you've generated a DGC with schema version NEXT (with the relaxed date format), it is not accepted when validating with schema version 1.0.0. Verifiers written for schema version 1.0.0 will reject such a DGC unless updated. Hence, all clients doing payload validation must be updated with new code.

If the schema had been backwards compatible, a payload from schema version NEXT would have been accepted by schema version 1.0.0.

miguelzapaton commented 3 years ago

Interested folks may have a look at my Proof of Concept for DGC JSON Data Validation with value-extended JSON Schemas:

https://github.com/miguelzapaton/ehn-dgc-schema-json-validator

After two quick validation runs with currently 507 DGC JSON test files you can also check the effect of the schema change from 1.0.1 to 1.1.0 there:

Validation run with schema version 1.0.1

https://github.com/miguelzapaton/ehn-dgc-schema-json-validator/blob/main/log/dgc-testdata-latest/validation_example_run_20210526_schema_version_1-0-1/validation_run_20210526_111304.log

Validation run with schema version 1.1.0

https://github.com/miguelzapaton/ehn-dgc-schema-json-validator/blob/main/log/dgc-testdata-latest/validation_example_run_20210526_schema_version_1-1-0/validation_run_20210526_111214.log

The log file sizes already give an idea of what happened ... fewer validation errors :)

A diff shows all the details..