ehn-dcc-development / eu-dcc-schema

Schema for the ehn DCC payload
Apache License 2.0
164 stars 59 forks source link

UVCI checksum calculation #63

Closed martin-lindstrom closed 3 years ago

martin-lindstrom commented 3 years ago

According to the wiki the Luhn-Mod-N charset used during calculation is ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789#/:. This means that the # character whose only purpose is to be used as the delimiter for the UVCI and its checksum may end up as the checksum character, e.g.:

01:SE:EHM/123456789KOOZC6RM7VZ##

Is this really want we want? I think it is sensible to exclude the # character from the charset since it can not appear anywhere in the UVCI.

dirkx commented 3 years ago

Agreed.

jkiddo commented 3 years ago

Can someone verify that the example calculations is indeed correct? I’ve tried to calculate the checksum using the algorithms stated at https://en.wikipedia.org/wiki/Luhn_mod_N_algorithm and my checksum character ends up being β€œ2” when calculated on the string β€œ01 LUX/18737512422923"

jkiddo commented 3 years ago

With reference to the example on https://github.com/ehn-digital-green-development/ehn-dgc-schema/wiki/FAQ#example

I can verify that that 01:SE:EHM/123456789KOOZC6RM7VZ ends up having the checksum char as #

jkiddo commented 3 years ago

@martin-lindstrom

martin-lindstrom commented 3 years ago

Yes, but I don't think that we should include the '#' character in the set. It's not allowed anywhere else than as the delimiter for the checksum character.

jkiddo commented 3 years ago

Right ... makes sense - but will you crosscheck that the example is correct? I don't think it is

martin-lindstrom commented 3 years ago

No.

01LUX/18737512422923 -> Z if we include the # in the set and E if you don't use it.

  /** The valid characters of an UVCI (excluding delimiter characters). */
  private static final char[] UVCI_CHAR_SET = {
      'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
      'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
      'Y', 'Z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' };

  /** The valid characters of an UVCI including delimiters. */
  private final static String UVCI_CHAR_SET_W_DELIMS = new String(UVCI_CHAR_SET) + "#/:";

  /**
   * Implementation stolen from https://en.wikipedia.org/wiki/Luhn_mod_N_algorithm#Algorithm_in_Java.
   * 
   * @param input
   *          the input
   * @return the control character
   */
  private static char generateLuhnModNCheckCharacter(final String input) {
    int factor = 2;
    int sum = 0;
    int n = UVCI_CHAR_SET_W_DELIMS.length();

    // Starting from the right and working leftwards is easier since
    // the initial "factor" will always be "2".
    for (int i = input.length() - 1; i >= 0; i--) {
      int codePoint = UVCI_CHAR_SET_W_DELIMS.indexOf(input.charAt(i));
      int addend = factor * codePoint;

      // Alternate the "factor" that each "codePoint" is multiplied by
      factor = (factor == 2) ? 1 : 2;

      // Sum the digits of the "addend" as expressed in base "n"
      addend = (addend / n) + (addend % n);
      sum += addend;
    }

    // Calculate the number that must be added to the "sum"
    // to make it divisible by "n".
    int remainder = sum % n;
    int checkCodePoint = (n - remainder) % n;

    return UVCI_CHAR_SET_W_DELIMS.charAt(checkCodePoint);
  }
jkiddo commented 3 years ago

So 'No' - as in that you agree that the example is wrong?

martin-lindstrom commented 3 years ago

Yes.

jkiddo commented 3 years ago

@gabywh πŸ‘†

jkiddo commented 3 years ago

.... and the format. Does that follow the urn form as in eg. https://github.com/eu-digital-green-certificates/dgc-testdata/blob/main/common/2DCode/raw/CO2.json#L22 - (which seems to lack the checksum separator and therefor also the checksum character)

PeterEbraert commented 3 years ago

Does the / need to be used between version country code and opaque part?

We are currently going for this format: 01BEVL7890ABCDEFGHIJ12345678#3 (where 3 is the Luhm checksum for the complete part before the #)

-- Don't know if the 3 above is actually the correct checksum char. I am just stating the above, for discussing about the format.

gunnarif commented 3 years ago

Hi! I'm struggling to figure out what is right in this. I have this code, sometimes it matches https://dotnetfiddle.net/xDhpaY

hs14m2b commented 3 years ago

Is there any update on correcting the example and confirming the exact charset? My calculations (using node) agree with Martin's above.

martin-lindstrom commented 3 years ago

@gabywh @dirkx Update the wiki and remove the #from the character set and provide correct examples?

Perhaps also elaborate on the urn:uvci prefix.

jkiddo commented 3 years ago

πŸ‘†is the removal of '#' from the character set considered as part of 1.0.0 or is that part of the next version?

gabywh commented 3 years ago

is the removal of '#' from the character set considered as part of 1.0.0 or is that part of the next version?

won't ever be part of 1.0.0 - could be patch 1.0.1 or probably in next minor 1.1.0. I'm going to try to look at this today (2021-05-14).

gabywh commented 3 years ago

.... and the format. Does that follow the urn form as in eg. https://github.com/eu-digital-green-certificates/dgc-testdata/blob/main/common/2DCode/raw/CO2.json#L22 - (which seems to lack the checksum separator and therefor also the checksum character)

A previous format, I think. Am reviewing all of these as part of #59.

gabywh commented 3 years ago

@gabywh @dirkx Update the wiki and remove the #from the character set and provide correct examples?

Perhaps also elaborate on the urn:uvci prefix.

As per Annex 2 in https://ec.europa.eu/health/sites/default/files/ehealth/docs/vaccination-proof_interoperability-guidelines_en.pdf which - as far as I understand it - has: [2 digit version]:[country code - where possible 2 digit ISO 3166]:[payload/pay/load]#[checksum] field separators:

The prefix URN:UVCI: shall be present when stored in the JSON schema, but is not required to be displayed for the human-readable / printed version

gregsons commented 3 years ago

@gabywh or anyone, question concerning the checksum part - is it required to provide? In the QR code examples not every country provides it now.

dirkx commented 3 years ago

The checksum was specifically ment for human entry - e.g. from paper.

And not anticipated in the digital version - as all data is generally going through digitally signed channels - and thus integrity protected.

The documents are not terribly clear though (as these detailed appendixes did not make it in version 1.0 at the time).

gabywh commented 3 years ago

Prefix:

URN:UVCI:[the rest...]

Notes

  1. The term UVCI is used - where V refers to vaccination - also for Test and Recovery, since all certificate ids are of the form UVCI. Vaccination is where it all started and the term UVCI just stuck.
  2. The checksum is calculated across the whole string - including the URN:UVCI: prefix - so the prefix has to be in uppercase
thinkberg commented 3 years ago

What is the value of urn:uvci except more bytes? If the field is clearly defined it is unnecessary bloat. Also, it is not clearly defined, so that some include it, other don't. I still don't read from the specs that this prefix is expected.

thinkberg commented 3 years ago

Okay, someone put it in and it is till totally useless. We fight for bytes and then we include an identifier that has an unnecessary prefix?

dirkx commented 3 years ago

The appendix details why this prefix needs to be there - and how it helps with resolution in scenario 3 - and how it helps guarantee global uniqueness in an open system - basically capturing the industry best practices that the internet has learned the hard way(rfc 1736, rfc 1737, rfc 8141) . Keeping this field to be a valid URI therefore is valuable.

You could argue of course that this is a short lived system in a closed world. Sometimes that works - but on the internet open-systems tend to win from closed world. Or, worse, give rise to a open world replacement.

gabywh commented 3 years ago

Fixed with commit 2af8559