eu-digital-identity-wallet / eudi-doc-architecture-and-reference-framework

The European Digital Identity Wallet
https://eu-digital-identity-wallet.github.io/eudi-doc-architecture-and-reference-framework/

Character set of PID tstr fields such as family_name should be explicitly defined as Unicode #157

Open joelposti opened 3 months ago

joelposti commented 3 months ago

The character set of the PID's tstr attributes, such as family_name and given_name, is currently undefined in PID Rule Book 1.0.0. It should be defined explicitly. Because so much of the PID Rule Book is inspired by the ISO/IEC 18013-5 mobile driving licence specification, there is a real risk that PID providers and other parties will implement support only for the latin1 character set, since that is the character set ISO/IEC 18013-5 defines for the family_name and given_name attributes. I, and by extension DVV (the Finnish Digital and Population Data Services Agency), which I represent, object to latin1 being the character set for the PID's tstr attributes, because latin1 is too constrained to represent Sámi names (among others) correctly.

We propose that the character set of the PID's tstr attributes be defined explicitly in the PID Rule Book, and that it be Unicode in its full range. It would be even better if the character set were defined explicitly in the ARF for all EUDIW-compliant attestations.

It is important to understand the difference between a character set and a character encoding. A character set is the set of characters that are available for use. A character encoding, on the other hand, is the way those characters are represented in memory as binary. The character encoding of the PID's tstr attributes is implicitly UTF-8, because both the JSON specification and the CBOR specification define UTF-8 as the character encoding. The PID Rule Book also defines UTF-8 as the character encoding for the tstr attributes in the SD-JWT PID. However, the character set is not defined, and we fear that this will result in latin1 being inherited from ISO/IEC 18013-5.
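To make the distinction concrete, here is a minimal Python sketch. The sample string is a made-up illustration, not a name taken from any register, and the helper name `outside_latin1` is hypothetical. It assumes that "latin1" means the 256 code points U+0000 to U+00FF, which is how Python's `latin-1` codec treats it. The letter ŋ (U+014B), used in Northern Sámi, lies outside that range, so the string cannot be expressed in latin1 at all, while UTF-8 encodes it without trouble.

```python
def outside_latin1(text: str) -> list[str]:
    """Return the characters of `text` that are not in the latin1 range."""
    # Assumption: "latin1" is taken as code points U+0000..U+00FF.
    return [ch for ch in text if ord(ch) > 0xFF]


# Illustrative sample only: A-acute, two eng letters, then "el".
sample = "\u00c1\u014b\u014bel"

# Character set question: which characters are available at all?
missing = outside_latin1(sample)  # the two eng letters (U+014B)

# Character encoding question: how are the characters written as bytes?
utf8_bytes = sample.encode("utf-8")  # works for any Unicode string

try:
    sample.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    # The eng letter has no representation in latin1.
    latin1_ok = False
```

A string that passes the `outside_latin1` check with an empty result can still be encoded as UTF-8, which is why fixing the encoding alone does not settle which characters an implementation must accept.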

digeorgi commented 2 weeks ago

Thank you very much for your input.

For the mdoc format, the character set of the PID's tstr attributes is inherited from ISO/IEC 18013-5. For other formats, where the character set is not inherited from existing standards (e.g., SD-JWT VC), the character set of the PID's tstr attributes will be defined explicitly in the PID Rule Book as Unicode in its full range.

joelposti commented 2 weeks ago

> Thank you very much for your input.
>
> For the mdoc format, the character set of the PID's tstr attributes is inherited from ISO/IEC 18013-5. For other formats, where the character set is not inherited from existing standards (e.g., SD-JWT VC), the character set of the PID's tstr attributes will be defined explicitly in the PID Rule Book as Unicode in its full range.

Thank you for your response!

We are happy that the character set of SD-JWT PID tstr attributes will be Unicode in its full range.

However, we are not so happy with the notion that the character set of mdoc PID tstr attributes will be inherited from ISO/IEC 18013-5. Could you provide argumentation as to why you would like to constrain mdoc PID tstr attributes to latin1? What is being defined here is new technology and a completely new type of credential, the PID. Why do we need to consider legacy character sets such as latin1?
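
As a rough illustration of the point about mdoc, the sketch below hand-encodes a CBOR text string (major type 3) using only the Python standard library. The function name `encode_cbor_tstr` is hypothetical and the encoder covers only payloads shorter than 65536 bytes. It is meant to suggest that the CBOR tstr type itself carries UTF-8 and so can hold characters outside latin1; any restriction to latin1 would come from the rules layered on top of it, not from the wire format.

```python
def encode_cbor_tstr(text: str) -> bytes:
    """Encode `text` as a definite-length CBOR text string (major type 3)."""
    payload = text.encode("utf-8")  # CBOR text strings are UTF-8
    n = len(payload)
    if n < 24:
        # Length fits in the low 5 bits of the initial byte (0x60..0x77).
        head = bytes([0x60 | n])
    elif n < 0x100:
        # Initial byte 0x78, followed by a 1-byte length.
        head = bytes([0x78, n])
    elif n < 0x10000:
        # Initial byte 0x79, followed by a 2-byte big-endian length.
        head = bytes([0x79]) + n.to_bytes(2, "big")
    else:
        raise ValueError("payload too long for this sketch")
    return head + payload


# The eng letter (U+014B) is outside latin1 but encodes without any issue.
encoded = encode_cbor_tstr("\u014b")
```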