Character set (syntactical interoperability)

dimi-schepers commented 3 years ago

During the Core Vocs webinar on 2021-04-23, it was proposed to also specify the character set that should be used. In Germany, the Latin script is used, with a subset of the 6400 characters, but they would also like to be able to indicate other character sets (e.g. Bulgarian) so that syntactical interoperability is also covered. "What about restricting the possible characters to the legally binding transliteration agreements?"

It was argued that defining the text as UTF-8 might be easier than defining the character set. It was clarified that the question is not about character encoding: it is a given that the encoding is UTF-8. The proposal is about specifying, in addition to the language, the set of letters (e.g. Chinese traditional vs Chinese simplified: both are Chinese, but with different character sets, and both can technically be encoded in UTF-8).
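To illustrate the difference between encoding and character set, here is a minimal Python sketch. The sample word (漢字 / 汉字, "Chinese characters") and the helper name `describe` are only assumptions chosen for illustration; they do not come from the webinar.

```python
# Both spellings are valid Unicode and both encode to UTF-8,
# yet they use different code points, so knowing the encoding
# alone does not tell you which character set is in use.
traditional = "漢字"  # Chinese traditional
simplified = "汉字"   # Chinese simplified


def describe(text: str) -> list[str]:
    """Return the code points of text as U+XXXX strings (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


for label, text in (("traditional", traditional), ("simplified", simplified)):
    encoded = text.encode("utf-8")
    # Round-trips through UTF-8 without loss in both cases.
    print(label, describe(text), encoded.decode("utf-8") == text)
```

Both strings round-trip through UTF-8, but their first code points differ (U+6F22 vs U+6C49), which is the kind of distinction the proposal wants to make explicit.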

A German standardisation initiative called DIN SPEC 91379 could serve as inspiration: https://www.din.de/de/wdc-beuth:din21:301228458.

It was added that such discussions and decisions should also reflect how the world actually works, and should therefore be taken beyond the IT and semantic communities. It was mentioned that it could be important to look at ongoing social changes before hard-coding these aspects in a specification. The same applies to the gender discussion. A link from W3C regarding different naming conventions around the world was shared: https://www.w3.org/International/questions/qa-personal-names.en

frank-steimke commented 2 years ago

I'd like to add some remarks about the DIN 91379 character set

The main motivation is that we simply cannot expect everyone to handle, or even recognize, all characters from the Unicode character set. Documents in foreign scripts (Unicode scripts like Arabic, Greek, Thai ...) must be translated or transcribed.
I am sure that the same applies to the exchange of structured digital documents. For example, a request for evidence about a natural person with FamilyName=Θεοδωράκης and GivenName=Μίκης won't return any results from a German registry, whether or not a Mikis Theodorakis has a record in that registry.
Therefore, DIN 91379 covers all ISO recommendations for transcription from other scripts, like ISO 9, ISO 233 etc., into the Unicode Latin script.
It is a well-defined subset of Unicode. All Unicode rules apply to the DIN 91379 subset as well, including the recommendation of UTF-8 encoding.
Only characters that are "in use" and therefore necessary are included. That is why we did not simply define the subset as "all Unicode characters from the Unicode Latin script": this would include phonetic characters or ancient Latin characters.
We expect DIN 91379 to become an official standard of the German national standardization body (DIN, Deutsches Institut für Normung) in December 2021.
The next step will be a proposal for CEN, perhaps on a fast lane, based on the national standard in Germany.
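
As a rough illustration of the registry example above, the following Python sketch rejects names containing non-Latin letters. Note the assumption: it approximates "Latin script" by looking at Unicode character names, which is much broader than the enumerated DIN 91379 repertoire (it would, for instance, accept the phonetic and ancient Latin characters that the standard deliberately leaves out). The function name `is_latin_text` is hypothetical.

```python
import unicodedata


def is_latin_text(text: str) -> bool:
    """Rough check: every letter in text must be a Latin-script letter.

    Assumption: "Latin" is detected via the Unicode character name.
    This is NOT the DIN 91379 character list, only a coarse stand-in.
    """
    for ch in text:
        # Non-letters (spaces, hyphens, combining marks) are ignored here.
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return False
    return True


print(is_latin_text("Θεοδωράκης"))   # False: Greek letters
print(is_latin_text("Theodorakis"))  # True: transcribed form
print(is_latin_text("Müller"))       # True: Latin letter with diacritic
```

A real implementation would compare each character against the published DIN 91379 list instead of relying on character names.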

EmidioStani commented 2 years ago

This issue relates to the character set of a document. In the XML world, the charset is declared at the top level of the XML document: <?xml version="1.0" encoding="UTF-8"?>. In HTML, the charset can be defined inside the HTML document, e.g. <meta charset="UTF-8">.

Thus all the content of such a document follows the declared character encoding.
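
A small Python sketch of what the XML declaration does in practice: the same text serialised under two declared encodings yields different bytes, but a parser that honours the declaration recovers identical content. The element name `givenName` and the sample value are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

name = "M\u00fcller"  # "Müller", an illustrative sample value

# The same document, serialised with two different declared encodings.
utf8_bytes = (
    f'<?xml version="1.0" encoding="UTF-8"?><givenName>{name}</givenName>'
).encode("utf-8")
latin1_bytes = (
    f'<?xml version="1.0" encoding="ISO-8859-1"?><givenName>{name}</givenName>'
).encode("iso-8859-1")

# The raw bytes differ because "ü" is encoded differently...
print(utf8_bytes == latin1_bytes)
# ...but the parser reads the declaration and decodes both to the same text.
print(ET.fromstring(utf8_bytes).text == ET.fromstring(latin1_bytes).text)
```

This only covers the encoding of the whole document; it says nothing about which subset of characters the content is allowed to use.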

There could be two ways:
1) adding a property to the foaf:Document class indicating the charset
2) including a relation between the foaf:Document and cnt:Content classes (the cnt:Content class is described in https://www.w3.org/TR/Content-in-RDF10/#ContentClass)
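
The two options could look roughly like the Turtle produced by the sketch below. Everything under the `ex:` prefix (`ex:charset`, `ex:hasContent`, `ex:evidence1`, `ex:content1`) is a hypothetical placeholder, not an existing term; `cnt:characterEncoding` is, to my reading, the property that Content in RDF provides for this purpose. Plain string templates are used so that the sketch has no dependencies.

```python
# foaf and cnt are existing vocabularies; "ex" is a placeholder namespace.
PREFIXES = """\
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix cnt:  <http://www.w3.org/2011/content#> .
@prefix ex:   <http://example.org/> .
"""


def option_1(doc: str, charset: str) -> str:
    """Option 1: a (hypothetical) charset property directly on the foaf:Document."""
    return f'{doc} a foaf:Document ;\n    ex:charset "{charset}" .\n'


def option_2(doc: str, content: str, charset: str) -> str:
    """Option 2: link the document to a cnt:Content node carrying the encoding."""
    return (
        f"{doc} a foaf:Document ;\n    ex:hasContent {content} .\n"
        f'{content} a cnt:Content ;\n    cnt:characterEncoding "{charset}" .\n'
    )


print(PREFIXES + option_1("ex:evidence1", "UTF-8"))
print(PREFIXES + option_2("ex:evidence1", "ex:content1", "UTF-8"))
```

Option 1 is shorter but requires minting a new property; option 2 reuses an existing W3C vocabulary at the cost of an extra node and a linking property.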

EmidioStani commented 1 year ago

As there is no document associated with Core Person, this issue could be closed. However, implementers of CCCEV, when implementing an Evidence as a document, could consider implementing it as suggested above.

SEMICeu / Core-Person-Vocabulary

Character set (syntactical interoperability) #26