SEMICeu / Core-Person-Vocabulary

This is the issue tracker for the maintenance of Core Person Vocabulary

Consolidated diagrams Core Vocabularies #37

Closed aleksandralavreneva closed 1 year ago

aleksandralavreneva commented 2 years ago

As introduced during the fifth webinar, two consolidated diagrams have been produced combining all core vocabularies: Core Person Vocabulary, Core Location Vocabulary, Core Business Vocabulary and Core Public Organization Vocabulary. These diagrams intend to give an overview of the classes and properties of the different vocabularies.

The Consolidated diagram presents them in an exhaustive manner, while the Simplified version focuses on the main concepts of each vocabulary and their connections.

With this issue, we would like to invite you to provide feedback on these diagrams.

janbmgo commented 2 years ago

Thanks for sharing, especially the consolidated diagrams. Following version one I did some consolidation (privately) and included those V1 consolidated diagrams in (analysis) work for (government) customers.

One striking difference is the omission in V2 of the attribute Identifier type. As I understand it, in V2 this type is assumed to be (somehow) part of the identifier, or derived from the issuing authority (let's call this the implicit type). On the other hand, the Identifier class is used for identifying persons, legal entities, public organisations (and beyond that bank accounts, vehicles, etc.). This makes it very attractive and practical to have an explicit identifier type, and to use identifier type values in constraints on relations between the identified "core class" and the Identifiers (e.g. VAT number for Legal Entity: at first "business entities", but in Belgium public organisations also have a VAT number). Explicit or implicit identifier type is the first issue I have with V2.

A second issue concerns the dates (dateOfIssue, ...). For a VAT number, or for instance the National Insurance Number in Belgium, nobody knows the exact date of issue. In the common case it is shortly after the date of business creation, or shortly after birth, and validity extends beyond death or end-of-life. For immigrants receiving the number is part of their "entry procedures" (call this a "direct and permanent" identifier, which Anglo-Saxon countries refused to use until recently, with the TIN). Contrast this with a passport, a visa, or an ID card (here the identifier is that of the document itself): date of issue and expiry date are relevant (and semantically this relates to the evidence concept: evidence of citizenship, ...). In situations without direct identifiers, such as international travel, "indirect and non-permanent identifiers" are commonplace (the identifier and type of the document help to identify the entity).
My recommendation is that a "pragmatic merging" of properties for the class Identifier includes:

However:

Use case to consider: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32018R1240
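As a rough illustration of the explicit identifier type argued for above, the following minimal Python sketch attaches a type to each identifier and uses it in a constraint between a "core class" and its identifiers. The class names, type names, and the `ALLOWED_TYPES` table are assumptions made for the example; they are not part of the Core Vocabularies.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical constraint table: identifier types permitted per core class.
# The entries only mirror the examples in the comment above.
ALLOWED_TYPES = {
    "LegalEntity": {"VAT"},
    "PublicOrganisation": {"VAT"},  # e.g. Belgium, per the comment
    "Person": {"NationalInsuranceNumber", "Passport"},
}


@dataclass(frozen=True)
class Identifier:
    value: str
    identifier_type: str                 # the explicit type
    date_of_issue: Optional[str] = None  # often unknown for permanent identifiers
    expiry_date: Optional[str] = None    # relevant for document identifiers


def is_allowed(core_class: str, identifier: Identifier) -> bool:
    """Check whether this identifier type may be attached to the core class."""
    return identifier.identifier_type in ALLOWED_TYPES.get(core_class, set())


vat = Identifier(value="XX-0000", identifier_type="VAT")  # placeholder value
print(is_allowed("LegalEntity", vat), is_allowed("Person", vat))
```

With an implicit type, the same check would first have to infer the type from the value or from the issuing authority.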

fititnt commented 2 years ago

This makes it very attractive and practical to have an explicit identifier type

I love such ideas! While I'm new here, I think this idea of an explicit identifier is closer to what they call an "application profile" (in the sense of direct use). This means that even for countries which can use English, the exact term varies (often depending on the issuing organization).

For immigrants receiving the number is part of their "entry procedures" (...) (which countries)... refused to use

Another relevant point is that some subjects are sensitive (not in terms of privacy, but because collaborators become targets). So besides being viable (having data conventions with documentation that allows de facto implementation), on these issues even well-intentioned workers from governments or companies may be forced to delay (often by bureaucracy, the need to comply with local laws, waiting for approval, ...), because the data can often be used to track misbehavior. This is quite complicated even if the leadership (prime minister/president) wants it, and it is harder still at the sub-national or organizational level (think of police departments not implementing it because, if they do while others do not, they make the news and are punished).

the relationship between evidence and (indirect) identification should be explored further;

I think this is CCCEV. Until 4 days ago, on https://github.com/SEMICeu/Core-Person-Vocabulary/issues/39#issuecomment-994657852, I was not aware that this was a thing. Compared to CPV, CCCEV is not trivial to implement with translations of CCCEV alone.


My approach on these 3 topics

TL;DR: these points are a quick brainstorm following @janbmgo's comments. I'm less sure about Part 3, which seems to be Goossenaerts' focus for production use, than about Part 1.

Part 1

At least for the Core Vocabularies (and maybe application profiles, if they get translations), I think that in the coming months I could have proofs of concept of a multilingual format. The main goal would be, based on templated files, to read one file containing everything needed and specify the target language. Then it would not only be easier to generate CPV documentation in languages such as Portuguese, but the result could also be reused directly to create something closer to application profiles.

Then, in addition to the translations, we could also have language tags where BCP47 explicitly labels the country, and (for several standards inside a country) conventions based on BCP47 private-use (x-) extensions (https://tools.ietf.org/rfc/bcp/bcp47#section-2.2.7). This could handle the fact that some countries are likely to be already using other terms. By reusing the logic of BCP47, the only additional language tags needed would be the ones not already covered by the base translations.
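To make the fallback idea concrete, here is a minimal Python sketch of a truncation-style lookup in the spirit of the RFC 4647 "lookup" scheme: the most specific tag is tried first, then subtags are dropped from the end until a translation is found. The tags, the private-use subtag `agencya`, and the labels are hypothetical placeholders, not real conventions.

```python
from typing import Dict, Optional


def lookup_term(terms: Dict[str, str], tag: str) -> Optional[str]:
    """Return the label for the most specific matching language tag.

    Keys of `terms` are expected in lowercase. Subtags are removed from
    the end one at a time; a dangling single-character subtag (such as
    the private-use singleton "x") is removed together with its content.
    """
    subtags = tag.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in terms:
            return terms[candidate]
        subtags.pop()
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return None


# Hypothetical translations of one concept; only the exceptions to the
# base Portuguese label need their own entry.
terms = {
    "pt": "base Portuguese label",
    "pt-br-x-agencya": "label used by one agency in Brazil",
}

print(lookup_term(terms, "pt-BR-x-agencya"))  # agency-specific label
print(lookup_term(terms, "pt-PT"))            # falls back to "pt"
```

Only tags whose wording differs from the base translation need an entry; everything else resolves to the base language.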

This might be easier to explain in practice (which is why, even once the tooling allows it, it would still take months to prepare translations). But my point here is to partly automate the generation of translations (and, as a side effect, make it reusable for data standards).

Part 2

For Part 1 to be effective in making translations ready to use, we need to be realistic: even well-intentioned governments and their regions that are willing to replace old local standards can take time. Also, since in the meantime they may be using other terms to mean the same thing, would it make sense to have a public list of the terms used? (Ideas such as "data by request" don't work in practice.)

But on sensitive topics, unless the terms used are already compiled by some cross-national organization on that topic and SEMICeu republishes them, I think mentioning the names of persons (or even "the proposal for this term used locally comes from government") will attract less collaboration. First, it is unrealistic to wait for every government to have someone submit terms. Then there is a problem for those who are collaborating but could become more exposed and eventually be removed from their jobs and replaced with someone who will avoid helping. I think a win-win situation here would be to allow anyone to suggest local terms, as long as there are documents which attest that such terms exist (like a public form or a webpage).

In general, most issues which are good to discuss here would fall under Part 1 (which is more technical). But concepts which are necessary for evidence in particular (see Part 3) would need not only some easier way to get collaboration on source concepts (and some code to allow translations, even if the source term changes), but also some recognition that the best specialists may not want public notoriety for this. It is also very important that the terms get published and can be found, otherwise the audience will not use them. On the practical side, we at @HXL-CPLP even use simple online spreadsheets (which could be hidden behind http://proxy.hxlstandard.org/) to generate everything else (from data in other formats like JSON and XML to strings used in scripts).

Part 3

My current thinking on translating CCCEV so that it is closer to being directly usable (as in tabular data exchange) is that it is better expressed as narrow data. I found similar issues here (https://github.com/EticaAI/hxltm/issues/11, use case https://github.com/EticaAI/tico-19-hxltm/blob/gh-pages/data/original/tico-19-terminology-google.csv), where the columns are fixed but most of the meaning is carried by the data in the columns. For example, both Facebook and Google contributed word lists ("terminologies" without definitions) for over 100 languages, but the variables (column names) are just a few. The wide-format equivalent (which is friendlier for humans to work with directly, and for optimized software access) would have over 100 more columns. And each new column means that the database schema would need to change.
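The difference between the two layouts can be sketched in a few lines of Python. The column names (`concept`, `language`, `term`) and the sample rows are assumptions for illustration and do not reproduce the actual columns of the linked CSV.

```python
from collections import defaultdict

# Narrow layout: a fixed set of columns; adding a language adds rows,
# not columns, so the schema never changes.
narrow_rows = [
    {"concept": "c1", "language": "en", "term": "passport"},
    {"concept": "c1", "language": "pt", "term": "passaporte"},
    {"concept": "c2", "language": "en", "term": "visa"},
]


def narrow_to_wide(rows):
    """Pivot narrow rows into one record per concept, one key per language."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row["concept"]][row["language"]] = row["term"]
    return dict(wide)


wide = narrow_to_wide(narrow_rows)
print(wide["c1"])  # one entry per language for concept c1
```

In the wide form every language becomes a key (a column, in a table), which is convenient to read but forces a schema change whenever a language is added.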

I don't think I will have a proof of concept of something such as CCCEV using this type of transposition anytime soon. But if we consider storing it in narrow format (which is friendly for exchange between systems), the new need would be to have some place to document/translate every term (concept) which is not already machine-parseable (as a date is). For example, if some "passport" can be evidence, then the term (or code) for "passport" would need to be stored (and it cannot be RDF or semantic web; this really needs to be a code, even if the code actually is... a full URL).

But even with narrow data, the constraints could either be stored together with the data (very repetitive) or sit in separate tables, with the code used to find those constraints. One reason for allowing both is that a government may be sending data (narrow data is often generated by computers, not manually) without yet having had time to translate the concept; in the worst case, the additional fields in the tabular format would explain logic the computer could validate.
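The "separate table" option could look roughly like the Python sketch below, where the code stored in each narrow row is used to find its constraints. The codes, field names, and the `CONSTRAINTS` table are hypothetical and only illustrate the lookup.

```python
# Hypothetical constraint table, kept apart from the data and keyed by code.
CONSTRAINTS = {
    "passport": {"required": ["date_of_issue", "expiry_date"]},
    "vat_number": {"required": []},
}


def missing_fields(row, constraints=CONSTRAINTS):
    """List required fields that are absent or empty in a narrow row.

    Returns None when the code has no documented constraints yet, for
    example because the sender has not published or translated it.
    """
    rule = constraints.get(row["code"])
    if rule is None:
        return None
    return [field for field in rule["required"] if not row.get(field)]


row = {"code": "passport", "value": "P-0000", "date_of_issue": "2020-01-01"}
print(missing_fields(row))
```

Rows with an unknown code can still be exchanged and stored; they simply cannot be validated until someone documents the code.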

As I said, I don't plan to have a proof of concept for this part, but a narrow data approach is easier to implement and make production-ready. The new focus would then be to make it as easy as possible for others to publish the codes they use (like the language codes used to transpose from narrow to wide on https://github.com/EticaAI/tico-19-hxltm/tree/gh-pages/data/original).


PS: Since I'm not aware of what the "One striking difference is the omission in V2 of the attribute Identifier type" in Goossenaerts' comment refers to, maybe there is something I'm missing which could be in V2 itself rather than in derived work. If that is the case, then I actually endorse any identifiers. In fact, this was one of the reasons for #39 (which was not written to be a release blocker for 2.00, since it could take time if it was decided to go for some structured numeric approach).

bertvannuffelen commented 2 years ago

@janbmgo I created an issue in adms-ap space, to relate it to a possible update of adms:Identifier.

In short, on the structural handling of the properties dct:identifier and adms:identifier (the latter pointing to a class adms:Identifier).

The first, dct:identifier, can only store the value:

_:vec1 dct:identifier "2-ABD-123"^^xsd:string .

Using the RDF typed literal approach, a little information about the value space can be provided. So instead of sharing a string, one could state

_:vec1 dct:identifier "2-ABD-123"^^belgif:vehiclelicenceplate .

where belgif:vehiclelicenceplate points to an (XSD) description of the value space of licence plates for cars. This approach is underused, and most LD tools probably wouldn't know how to handle it, but from a semantic point of view it is correct and fits many purposes.

But none of the above addresses the metadata description needs you expressed above. Those are addressed by using (often in addition) adms:identifier.

_:vec1 adms:identifier [ 
      skos:notation "2-ABD-123"^^xsd:string
] .

This offers a structure to add any necessary metadata about the identifier, for instance the party responsible for the identifier.

_:vec1 adms:identifier [ 
      skos:notation "2-ABD-123"^^xsd:string ;
      dct:creator <div@mobilit.fgov.be>
] .

adms:Identifier already provides support for

The second corresponds to your IdentifierType request; the last two address the ownership of the schema.

If this already satisfies the basis of your request, then you can add e.g. dct:issued to indicate the creation moment of the identifier. For interoperability across Europe it is good that we align the meaning of additional properties; therefore I created a new issue on it in the ADMS-AP repository.

Which one to use depends on your usage context. If the usage context determines all metadata of the identifier, dct:identifier is probably the way to go; in a broader context where this is unclear, adms:identifier is the way to go. Both can also be combined. The Core Vocabularies do not take a stand on this.

In DCAT-AP we are having webinars on identifiers. There, not only the above is being discussed, but also the expectation of whether there is one identifier for an entity, because in your explanation of the issue this is an unspoken constraint. The Core Vocabularies do not express cardinality constraints, but in practice this is a topic.

EmidioStani commented 1 year ago

Closing as managed in ADMS-AP