chin-rcip / collections-model

Linked Open Data Development at the Canadian Heritage Information Network - Développement en données ouvertes et liées au Réseau canadien d'information sur le patrimoine
Creative Commons Zero v1.0 Universal

Do we need to make a difference between unknown data and absent data? #22

Open stephenhart8 opened 4 years ago

stephenhart8 commented 4 years ago

In museum datasets, some information is simply missing even though we know it exists somewhere, like someone's birth date. But sometimes the information does not exist at all, like the death date of someone who is still alive.

The problem is that in our model, there is no way to distinguish between these two kinds of missing information. If someone's death date is missing from our dataset, it is not possible to know whether we just haven't fully documented the record or whether the information does not exist.

KarineLeonardBrouillet commented 4 years ago

I am not sure what you mean by unknown and absent data: do we mean differentiating between information that we know exists but is absent from the dataset (i.e. the cataloguer looked for the information but could not find it, yet knows it is available somewhere) and information we know is not available anywhere (i.e. the cataloguer looked in other datasets as well as offline documentation and established that a piece of information is not available)?

If so, I think it might be useful: if Museum A searched exhaustively for a piece of information, another institution might trust their work and prefer to focus on tracking other data that would be as impactful but more readily available. I think it would be useful in the context of crowdsourcing as well, since it would flag the data that would be most interesting to investigate.

Habennin commented 4 years ago

It's very difficult to represent in an open world. This was recently discussed at the CRM SIG and Linked Conservation Data workshops. It's hard to represent this kind of data in present-day information systems.

eecanning commented 4 years ago

It seems that we would be relying on the source museum/institution to note whether the information is unknown or absent, yes? I don't know what kind of barrier to participation this would introduce if we were to add this kind of review as a requirement. Do you have a particular example or use case in mind that we could use to illustrate the problem and conceptually test out solutions?

VladimirAlexiev commented 4 years ago

In LOD you usually don't state "I don't know", because such statements are non-monotonic under the open-world assumption (OWA), and because what we don't know is infinite :-).

So I would not go for some generic "missing value" patterns. Wikidata has novalue and unknown but their use is a bit controversial https://phabricator.wikimedia.org/T239414

If you know someone died but not when, make a Death event without TimeSpan.
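A minimal sketch of that pattern, assuming CIDOC CRM class and property names (E69 Death, P100 was death of, P4 has time-span, E52 Time-Span, P82 at some time within). The `http://example.org/` namespace, the URI layout, and the `death_triples` helper are hypothetical and only illustrate the idea; triples are plain tuples rather than objects from a real RDF library.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
CRM = "http://www.cidoc-crm.org/cidoc-crm/"
EX = "http://example.org/"  # hypothetical namespace, for illustration only


def death_triples(person_id, death_year=None):
    """Return (subject, predicate, object) triples for a known death.

    The Death event is always asserted. A Time-Span is attached only
    when a year is actually known; otherwise nothing is said about when.
    """
    person = f"{EX}person/{person_id}"
    death = f"{EX}death/{person_id}"
    triples = [
        (death, RDF_TYPE, CRM + "E69_Death"),
        (death, CRM + "P100_was_death_of", person),
    ]
    if death_year is not None:
        span = f"{EX}timespan/{person_id}-death"
        triples += [
            (death, CRM + "P4_has_time-span", span),
            (span, RDF_TYPE, CRM + "E52_Time-Span"),
            (span, CRM + "P82_at_some_time_within", str(death_year)),
        ]
    return triples


# Death known, date unknown: the event exists but carries no Time-Span.
undated = death_triples("p1")

# Death known and dated: the same event plus a Time-Span.
dated = death_triples("p2", death_year=1900)
```

For a person who is still alive, or whose death has simply not been documented, no Death event is emitted at all, so the two cases stay indistinguishable under the open-world assumption, which is the limitation raised at the top of this issue.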

TrangDg commented 4 years ago

I came across an example of this issue while working on the MAC Artistes dataset: for "Levy, Albert" (French photographer), the museum left his Death Place blank but entered "inconnue" ("unknown") for his Death Date. He was born in 1864, so we know for certain that he has passed away by now.
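To make the difference between a blank cell and an explicit "inconnue" concrete, here is a rough sketch of how the two could be told apart when reading source records. The status names, the `UNKNOWN_MARKERS` set, and the `classify_source_value` function are assumptions for illustration and are not part of the MAC dataset or of our model.

```python
from enum import Enum


class SourceValueStatus(Enum):
    NOT_STATED = "not stated"          # cell left blank by the museum
    STATED_UNKNOWN = "stated unknown"  # museum explicitly wrote "unknown"
    VALUE_GIVEN = "value given"        # an actual value was entered


# Assumed spellings of "unknown"; a real dataset may use others.
UNKNOWN_MARKERS = {"inconnue", "inconnu", "unknown"}


def classify_source_value(raw):
    """Classify one raw cell from a source record."""
    value = (raw or "").strip()
    if not value:
        return SourceValueStatus.NOT_STATED
    if value.casefold() in UNKNOWN_MARKERS:
        return SourceValueStatus.STATED_UNKNOWN
    return SourceValueStatus.VALUE_GIVEN


# The "Levy, Albert" record as described above.
death_place_status = classify_source_value("")
death_date_status = classify_source_value("inconnue")
```

Here the blank Death Place comes out as `NOT_STATED` and the Death Date as `STATED_UNKNOWN`, so the museum's explicit statement is at least not lost before mapping.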

stephenhart8 commented 4 years ago

I've read some interesting research by Antike Fundmünzen in Europa about managing uncertain data, published at the CAA:

They propose multiple solutions, including:

I will add those references to Zotero.

VladimirAlexiev commented 4 years ago

@stephenhart8 thanks for the third reference!

I think you should make a separate issue on uncertainty and Attribution Qualifiers. See

emchateau commented 4 years ago

Although the open-world argument is strong, in the case of authority data the distinction between unknown and absent information can make sense.