chin-rcip / collections-model

Linked Open Data Development at the Canadian Heritage Information Network - Développement en données ouvertes et liées au Réseau canadien d'information sur le patrimoine
Creative Commons Zero v1.0 Universal
Mapping the Museums' vocabulary to the TM and SPS #8

Open chin-rcip opened 4 years ago

chin-rcip commented 4 years ago

Heather Dunn 16.08.2019

A problem we faced during the mapping to the Reference Model: we did not know what to do when the museum data contained a whole hierarchy of object terms/classifications that does not quite correspond to either Nomenclature or AAT. For example, the CMST data had a record with:

  • Object Name: Wagonette
  • Group: Non-Motorized Ground Transportation
  • Category: Animal Powered
  • Subcategory: Wheeled Vehicles

My initial thinking is that we would only map the most specific level (Wagonette) to Nomenclature or AAT, and leave it at that. This would correspond with the thinking for locations, where we only map the most specific level and link to the geo name authorities to fill in the whole hierarchy. In a similar way, we have the whole Nomenclature hierarchy to fill in the higher levels for the most specific object name we find in the museum data.

But I think the museum’s categorization (often home-grown, even though usually based on some version of Nomenclature) should also be mapped and present in the record, even if we only need the most specific term to link to the authority. We need the museum’s categorizations to understand the context of the most specific term, in order to validate our mapping to Nom/AAT. But it wasn’t clear to me how to represent that in the RDF. Would we use multiple occurrences of the same property (Object Type) where we put “wagonette”, so that “Wheeled Vehicles” etc. would be Object Types too? But then how do we distinguish which is the most specific term (the Wagonette)? Can we add “p2HasType” types to the Object Type? Or is there some other way to represent the hierarchy that the museum has represented in their record?

This also makes me wonder about the other fields (like location) where a hierarchy may exist in the museum data but we are only including their most specific location data and mapping that to geo name authorities. Don’t we really need the museum’s broader location data to disambiguate, so that we can properly map it to the geo name authorities? E.g. we map “London” as their most specific location, but we need to know whether this is Ontario or England, and we need to go back to their broader geo data for that. So does this mean we need to accommodate the museum’s various hierarchies for locations and for classifications, as represented in their data? Just for purposes of disambiguation and validation? Even though we will link the most specific level to our own authorities that have their own hierarchies…

Stephen Hart 16.08.2019

Indeed, the museum’s vocabulary should be documented in some way, but we also do not want to manage dozens of different vocabularies… Maybe it’s something that should be discussed with the museum in question?
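To make the "London" question above concrete, here is a minimal Python sketch of how the museum's broader terms could be used only to pick between candidate authority records, without being stored as a hierarchy of their own. The candidate records, their identifiers and the `disambiguate` helper are all hypothetical, for illustration; they are not taken from Geonames, TGN or any CHIN tooling.

```python
def disambiguate(term, broader_terms, candidates):
    """Return ids of candidates whose label equals `term` and whose
    ancestors include every broader term found in the museum record."""
    wanted = {b.casefold() for b in broader_terms}
    matches = []
    for candidate in candidates:
        if candidate["label"].casefold() != term.casefold():
            continue
        ancestors = {a.casefold() for a in candidate["ancestors"]}
        if wanted <= ancestors:  # all museum-supplied context is satisfied
            matches.append(candidate["id"])
    return matches


# Hypothetical authority records (placeholder ids, not real identifiers).
candidates = [
    {"id": "authority:london-1", "label": "London",
     "ancestors": ["England", "United Kingdom"]},
    {"id": "authority:london-2", "label": "London",
     "ancestors": ["Ontario", "Canada"]},
]

# With the museum's broader location, one candidate remains.
print(disambiguate("London", ["Ontario"], candidates))
# Without it, the most specific term alone stays ambiguous.
print(disambiguate("London", [], candidates))
```

The same idea would apply to object names: the museum's Group/Category/Subcategory values act as context for matching, while only the most specific term is linked to the authority.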

Philippe Michon 19.08.2019

The vocabulary questions won’t be easy ones, but I think it’s important to relate, as much as possible, to external LOD-compliant controlled vocabularies. The best process would be to suggest a vocabulary (possibly more than one) for a specific field. In that case, it is not necessary for us to manage all the terms, since the vocabulary already does it.

On the other hand, the museum might want to keep its tailor-made hierarchy. This is where the fun begins.

  • The best option would be for them to develop a LOD vocabulary for their terms. If so, we could include only the most specific term. That said, if we develop some applications, they will be based on our chosen vocabularies, so the museum won’t get all of our features.
  • Another option would be to include all the terms in what we are calling “Linguistic Object fields”. You didn’t see those fields on Thursday, but the idea is to provide a way to manage messy data in our model. The information will be recorded but won’t be reusable.
  • The last option would be to manage the custom hierarchy in our model with p127_has_broader_term. After reflection, I don’t believe it’s a good idea, since it won’t bring more interoperability (no one except the museum will use this hierarchy). In addition, this pattern won’t be easy to manage in our RM.

For locations, we wanted to avoid exploring Geonames and TGN (next time!). In fact, we won’t reduce the data to only the smallest units; we will do the reconciliation using the full “hierarchy”.

Heather Dunn 19.08.2019

I like that idea of the “linguistic object fields” to keep the messy data from the museum’s own (possibly home-grown, non-standard) broader-level object classifications. That way we can still use those broader levels to disambiguate the museum’s most specific term (which we will map to standard thesauri), but we don’t have to manage every museum’s home-grown custom hierarchy.

I remembered that Sheila and I asked this question to Rob Sanderson and David Newbury at the Getty last year: what do you do in instances where the hierarchy used in the museum data is not the same as the one in your preferred authority? I looked back at our notes from that meeting, and they advised:

  • Link from the most specific level of the museum data to wherever that fits within the preferred authority.
  • If the museum data is more specific than the authority, model it as a narrower concept of an existing concept in the authority (and potentially add it to the authority at some point if warranted).

KarineLeonardBrouillet commented 4 years ago

I like the idea of the Linguistic Object Field as well. The only thing I would be worried about is an institution using non-linguistic information. Would that be possible? I have not seen it, apart from geotagging for example, but we would already be able to manage that.

KarineLeonardBrouillet commented 4 years ago

A quick sketch to illustrate the point Heather is making above (the blue lines indicate the museum and its hierarchy, while the green would be the standard authorities we are using):

[Image: IMG_20191106_140651]

Habennin commented 4 years ago

With types, where significant work may have gone into the local representation, it can be interesting to just keep the original data and then decide which one to target for creating a standard field for searching on. You can make sure that these types are distinguishable from the ones you want to use for searching data etc. by making use of the p71 listed in property where you say that this type comes from this vocabulary. In the context of your search interfaces you can ignore such types and/or present them lower down in the hierarchy. But if a field is used as a type and you want to keep it, you should model it as a type.

stephenhart8 commented 4 years ago

So if I understand your comment, @Habennin, we should do it as in the following diagram?

[Image: CiC_Issue8-example1]

The object should be linked to both E55 Types, each differentiated with the property P71i listed in?
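As a rough illustration of this pattern, the following Python sketch builds the triples for one object carrying two E55 Types, each tied to its vocabulary through P71i. It uses plain tuples rather than an RDF library; the `https://example.org/` URIs, the placeholder AAT concept and the helper names are assumptions made for the example, and the property spelling `P71i_is_listed_in` follows the CIDOC CRM RDF naming.

```python
CRM = "http://www.cidoc-crm.org/cidoc-crm/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "https://example.org/"  # hypothetical base URI, not a CHIN namespace


def type_triples(obj, concept, vocabulary):
    """Link obj to a concept (E55 Type) and the concept to its vocabulary."""
    return [
        (obj, CRM + "P2_has_type", concept),
        (concept, RDF_TYPE, CRM + "E55_Type"),
        (concept, CRM + "P71i_is_listed_in", vocabulary),
        (vocabulary, RDF_TYPE, CRM + "E32_Authority_Document"),
    ]


def types_from(triples, obj, vocabulary):
    """Return the types of obj that are listed in the given vocabulary."""
    typed = {o for s, p, o in triples
             if s == obj and p == CRM + "P2_has_type"}
    listed = {s for s, p, o in triples
              if p == CRM + "P71i_is_listed_in" and o == vocabulary}
    return sorted(typed & listed)


wagon = EX + "object/1"
triples = (
    # The museum's own term, kept as a type from the museum's vocabulary.
    type_triples(wagon, EX + "museum-x/type/wagonette", EX + "vocab/museum-x")
    # The reconciled concept (placeholder URI, not a real AAT identifier).
    + type_triples(wagon, EX + "aat/wagonette-placeholder", EX + "vocab/aat")
)

# A search interface could restrict itself to the recommended vocabulary.
print(types_from(triples, wagon, EX + "vocab/aat"))
```

Filtering on the vocabulary is what lets a search interface ignore the local types or rank them lower, as suggested above.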

VladimirAlexiev commented 4 years ago
illip commented 4 years ago

Hi all,

So first of all, I think we all agree that in the best-case scenario, each museum would be using standardized LOD vocabularies or, at least, managing their own vocabularies with SKOS. In this case, we could simply ingest the most specific term, because it would be possible to recover the meaning of this term by fetching the corresponding hierarchy. So we could easily have two P2_has_type --> E55_Type: one for CHIN's recommended concept and one for Museum X's. I also agree that we need a way to distinguish both vocabularies in order to apply the proper search. So skos:inScheme or P71_listed seem to work (we just need to decide which one is best).

Job's done.

However, in real life, unfortunately this won't happen often (probably never in the coming years). So we need a way to keep track of non-URI concepts and their hierarchy. I don't know if @stephenhart8's representation is correct but if so, it doesn't allow us to represent clearly the Museum X's homemade hierarchy. However, I'm not sure if @Habennin's entire idea is covered in this schema.

@VladimirAlexiev, I might need more explanations to understand your proposal:

  • hi Heather! In such cases you may have to record both the museum classification, and a derived Nomenclature classification

What do you mean by a derived Nomenclature classification? You would record the relevant part of Museum X's concept hierarchy (e.g. Wagonette, Non-Motorized Ground Transportation, Animal Powered and Wheeled Vehicles) and reconcile each concept to AAT terms, for instance?

  • in straightforwardly mapped situations you can afford to record only the Nomenclature concept

Agree, but I don't know if this will be a common situation and we will need to develop a mechanism to identify the similarities automatically.

  • but still need to record all levels, unless the hierarchies match perfectly

Agree.

  • put each concept in P2, and you can use Attribute Assignment or the new PC classes to record the level of classification

This is where you lost me: so you would add all the stated concepts in our graph using P2_has_type, and the link between each of them would be a level attribution through an E13_Attribute_Assignment? I'm not sure how you envision using PC classes in this case. Why don't we go with p127_has_broader_term or skos:broader directly? Because we want to keep track of the category labels (e.g. group, category, etc.)?

That said, I'm not sure if it's a good idea to define URIs for those external terms in order to keep track of them. I say that because we won't document the whole museum X's hierarchy but just some parts of it depending on specific object descriptions. One thing is sure, CHIN doesn't want to maintain a bunch of parallel hierarchies and one day, if the museum decided to create URIs or reuse the Getty's vocabularies, these URIs would become useless. In brief, it seems to be a lot of work for unclear benefits.

  • you might also have to record a text field "named as" or "stated as" . Wikidata and Getty CONA have such fields. But this is in addition to the concept! I think that using a pure label (Linguistic annotation) is not a good idea

It seems like you would recommend reconciling the museum's hierarchy to the AAT and then adding a statement to express how the institution has named this concept. So in our data, it will be the AAT URI with the specific label from the institution? Am I right?

Thanks for your input :)

VladimirAlexiev commented 4 years ago

each museum would be using standardized LOD vocabularies or, at least, managing their own vocabularies with SKOS.

That's an overly idealistic point of view. I think that a major mission of CHIN should be to reconcile museum thesauri against established vocabs (NOM, AAT, LCSH, Iconclass, CONA Iconography...) and as a result grow these thesauri (in particular NOM). Getty does the same with AAT, TGN and ULAN: every time they take some museum data, they first map concepts to these thesauri and grow them when warranted.

I think that every time a museum uses ANY thesaurus consistently, CHIN should take that as a favor and incorporate that thesaurus aggressively. If a museum concept is used in 100 Artworks, the work of reconciling the concept will be leveraged 100x when processing those artworks. The alternative is to have a mess of free-text fields that hinder searchability (we did detailed work on Artefacts Canada Data Analysis, trying to match them to a bunch of thesauri... the results are very mixed).

Whether a museum thesaurus is in SKOS is the least problem: the more effort-intensive part is the reconciliation. Thus CHIN needs to establish and nurture data flows for sharing that work with museums. And appropriate tooling, whether it'll be Wikidata Mix-n-Match or GVB Cocoda, or something else.

CHIN doesn't want to maintain a bunch of parallel hierarchies

Agree! But you have to help along the museums towards using global classifications

one day, if the museum decided to create URIs or reuse the Getty's vocabularies, these URIs would become useless.

Agree. But how to keep the data in the interim?

derived Nomenclature classification?

Yes, I mean the result of recon against NOM or AAT.

Whether to reconcile each of the museum parent terms or only the leaf term is a very good question that has to do with hierarchy conformance between thesauri (they never do) and compound (pre-coordinated) terms vs using multiple terms at the object.

Depending on these answers, you may have to add more neutral or more compound terms to AAT or NOM, and apply more or fewer terms at the object.

So in our data, it will be the AAT URI with the specific label from the institution? Am I right?

Yes. To express such details (local label, level) you need to reify the P2 property, which you can do with E13_Attribute_Assignment or PC2 (if PC2 exists in CRM).

skos:broader sort of does the job of "level" but it should only be used in the thesaurus, not while applying a term to an object.
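A minimal Python sketch of the E13 variant follows, using plain tuples instead of an RDF library. P140, P141 and P177 are E13 properties in CIDOC CRM 7.x. Where the local label and the level are attached (here P3_has_note and P2_has_type on the E13 node) is an assumption made for illustration, not something settled in this thread, and all `https://example.org/` URIs are hypothetical.

```python
CRM = "http://www.cidoc-crm.org/cidoc-crm/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "https://example.org/"  # hypothetical base URI, not a CHIN namespace


def reified_type(assignment, obj, concept, local_label, level):
    """Describe 'obj P2_has_type concept' through an E13 node that can
    carry the museum's own label and classification level."""
    return [
        # The plain shortcut statement, still usable for simple queries.
        (obj, CRM + "P2_has_type", concept),
        # The reification of that statement.
        (assignment, RDF_TYPE, CRM + "E13_Attribute_Assignment"),
        (assignment, CRM + "P140_assigned_attribute_to", obj),
        (assignment, CRM + "P141_assigned", concept),
        (assignment, CRM + "P177_assigned_property_of_type",
         CRM + "P2_has_type"),
        # Assumptions: the local label as a note (plain string literal),
        # the museum's level ("Subcategory", etc.) as a type of the E13.
        (assignment, CRM + "P3_has_note", local_label),
        (assignment, CRM + "P2_has_type", level),
    ]


triples = reified_type(
    assignment=EX + "object/1/assignment/1",
    obj=EX + "object/1",
    concept=EX + "aat/wheeled-vehicles-placeholder",  # not a real AAT URI
    local_label="Wheeled Vehicles",
    level=EX + "level/subcategory",
)
for triple in triples:
    print(triple)
```

The thesaurus itself would still hold skos:broader between concepts; the E13 node only records how the term was applied to this particular object.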

stephenhart8 commented 3 years ago

During the test phase of the 2.0 target model, a dataset containing non-hierarchical, non-LOD home-made vocabularies forced us to confront how vocabularies should be managed within the project.

It appeared that the management of provider's vocabularies is beyond the scope of the current project. The main reason for that is that CHIN doesn't have the resources to analyse and reconcile a multitude of different vocabularies.

Therefore, a simple solution must be chosen that allows us to use an external vocabulary for consistency and still document the provider's vocabulary.

Two cases can be found (we're putting aside the question of the hierarchy):

LOD Provider's Vocabulary

In the first case, it is pretty easy, as the provider's term is then an E55 Type. Another E55 Type for the AAT concept (or a concept from any other external vocabulary) is also added and linked to the typed entity with the property p2_has_type. This reconciliation is easier if the provider's vocabulary has links to external vocabularies. If that is not the case, a reconciliation with an external vocabulary can be made based on the labels of the provider's vocabulary terms and other contextual information.

[Image: LOD_Vocab]

Non-LOD Provider's Vocabulary

In the second case, because the provider's vocabulary is not in LOD, it is mandatory to add an external vocabulary to give the data a better structure. Nonetheless, it is important that the provider's term is preserved in the RDF dataset. The solution adopted by CHIN for the time being is to use the "Mapping problems and E33 Linguistic Object" pattern: an E33_Linguistic_Object is linked to the typed entity with the property p67_refers_to. This E33_Linguistic_Object is then typed as a "Type Statement", and its p190_symbolic_content is the term from the provider's vocabulary.
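The pattern can be sketched in Python as follows, again with plain tuples and a small N-Triples-style printer rather than an RDF library. The `https://example.org/` URIs, the URI chosen for the "Type Statement" type and the placeholder external concept are assumptions made for the example. The property is written `P190_has_symbolic_content`, its full CIDOC CRM name.

```python
CRM = "http://www.cidoc-crm.org/cidoc-crm/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "https://example.org/"  # hypothetical base URI, not a CHIN namespace


def type_statement(statement, entity, provider_term):
    """Keep a provider's non-LOD term as an E33 Linguistic Object
    that refers to the typed entity."""
    return [
        (statement, RDF_TYPE, CRM + "E33_Linguistic_Object"),
        (statement, CRM + "P2_has_type", EX + "type/type-statement"),
        (statement, CRM + "P67_refers_to", entity),
        (statement, CRM + "P190_has_symbolic_content", provider_term),
    ]


def to_ntriples(triples):
    """Serialize triples, treating non-http(s) strings as plain literals."""
    def term(value):
        if value.startswith(("http://", "https://")):
            return f"<{value}>"
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)


wagon = EX + "object/1"
triples = [
    # External concept chosen during reconciliation (placeholder URI).
    (wagon, CRM + "P2_has_type", EX + "external-vocab/wagonette-placeholder"),
]
# The provider's own term is preserved as text, not as a concept.
triples += type_statement(EX + "object/1/type-statement/1", wagon, "Wagonette")
print(to_ntriples(triples))
```

Only the external concept is a typed, searchable E55 Type here; the provider's term survives as the symbolic content of the linguistic object.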

The reconciliation between the provider's vocabulary and an external one would also be based on the term used by the provider, with the help of contextual information if present. The idea is therefore not to translate the structure or hierarchy of the provider's vocabulary, but simply to try to find the proper match in an external vocabulary.

[Image: nonLOD_Vocab]