Named Graphs : Record vs. Dataset

chin-rcip / collections-model

Linked Open Data Development at the Canadian Heritage Information Network - Développement en données ouvertes et liées au Réseau canadien d'information sur le patrimoine

Creative Commons Zero v1.0 Universal

12 stars 1 forks source link

Named Graphs : Record vs. Dataset #45

Closed stephenhart8 closed 3 years ago

stephenhart8 commented 4 years ago

This question has quite a huge impact on the model and should be discussed in a new issue.

Theory

We will need to decide the level of granularity we want for the meta-information about our data.

A Record is a regrouping of statements pertaining to a central instance. In our model, those central instances are Actors and Artefacts. A Dataset is a regrouping of statements coming from a unique provider.

Named Graphs allows adding some semantic statements about a bunch of triples, which is the best solution to add some metadata about the provider of the information and dates of the creation and modification of that data.

Having the Named Graph on the level of the record or the dataset comes with both advantages and disadvantages.

On the Record Pros:

It allows us to add some metadata on a more granular level than the dataset

Cons:

The record level is inherited from the old paper or table dataset structure. In the Knowledge Graph world, this division in « record » does not seem to work in my opinion.
Some triples are not directly linked to a record (like statements on vocabulary terms)

On the Dataset Pros:

It allows documenting the provenance of the data (from which provider) very easily

Cons:

The granularity of the data is not great

In the TM 2.0

In the Target Model 2.0, I have adopted an in-between solution.

As it is easier to handle provenance of the data with the Named Graph on the dataset, I have opted for that solution.

But in order to document information on the record, I have added an E73 Information Object linked to the Actor to mimic the Record. Then, information about the Record will be documented on the Information Object.

With this solution, provenance is handled better, and we do not lose the granularity of the record, without the downside of regrouping triples within a structure in records.

Questions

Is my approach the best one?
Having the Named Graph only on the dataset level would create some problems?
Is it possible to have a Named Graph within other Named Graphs? It seems not.

VladimirAlexiev commented 4 years ago

@stephenhart8 One very important role of Named Graphs is that they can represent Units of Work, i.e. business-meaningful transactions that match the granularity of data flows in an aggregator.

I feel strongly that you should use a Named Graph per museum Record. This will allow museums to use the simple SPARQL Graph Protocol to post records to CHIN. Connecting this to the total museum Dataset is simple since you can use some "part of" relations between graph URLs.

You'd furthermore need a Named Graph per CHIN Record, which is the result of Entity Matching (clustering) of individual museum records, then perhaps Data Fusion to pick or aggregate values per-field. And records Matching provenance (from which museum records, why and when did we match, what algorithm was used, what score, etc)

And you may need "output graphs" for CHIN consumers: that omit all internal/bookkeeping details.

(You don't need such complex machinery for static/simpler data like thesauri. Whether you use named graphs for such simpler data is a secondary and unimportant question.)

Let's leave aside for now important questions like:

which ontologies to use
whether to mix business and bookkeeping data as I've done below (perhaps not a good ontological idea, but is convenient)
whether to use timestamps in URLs
where to attach identifiers
where to attach copyright info

I think something like this (in TRIG notation):

graph <MUS1/person/123/graph> {
  <MUS1/person/123> a Person; name "John"; birthDate "1921"; birthPlace "Montreal"
  <MUS1/person/123/graph> a Graph; 
    kind <museumData>;
    contributor <MUS1>;
    partOf <MUS1/dataset>;
    updated "2020-03-30"; # generated at MUS1
    submitted "2020-03-31"; # to CHIN
    ingested "2020-04-01".
}

graph <MUS2/person/456/graph> {
  <MUS1/person/456> a Person; name "Johnny"; birthDate "1922"; birthPlace "Montreal"
  <MUS2/person/456/graph> a Graph;
    kind <museumData>;
    contributor <MUS2>;
    partOf <MUS2/dataset>;
    updated "2020-03-30"; # generated at MUS2
    submitted "2020-03-31"; # to CHIN
    ingested "2020-04-01".
}

graph <person/900789/graph> {
  <person/900789> a Person; 
    name "John"; alias "Johnny";
    birthDate "1921"; # we trust MUS1 more
    birthPlace "Montreal".
  <person/900789/graph> a Graph;
    kind <personCluster>;
    constituent <MUS1/person/123/graph>, <MUS2/person/456/graph>;
    contributor <MUS1>, <MUS2>;
    processed "2020-04-02";
    algorithm <personMatcher-name-birthDate-birthPlace>;
    algorithmVersion "1.01";
    confidence 0.95;
    usedFields name, birthDate, birthPlace.
  # could also record per-field match confidence, Data Fusion details...
}

stephenhart8 commented 4 years ago

@VladimirAlexiev That is really interesting way of structuring the datasets.

So if I follow you, for this John of Montreal, he would have in this example 3 URIs:

person/123 from Museum 1
person/456 from Museum 2
person/900789 for CHIN, that would cluster the 2 other URIs

With this data structure, we would NOT need a named graph for the whole submitted dataset? The information of the provider and the submission dates will be at the record level?

What about the triples that do not pertain to a record, for example information on types (type/male a tape/gender)? They would not be in any Named Graph? And what about instances that would belong to multiple records, like production events where multiple actors are participating in? Should this production event be described in one of the actor records? Or in the Object record?

I'm trying to understand how those record-level Named Graphs are structured. If we chose to adopt this - and actually it could solve some of the issues we have at the moment - that will change the target model quite a lot.

Another question, that is for @Habennin, how can those named graphs be mapped with 3M? Or is it something we should add after the mapping with 3M?

VladimirAlexiev commented 4 years ago

With this data structure, we would NOT need a named graph for the whole submitted dataset?

I assume many GLAMs (at least the big ones) will implement some incremental submission method (or CHIN will implement incremental ingest), so Units of Work will be records not whole datasets.

What about the triples that do not pertain to a record, for example information on types (type/male a tape/gender)? They would not be in any Named Graph?

They can be in the CHIN per-record cluster graph <person/900789/graph>. I think maintaining graph per output record will be beneficial for your consumers. Then CHIN can implement some incremental notification/consumption methods, eg:

OAI PMH to answer which are the new/changed records since a given date
RSS or Atom feed to notify about new/changed records

Then consumers can use SPARQL Graph Protocol to get these changed records. Even if they use plain GET on each semantic entity (Person), having a named graph will simplify CHIN's life in being able to fetch all needed triples at once.

Otherwise it's highly non-trivial how to delineate/circumscribe an entity and find all its triples. We faced such problems in ResearchSpace 8-9 years ago: https://confluence.ontotext.com/display/ResearchSpace/Complete+Museum+Object

Should this production event be described in one of the actor records? Or in the Object record?

That clearly belongs to the Object (Artwork)

But still, I acknowledge that the trouble with per-entity graphs is what to do about "border" entities. Eg

A reference to a thesaurus entry: should return at least the label, so the entity graph is self-contained.
A familial relation node PC14 and all its relations: should belong to BOTH entities
- I'd even say that the label of the target (other) entity should be returned, for the same reason as "thesaurus entry"

For such cases CHIN will need some "completion logic" to fetch such shared triples. In the SPARQL 1.2 effort I've argued it should be possible to describe entities (business objects) using RDF Shapes (SHACL, ShEx): https://github.com/w3c/sparql-12/issues/39

will change the target model quite a lot

What's the nature of these changes? Does the model specify graphs already?

mapped with 3M?

I don't think your model should be influenced in any way by the limitations of any tool.

VladimirAlexiev commented 4 years ago

In particular note this from sparql-12 "we may want 3 shapes (profiles) per object:

tiny: just its label (and maybe type), for display in autocomplete lists and referencing objects. There was another issue here to return a universal "label"
short: a few fields, for display in result lists. Maybe tabular, so the fields should be as "uniform" as possible across types that may be returned together
full: all fields, plus "owned" sub-objects, plus labels of referenced objects

stephenhart8 commented 4 years ago

We have already implemented Named Graphs in our Target Model, at the Provider's Dataset level, and with the E73_InformationObject entity at the record level (see above for the diagrams). It will also change the mapping process we are testing at the moment, as we will need to adjust the conversion script to create those named graph. Also, we will need to discuss at CHIN about those edge-cases or "border" entities.

so Units of Work will be records not whole datasets

So I guess if we want to enable search by providers' dataset (for example search only the institutions of Quebec), we would then need to request all the institution record named graphs. Would there be any problems to retrieve those "border" entities?

I don't think your model should be influenced in any way by the limitations of any tool.

Indeed, I agree that the model should not be influenced by the tools, but as we are testing 3M at the moment, it is important to know if those Named Graphs features are managed by it (with dataset named graphs, it is easy to add it after the mapping in 3M).

Thank you @VladimirAlexiev for your insightful comments, it is really appreciated!

stephenhart8 commented 4 years ago

From issue #43

Note: I call "datadump" a big gob of data, eg the whole CIC dataset. It certainly needs a semantic description: VOID + DCAT2 + maybe ADMS. Do you have a issue for that? http://vocab.getty.edu/doc/#Descriptive_Information is very comprehensive but is old (Mar 2014) and misses important DCAT2 developments.

I would like to add this part in this issue, even if it will probably require its own soon. Would we need to have a NamedGraph for the whole CiC dataset to add that descriptive information? In addition to having that kind of information at the record level?

Indeed, VoID is widely used, like in the project Nomisma.org. I'm not that familiar with DCAT2, but will have a more attentive look at it

Habennin commented 4 years ago

Good morning @stephenhart8 @VladimirAlexiev,

I don't think your model should be influenced in any way by the limitations of any tool.

Absolutely.

Re the original question:

Another question, that is for @Habennin, how can those named graphs be mapped with 3M? Or is it something we should add after the mapping with 3M?

The x3ml engine implements named graphs. The present 3M GUI does not. So at present if you want to use named graph in your mapping, you can add it to the x3ml (ie via a text editor working on the x3ml) directly and the GUI will ignore it, but the engine will not.

There was a presentation on this "new" functionality at a SIG a couple of years ago, but I cannot find it on the CIDOC CRM site. Nevertheless, the fact that the feature exists is documented in their GitHub.

https://github.com/isl/x3ml/issues/51

I would therefore ping them directly to task where the documentation is, hopefully so that they can put it somewhere more visible!

Re the general named graph issue, in Parthenos project, the strategy was to put the data into a named graph per record from the contributing institution.

VladimirAlexiev commented 4 years ago

Would there be any problems to retrieve those "border" entities?

Yes, this is the fly that spoils the "graph per record" story (but when you think about record-based aggregation processing, ou need to bite this bullet). One way to go about it is to say that shared/border entities like Family Relations are in their own separate graphs, then link them to both Person graphs, eg:

graph <familyRel/graph> {
  <familyRel> a PC14; P2 <parentChild>; 
    subject <person1>; # parent
    object <person2>; # child
    # dating
    # contribution info
}

<familyRel/graph> isPartOf <person1/graph>, <person2/graph>.

It is possible that the relation is not present in any museum record, but is deduced from two such records (eg one has a full record about John Smith and a textual mention of a child "Jane Smith", the other has full records about John Smith and Jane Smith but no relation, and the data about these persons matches some historical heuristics). Our paper Semantic Archive Integration for Holocaust Research: the EHRI Research Infrastructure, Umanistica Digitale, Mar 2019 shows something similar: search for "We can create Person records for the additional names mentioned"

if we want to enable search by providers' dataset (for example search only the institutions of Quebec), we would then need to request all the institution record named graphs.

That's not hard, once you define the model.

Consider that you'd often have the same person reflected in records from multiple museums (a large point of making CIC, and the subject of Entity Matching)
It's common in GLAM datasets to record sources and contributors for every record. Eg Getty and CHIN Nomenclature do that. You may choose to record this directly in the output record, even when you don't have specific info about a contributor dataset or record (eg Getty has no such info). Above I've included contributor <MUS1>, <MUS2>. Then you can query by such prop. Of course, in many cases this will be set based on the contributor record, during aggregation processing.

(with dataset named graphs, it is easy to add it after the mapping in 3M).

Yes: Most mapping tools don't bother about named graphs, yet it's easy to add such post-factum because you'd add only a couple of graphs per "operation". In fact I think the graphs should not be part of the mapping model but part of a new "data flow" (aggregation processing) model. Whether you add a graph for a whole museum dataset or for one museum record, depends on your processing model.

When you make a separate issue "Dataset Semantic Descriptions", please copy the couple of comments there.

need to have a NamedGraph for the whole CiC dataset to add that descriptive information?

VOID mostly describes documents not named graphs. So you don't need to have a CIC graph to describe the CIC dataset and its various renditions (Distributions). VOID is more popular for RDF datasets, but DCAT is used more widely (eg all data portals like DataHub and data.gov, based on CKAN software, can expose metadata as DCAT). And DCAT has useful extra props (eg byteSize).

illip commented 4 years ago

@VladimirAlexiev wrote:

Yes: Most mapping tools don't bother about named graphs, yet it's easy to add such post-factum because you'd add only a couple of graphs per "operation". In fact I think the graphs should not be part of the mapping model but part of a new "data flow" (aggregation processing) model. Whether you add a graph for a whole museum dataset or for one museum record, depends on your processing model.

As I understand, if we receive a single XML from Museum A, we should go throught the RDF mapping process without taking care of splitting everything in graphs. So in other words, one XML will result in one RDF file. Once this is done, we would need to identify where and how to divide the information in graphs (at the dataset or the record level) and document them properly. Meaning that we should document the providers, the contributors, rights, etc. after the mapping.

@stephenhart8, correct me if I'm wrong, but on the modeling side, the only impact would be to identify the pattern to document contributors and cataloguers (even if for latter we have decided to postpone the discussion due to legal issues). For contributors, this will happen when CHIN will aggregate already aggregated content (e.g. Museum A --> Aggregator X --> CHIN). In this case, we would normally have a contributor associated to a specific record.

It might also impact the rights management (see #38).

In this case, if we go with a graph strategy at the dataset level, we need a way to track the record entities in our model. If we decide to go at the record level, this will be managing after the mapping (although we need to keep track of the contributor somehow).

Vladimir also proposed:

graph <person/900789/graph> {
  <person/900789> a Person; 
    name "John"; alias "Johnny";
    birthDate "1921"; # we trust MUS1 more
    birthPlace "Montreal".
  <person/900789/graph> a Graph;
    kind <personCluster>;
    constituent <MUS1/person/123/graph>, <MUS2/person/456/graph>;
    contributor <MUS1>, <MUS2>;
    processed "2020-04-02";
    algorithm <personMatcher-name-birthDate-birthPlace>;
    algorithmVersion "1.01";
    confidence 0.95;
    usedFields name, birthDate, birthPlace.
  # could also record per-field match confidence, Data Fusion details...

If we would like to use this graph-constituent strategy (which I find relevant), we might need to have another step to our anticipated pipeline (sorry Vladimir, another document that we are looking to publish). For the moment, we have External Reconciliation and Enrichment (step 7):

The entirety of the available data pertaining to a person (Jean Paul Riopelle in this case) across CHIN’s data will then be gathered (external reconciliation), which will generate new data and will involve semi automatic text analysis as well as conversion lists in order to create links to relevant IDs (external enrichment). This new data will include new permanent CHIN IDs for actors so that information relevant to them can be federated.

This is currently happening before the RDF mapping and, as I understand, it should happen after it in a step that should include the Graphs generation (might need two distinct steps, I don't know).

However, one thing I'm sure, we shouldn't state these info in the CHIN cluster:

name "John"; alias "Johnny";
    birthDate "1921"; # we trust MUS1 more
    birthPlace "Montreal".

The goal of CHIN is not to state information about actors but to gather it. So I would recommend to keep the data at the constituent level and never decide which info is the proper one (e.g. the birthDate in this example). One thing that CHIN could be documenting is some sameAs statements to Wikidata, ULAN, VIAF, etc.

VladimirAlexiev commented 4 years ago

if we receive a single XML from Museum A, we should go throught the RDF mapping process without taking care of splitting everything in graphs. So in other words, one XML will result in one RDF file.

I think it'll be easier for CHIN to split up the input into smaller chunks. This will probably lead to easier processing: very few RDF convertors use "streaming" processing, so it's easy to overflow their memory use with a very large input file. XML and JSON are not streaming-friendly because they have opening & closing elements, but most large processing platforms or aggregators use some chunking format:

Geonames and VIAF RDF are provided as one RDF record per text line (so the total file is NOT valid RDF)
Line-oriented JSON (ndjson) was invented for the same reason

Using individual Records as "unit of work" is also an enabler to incremental processing.

External Reconciliation and Enrichment are currently happening before the RDF mapping and, as I understand, it should happen after it in a step that should include the Graphs generation

I cannot say whether it should happen before or after, that depends on tooling that you select and the specifics of your pipeline. However, by keeping the data in a graph you open up possibilities for more advanced matching using graph features, eg by Graph Neural Networks and Graph Embeddings.

I also cannot say whether clients would be interested at all in the details of matching (bookkeeping vs business data). If you need to structure matching data, see https://ns.inria.fr/edoal/1.0/ (and http://melinda.inrialpes.fr/proposal.html). But you certainly need to keep such details because they affect the processing and reprocessing of clusters. See Managing Ambiguity In VIAF. Thomas B. Hickey and Jenny A. Toves. D-Lib Magazine, July/August 2014, Volume 20, Number 7/8. doi:10.1045/july2014-hickey.

Have you thought about how to ensure URL stability of CHIN's clusters? This is crucial to give GLAMs the certainty they need in order to use CHIN's IDs.

I feel strongly that CHIN should provide fused data. The main value of CIC will be as a single repo of all info about Canadian artists, and the main thing that consumers want would be direct access to the data about each artist. In many cases it will be possible to pick the "correct" values of single-value fields, through automated or manual processes or crowdsourcing (some data feedback mechanism).

Even if you can't pick one value over another, you can provide both at the top (aggregated data) level by putting them in appropriate CRM fields, eg

<person/900789/birth/date> a Time-Span;
  P82a "1921"; P82b "1922".

Consider the more complicated case of 3 museums submitting the dates 1921, 1922, 1923. Certainly you'll do your consumer a favor if you sort them out and provide this:

<person/900789/birth/date> a Time-Span;
  P82 "1921", "1922", "1923"; # optional
  P82a "1921"; P82b "1923".

It'll be much harder if consumers need to dig through individual museum records...

illip commented 3 years ago

CHIN considers this issue as being the most important one to resolve before starting the implementation of our first datasets in order to assess our pipeline. Thus, we tried to break down all the arguments in favor of a Named Graph per record or per dataset. We hope that we have represented well everyone's ideas. If it is not the case, feel free to comment in this thread and we will update this post. We didn't do the pros and cons to avoid annoying duplicates. We hope to run some tests regarding Named Graphs mid-September 2020.

1. Arguments in favor of one named graph per record

1.1. Museums could use simple SPARQL Graph Protocol to post record to CHIN (@VladimirAlexiev ) 1.2. Simple to connect the records to a dataset with some sort of part_of (@VladimirAlexiev) 1.3. Would allow having a CHIN's record to aggregate the content (@VladimirAlexiev) 1.4. Would allow to detail the Matching Provenance (score, algorithm, when, which museum records) (@VladimirAlexiev) 1.5. Would be useful to get output graphs that omit all internal/bookkeeping details (@VladimirAlexiev)
1.6. An enabler to incremental processing (@VladimirAlexiev) 1.7. RDF Shapes could be used to manage border entities (@VladimirAlexiev)

2. Arguments in favor of one named graph per dataset

2.1. Metatypes E55 -> p2 -> E55 would be easier to manage since they don't really reside in a specific record (@stephenhart8). For instance, border entities could create duplicates in different record graph to keep them readable (@VladimirAlexiev) 2.2. The record structure is arbitrary, and not suitable for a knowledge graph where all the nodes can become the entry point. For instance, the birth event of someone is not within the record of an actor, it is an event that can be linked to other events. (@stephenhart8 and @Flutifioc) 2.3. Seems easier to manage the provenance (@stephenhart8)

3. Other considerations where the two options offer different approaches

3.1. Retrieval of border entities could be quite straightforward with a dataset approach (@stephenhart8), however, it is doable with records using dedicated graphs and dedicated contributor's properties (@VladimirAlexiev) 3.2. Searching in a specific stakeholder's dataset is straightforward with a dataset (@stephenhart8) but could also be done using a statement linking the record to the stakeholder URI (@VladimirAlexiev). 3.3. With the record approach, it would be difficult (even impossible) to describe the whole dataset of an institution since everything will be nested (@stephenhart8). This might be not useful since the subgraphs description should be enough (@VladimirAlexiev)

General questions:

4.1. When in our pipeline we should create our Named Graphs? (@illip)

4.1.1. Graphs shouldn't be part of the mapping process but in another workflow (aggregation processing) (@VladimirAlexiev) 4.1.2. It seems like the split into smaller chunks should be done prior to the mapping (@VladimirAlexiev) 4.1.3. The x3ml engine implements named graphs. The present 3M GUI does not. So at present if you want to use named graph in your mapping, you can add it to the x3ml (ie via a text editor working on the x3ml) directly and the GUI will ignore it, but the engine will not. (@Habennin)

4.2. Which model(s) should we use to describe our graphs/datasets? (@stephenhart8)

4.2.1. VOID is more popular for RDF datasets, but DCAT is used more widely (eg all data portals like DataHub and data.gov, based on CKAN software, can expose metadata as DCAT). And DCAT has useful extra props (eg byteSize). (@VladimirAlexiev)

VladimirAlexiev commented 3 years ago

1.6. Is very important. I can't see how CHIN can implement an efficient aggregator if a contributor institution can't update the data of a single artist (not it's full dataset).

4.2.1. Use both, see Getty LOD

illip commented 3 years ago

After discussion with our Semantic Committee, CHIN has decided to go with one Named Graph per dataset. The main reasons are:

Based on the preliminary results of our survey sent to the CRM-SIG members, it seems like many of them are using NG at the dataset level.
The identification of the record's borders is an arbitrary process that might be complicated to understand for an external user.
It seems doable to update a specific record with the proper SPARQL query without having a Named Graph per record.

This discussion also brings the question of the institutional URIs that we will cover in the URI Policy/Framework later on (Issue #43).

illip commented 3 years ago

All the aforementioned items have been added to the Target Model (Provenance of the dataset)