DINA-Web / dina-model-concepts

Repository containing information to define data model boundaries
MIT License

Allow merge of many records into one canonical record #12

Open dshorthouse opened 4 years ago

dshorthouse commented 4 years ago

As a user, I will see many seemingly independent agent records, and among these I will see that some are alternate representations of the same entity (e.g. John R. Smith vs John Smith). I would like the capacity to merge instances like these into a single destination record of my choosing. This means that values in identical fields across the agent records being merged are collapsed, and all incoming links are re-attributed to the destination record I chose. I would like multi-entry fields to be deduplicated upon concatenation. In the event of a conflict in the collapse of single-entry fields, I would like the merge to cease with an error telling me the reason, so that I can manually reconcile (i.e. make values identical) the differences that caused the error(s) to be thrown.
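
A minimal sketch of what those merge semantics could look like, assuming a hypothetical record shape (plain dicts with invented field names; nothing here comes from the DINA model itself):

```python
class MergeConflictError(Exception):
    """Raised when single-entry fields disagree, so the user can reconcile manually."""


SINGLE_ENTRY = ("given_name", "family_name")  # hypothetical single-entry fields
MULTI_ENTRY = ("aliases", "emails")           # hypothetical multi-entry fields


def merge_agents(destination: dict, sources: list[dict], links: list[dict]) -> dict:
    """Collapse sources into destination, re-attributing incoming links."""
    # Pass 1: detect conflicts in single-entry fields before changing anything,
    # so a failed merge leaves every record untouched.
    conflicts = [
        f"{field}: {destination[field]!r} vs {record[field]!r}"
        for record in sources
        for field in SINGLE_ENTRY
        if destination.get(field) and record.get(field)
        and destination[field] != record[field]
    ]
    if conflicts:
        # Cease the merge with the reasons, as the user story requests.
        raise MergeConflictError("; ".join(conflicts))
    # Pass 2: collapse identical single-entry fields, fill in missing ones,
    # and dedupe multi-entry fields upon concatenation (first-seen order).
    for record in sources:
        for field in SINGLE_ENTRY:
            if record.get(field) and not destination.get(field):
                destination[field] = record[field]
        for field in MULTI_ENTRY:
            combined = destination.get(field, []) + record.get(field, [])
            destination[field] = list(dict.fromkeys(combined))
    # Pass 3: re-attribute every incoming link to the chosen destination.
    source_ids = {record["id"] for record in sources}
    for link in links:
        if link["agent_id"] in source_ids:
            link["agent_id"] = destination["id"]
    return destination
```

Checking for conflicts before mutating anything is one way to honour the requirement that a conflicted merge cease with an error rather than leave records half-merged.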

falkogloeckler commented 4 years ago

I'd suggest that an agent record should have one to many names. Reasons could be regular name changes (e.g. after marriage) or different ways of writing the name (which @dshorthouse mentioned above). This is already addressed in DINA-Web/dina-model-concepts#5, but it should not be conflated with merging records.

falkogloeckler commented 4 years ago

Originally independent agent records (referring to the same person or organization, but written differently and thus treated as separate instances during import or data entry) should be associated via isSameAs relations rather than an actual merge. This would preserve the step of interpretation, which is semantically different from adding alternative names to a record. The latter would be something like an alsoKnownAs statement.
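
To make the distinction concrete, here is a rough sketch using invented type names (not the actual DINA schema): alsoKnownAs lives *inside* a record as alternative names, while isSameAs is a separate, interpreted assertion *between* two records that both continue to exist.

```python
from dataclasses import dataclass, field


@dataclass
class AgentName:
    value: str
    language: str | None = None  # language attribute for aliases


@dataclass
class Agent:
    id: str
    names: list[AgentName] = field(default_factory=list)          # one to many names
    also_known_as: list[AgentName] = field(default_factory=list)  # alsoKnownAs


@dataclass
class IsSameAsAssertion:
    subject_id: str
    object_id: str
    asserted_by: str  # who made the interpretation, preserving that step


smith_a = Agent("A1", names=[AgentName("John R. Smith", "en")])
smith_b = Agent("A2", names=[AgentName("John Smith", "en")])
# The records stay independent; the interpretation is recorded on the edge.
link = IsSameAsAssertion(subject_id="A1", object_id="A2", asserted_by="a curator")
```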

dshorthouse commented 4 years ago

> I'd suggest that an agent record should have one to many names.

Indeed, this is definitely a requirement (plus a language attribute for those aliases). In practice, especially during workflows such as high-throughput digitization, we do not want the onus to be on the person doing the transcription to correctly interpret the identity of a collector. What that may mean is a need for a dirty bucket of Agent strings that have yet to be reconciled with, and subsequently flagged as aliases (= merged) of, other Agent entries. However, this can quickly get out of hand, as is often the case in EMu: the dirt tends to persist indefinitely. When there is spillage of agents into other modules or components of modules (e.g. determination histories), we'll have to decide whether, at the moment an agent as a human or organization (other types?) requires entry, that necessitates a new entry when none exists or a link to an existing entry in the Agent module. In EMu's case, there tend to be hundreds of entries of, e.g., M. Smith because each is tied to a different object in the system.

The isSameAs vs alsoKnownAs distinction is an interesting one. The former assumes a verifiable identity, whereas the latter could be little more than a pointer to an entry in a dirty bucket of aliases. I'd argue that isSameAs makes little sense as metadata on an edge unless there is also an identifier that can be resolved.

It might be useful for us to see what's happening on Wikidata. Here's Alexander von Humboldt: https://www.wikidata.org/wiki/Q6694. There is a canonical label at the very top, lists (flat?) of language-dependent aliases, each with a single label (= alsoKnownAs), and a bunch of external identifiers at the bottom of the page (= isSameAs). And then, a whole slew of demographic properties like birth date, whose use and allowed selections are tightly coupled with instanceOf = human. That is, you cannot use properties like birth date for instanceOf = organization.
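
As a toy illustration of that coupling (property and type names invented here, not taken from Wikidata or DINA), the set of allowed properties can be driven by the instanceOf value:

```python
# Which demographic properties each kind of agent may carry (illustrative only).
ALLOWED_PROPERTIES = {
    "human": {"birth_date", "death_date", "given_name"},
    "organization": {"founding_date", "dissolution_date"},
}


def invalid_properties(instance_of: str, properties: dict) -> list[str]:
    """Return the properties that are not allowed for this type of agent."""
    allowed = ALLOWED_PROPERTIES.get(instance_of, set())
    return [p for p in properties if p not in allowed]


# birth date is fine for a human but rejected for an organization:
assert invalid_properties("human", {"birth_date": "1769-09-14"}) == []
assert invalid_properties("organization", {"birth_date": "1769-09-14"}) == ["birth_date"]
```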

So...

I think we can assume that there will be dirt in the Agent module, and that we'll need utilities to merge entries whereby an entry becomes an alias of another. But the mechanics of what merge means will undoubtedly be contextual and will evolve over time as modules become more interwoven and linked.
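
One conceivable mechanic for "an entry becomes an alias of another" is a redirect: the merged entry is kept as a pointer at the canonical one, so existing links still resolve. Purely a sketch, with invented field names:

```python
# A tiny in-memory stand-in for the Agent module (hypothetical shape).
AGENTS = {
    "A1": {"id": "A1", "label": "John R. Smith", "redirect_to": None},
    "A2": {"id": "A2", "label": "John Smith", "redirect_to": None},
}


def merge_as_alias(alias_id: str, canonical_id: str) -> None:
    """Turn alias_id into a redirect to canonical_id, keeping its label as an alias."""
    alias, canonical = AGENTS[alias_id], AGENTS[canonical_id]
    canonical.setdefault("aliases", []).append(alias["label"])
    alias["redirect_to"] = canonical_id


def resolve(agent_id: str) -> dict:
    """Follow redirects so incoming links keep working after a merge."""
    agent = AGENTS[agent_id]
    while agent["redirect_to"]:
        agent = AGENTS[agent["redirect_to"]]
    return agent


merge_as_alias("A2", "A1")
assert resolve("A2")["id"] == "A1"  # old references to A2 still land somewhere
```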

falkogloeckler commented 4 years ago

> In practice, especially during workflows such as high-throughput digitization, we do not want the onus to be on the person doing the transcription to correctly interpret the identity of a collector.

I agree.

> What that may mean is a need for a dirty bucket of Agent strings that have yet to be reconciled with, and subsequently flagged as aliases (= merged) of, other Agent entries.

But this is also true for many other kinds of transcribed data (e.g. locality). So, do we really want dirty buckets for each module? I would rather suggest that digitization use only one dirty bucket, sitting in front of the collection management system. Separate workflows for quality control and verification would only then allow ingesting the data into the agents module (and other modules of the system).

So people working in high-throughput digitization won't enter anything into the collection management system directly. This, at least, is the plan at MfN.
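
A minimal sketch of that single staging bucket, with invented names, in which only verified records are ever ingested into the agents module (or any other module):

```python
from dataclasses import dataclass


@dataclass
class StagedRecord:
    raw_value: str                  # verbatim transcription, e.g. "M. Smith"
    target_module: str              # "agent", "locality", ...
    verified: bool = False
    resolved_id: str | None = None  # set during quality control / verification


class StagingBucket:
    """One dirty bucket in front of the whole collection management system."""

    def __init__(self) -> None:
        self.records: list[StagedRecord] = []

    def transcribe(self, raw_value: str, target_module: str) -> StagedRecord:
        # High-throughput digitization only ever writes here, never to a module.
        record = StagedRecord(raw_value, target_module)
        self.records.append(record)
        return record

    def verify(self, record: StagedRecord, resolved_id: str) -> None:
        # A separate quality-control workflow interprets the string.
        record.verified, record.resolved_id = True, resolved_id

    def ingest(self) -> list[StagedRecord]:
        # Only verified records leave the bucket for the target modules;
        # the dirt stays behind instead of persisting inside each module.
        ready = [r for r in self.records if r.verified]
        self.records = [r for r in self.records if not r.verified]
        return ready
```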

Besides that, we could keep things simple and assume that MVP 1.0 receives verified data only.