NatLibFi / bib-rdf-pipeline

Scripts and configuration for converting MARC bibliographic records into RDF
Creative Commons Zero v1.0 Universal

Merge persons with the same name when merging works #63

Closed: osma closed this issue 6 years ago

osma commented 6 years ago

Currently, when merging works, we leave the persons (creator, contributor, etc.) intact. This often produces duplicate entities. For example, for the work W00009584100 (Ajan lyhyt historia), we get two contributors named "Sagan, Carl" and two named "Miller, Ron".
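
A quick way to spot this kind of duplication is to list name literals that occur more than once in an N-Triples file. The following is only a rough sketch: the function name `duplicate_names` is hypothetical, and it assumes person names are expressed with `<http://schema.org/name>`, which this issue does not confirm. It also does not check that the repeated names belong to the same work.

```shell
# Hypothetical helper: read N-Triples on stdin and print every name literal
# that appears more than once.
# Assumption: names are asserted with the schema.org "name" property.
duplicate_names() {
  grep -F '<http://schema.org/name>' \
    | sed -e 's/^.*<http:\/\/schema\.org\/name> *//' -e 's/ *\.$//' \
    | sort \
    | uniq -d
}
```

It could be run as `duplicate_names < merged/kotona-merged.nt`; an empty result would mean no name literal is repeated in that file.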

osma commented 6 years ago

This is difficult because we merge one slice at a time, and we typically don't know the author of the work we are merging with, since that author is likely asserted in a different slice.

osma commented 6 years ago

Options:

  1. Do this kind of merging at the consolidation step, when we have all data available
  2. Calculate merge keys also for persons and merge them the same way we merge works
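
Option 2 could look roughly like the sketch below. It is not the pipeline's actual merge key logic: the function name `person_merge_key`, the normalization rules (lowercasing, replacing punctuation with spaces, collapsing whitespace) and the choice to scope the key by work ID are all assumptions made for illustration. Because the key depends only on the work ID and the name, it can be computed independently in each slice.

```shell
# Hypothetical helper: build a merge key for a person from a work ID and a name.
# Assumption: two persons attached to the same work whose names normalize to
# the same string should be merged.
# Note: tr only lowercases according to the current locale, so non-ASCII
# letters may keep their case.
person_merge_key() {
  work_id="$1"
  name="$2"
  # Lowercase, replace punctuation with spaces, collapse runs of whitespace,
  # and trim a leading or trailing space.
  normalized="$(printf '%s\n' "$name" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[[:punct:]]/ /g' \
          -e 's/[[:space:]][[:space:]]*/ /g' \
          -e 's/^ //' -e 's/ $//')"
  printf '%s/%s\n' "$work_id" "$normalized"
}

person_merge_key W00009584100 'Sagan, Carl'   # prints W00009584100/sagan carl
```

Spelling variants such as `'SAGAN,  Carl.'` map to the same key, while `'Miller, Ron'` gets a different one, so persons sharing a key could then be merged in the same way as works.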

osma commented 6 years ago

Here's a unit test for future use/adaptation.

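@test "Merge works: merge authors of the same work with the same name" {
  # Build the merged output for the test record set
  make merged/kotona-merged.nt
  # Count schema:author triples; once same-name authors are merged, 4 should remain
  count="$(grep -c -F '<http://schema.org/author>' merged/kotona-merged.nt)"
  [ "$count" -eq 4 ]
}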

osma commented 6 years ago

Person subjects and contributors are also duplicated in W00591682300.