Open rmathew1011 opened 6 months ago
Working on the identification of duplicate records first, to then better determine what the definition of a 'duplicate' record is. We may want to venture into a plan involving heuristics to determine what a duplicate is.
This workflow is meant to be the first step in a larger, overarching workflow that will help the librarians keep their data clean at scale.
Utilizing the LDP data for now.
Implementation Strategy:
This workflow should identify duplicate instances based on the following comparisons:
The output of this workflow should be a report in the following format:
Each match should apply the following criteria:
OCLC Match
Treat only identifiers carrying the (OCoLC) prefix as OCLC numbers. Ignore all fields lacking this prefix.
ISBN Match
9780134553351 (foo) : $16.50
->9780134553351
->~978~ 013455335 ~1~
->013455335
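The ISBN steps above (drop the trailing qualifier text, then drop the 978 prefix and the check digit) could be sketched roughly as follows. This is only an illustration in Python, not the workflow's actual implementation; the function name `isbn_core` is hypothetical, and treating a 10-digit ISBN by dropping only its check digit is an assumption that goes beyond the single example given.

```python
import re
from typing import Optional


def isbn_core(raw: str) -> Optional[str]:
    """Return the 9-digit core of an ISBN value, or None if none is found.

    Hypothetical helper: mirrors the example in the issue, where
    "9780134553351 (foo) : $16.50" reduces to "013455335".
    """
    # Keep only the leading ISBN-like token; qualifiers such as
    # "(foo) : $16.50" after it are discarded.
    match = re.match(r"\s*([0-9Xx-]+)", raw)
    if not match:
        return None
    token = match.group(1).replace("-", "").upper()

    if len(token) == 13 and token.startswith("978"):
        # ISBN-13: drop the 978 prefix and the final check digit.
        core = token[3:12]
    elif len(token) == 10:
        # Assumption: an ISBN-10 only needs its check digit removed.
        core = token[:9]
    else:
        return None

    # The core must be purely numeric; "X" is only valid as a check digit.
    return core if core.isdigit() else None


print(isbn_core("9780134553351 (foo) : $16.50"))
```

Comparing on this 9-digit core lets an ISBN-10 and an ISBN-13 for the same title match each other, since their check digits differ.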
ISSN Match
LCCN Match
Call Number Match
This needs a scheduled workflow, run at an annual cadence.
You can create working tables in the mis schema of the LDP.
The results should be emailed to a variable email address.
Rows should only be included if at least one of the matches is true.
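As a rough sketch of the row filter, assuming each candidate pair is a dictionary with one boolean flag per match type: the column names used here (`instance_a`, `instance_b`, `oclc_match`, and so on) are hypothetical placeholders, not the report columns specified in this issue.

```python
import csv
import io

# Hypothetical flag names, one per match type listed in this issue.
MATCH_FLAGS = (
    "oclc_match",
    "isbn_match",
    "issn_match",
    "lccn_match",
    "call_number_match",
)

# Hypothetical report layout: two instance identifiers plus the flags.
FIELDNAMES = ("instance_a", "instance_b") + MATCH_FLAGS


def rows_with_any_match(pairs):
    """Keep only the pairs where at least one match flag is true."""
    return [p for p in pairs if any(p.get(flag) for flag in MATCH_FLAGS)]


def to_csv(rows):
    """Render the rows as CSV text, suitable for an email attachment."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDNAMES, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


candidates = [
    {"instance_a": "a1", "instance_b": "b1", "isbn_match": True},
    {"instance_a": "a2", "instance_b": "b2"},  # no match, so it is dropped
]
print(to_csv(rows_with_any_match(candidates)))
```

In the LDP itself the same filter would more likely be a `WHERE` clause that ORs the match columns together; the Python version only illustrates the rule.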
Original Text
This script should identify duplicate instances. The specific criteria for determining that instances are duplicates will be provided, and will most likely be a comparison of multiple data points on the two instances.
The script should combine the two instances by keeping the older of the two and removing the newer. All holdings and items from the newer instance should be moved to the older instance.
Update: Create a workflow that produces the above report and sends it as an email (CSV as an attachment).
Additional requirements:
Add title and author fields for both matching instances.
Report columns as: