Create a FOLIO Workflow that identifies duplicate instances

rmathew1011 commented 6 months ago

This workflow should identify duplicate instances based on the following comparisons:

OCLC Match
ISBN Match
ISSN Match
LCCN Match
Call Number Match

The output of this workflow should be a report in the following format:

Each match should apply the following criteria:

OCLC Match

The OCLC numbers will need to be normalized.
- Only treat values with the (OCLoC) prefix as an OCLC number. Ignore all fields lacking this prefix.
- All non numerical characters shoud be removed.
- Trim all leading 0s
A Normalized OCLC number is considered to be a match when:
- The Normalized OCLC number from the field:035 ord:1 sf:a of the source record matches the Normalized OCLC number from the field:035 ord:1 sf:a of the target record
- The Normalized OCLC number from the field:035 ord:1 sf:a of the source record matches any of the Normalized OCLC numbers within the field:035 sf:z fields of the target record.
The srs_marctab table should be used for this comparison

ISBN Match

The ISBN will need to be normalized as follows:
- Trim everything after and including the first white space
- Take the last 10 characters
- Of those last 10 characters, remove the last character
- 9780134553351 (foo) : $16.50 -> 9780134553351 -> ~978~ 013455335 ~1~ -> 013455335
The Normalized ISBN of the source instance is considered a match if it is identical to the Normalized ISBN of the target instance.
The instance_identifiers table should be used for this comparison

ISSN Match

The ISSNs do not need to be normalized
The ISSN is considered to be a match when the field:022 ord:0 sf:a or the source instance is identical the field:022 ord:0 sf:a of the target record.
The srs_marctab table should be used for this comparison

LCCN Match

The LCCN's do not need to be normalized
The LCCN is considered to be a match if the source LCCN is identical to the target LCCN
The instance_identifiers table should be used for this comparison

Call Number Match

The call numbers do not need to be normalized
An source instance is considered to match a target instance if any of the source instance's holding's call numbers match any of the target instance's holding's call numbers.
This comparison should be done using the call_number, call_number_prefix, call_number_suffix columns of the holdings_ext table in folio reporting.

This needs a schedule worklow, at an anual cadence.

You can create working tables in the mis schema of the LDP.

The results should be emailed to a variable email address.

Rows should only be included if at least one of the matches is true.

Original Text

This script should identify duplicate instances. The specific criteria for to determine that instances are duplicates will be provided, and will most likely be a comparison of multiple data points on the two instances.

The script should combine the two instances by keeping the oldest of the two instances, and removing the newest. All holdings and items from the newest instance should be moved to the oldest instance.

Update: Create a workflow to accomplish the above report - sent as an email (csv as an attachment)

Additional requirements:

Add title and author field for both matching instances.

Report columns as:

hrid, hrid2, oclc, isbn, lccn, issn, call_number, title, title2, author, author2

Dbreck-TAMU commented 6 months ago

Working on the identification of duplicate records first to then better determine what the definition of a 'duplicate' record is. We may want to venture into a plan involved heuristics to determine what a duplicate is.

This workflow is meant to be the first step in a large, overarching workflow that will help the librarians maintain and keep their data clean at scale.

Utilizing the LDP data for now.

rmathew1011 commented 6 months ago

Implementation Strategy:

A CTE that creates a cached table with two columns: hrid and oclcs
- for each hrid we will use the public.srs_marctab table to create a concatenated string of all normalized OCLC numbers.
- The OCLC numbers that will be concatenated are the field:035 ord:1 sf:a, and all field:035 sf:z
- Normalization will retain all integers, and remove all alphanumerics, special characters, and spaces
The deliverable report will be a two column csv, with hrid and duplicates as the column names
- For each hrid, we will identify every hrid from the CTE where the OCLC (field:035 ord:1 sf:a from public.srs_marctab) is contained within the oclcs string
- Since there may be multiple duplicates for each hrid, the duplicates column should be a comma separated list of all hrids that are identified as duplicates.

rmathew1011 commented 6 months ago

IdentifyDuplicateInstances

TAMULib / fw-registry

Create a FOLIO Workflow that identifies duplicate instances #350