This workflow should identify duplicate instances based on the following comparisons:
OCLC Match
ISBN Match
ISSN Match
LCCN Match
Call Number Match
The output of this workflow should be a report in the following format:
This is closely followed excluding the Match in the column headers and call number column will not be included. Additional columns comparing title and author have been added as well.
OCLC Match
[x] 1. The OCLC numbers will need to be normalized.
[x] a. Only treat values with the (OCLoC) prefix as an OCLC number. Ignore all fields lacking this prefix.
[x] b. All non numerical characters shoud be removed.
[x] c. Trim all leading 0s
[x] 2. A Normalized OCLC number is considered to be a match when:
[x] a. The Normalized OCLC number from the field:035 ord:1 sf:a of the source record matches the Normalized OCLC number from the field:035 ord:1 sf:a of the target record
[x] b. The Normalized OCLC number from the field:035 ord:1 sf:a of the source record matches any of the Normalized OCLC numbers within the field:035 sf:z fields of the target record.
[x] c. The srs_marctab table should be used for this comparison
ISBN Match
[x] 1. The ISBN will need to be normalized as follows:
[x] a. Trim everything after and including the first white space
[x] b. Take the last 10 characters
[x] c. Of those last 10 characters, remove the last character
[x] 2. The Normalized ISBN of the source instance is considered a match if it is identical to the Normalized ISBN of the target instance.
[x] 3. The instance_identifiers table should be used for this comparison
ISSN Match
[x] 1. The ISSNs do not need to be normalized
[ ] 2. The ISSN is considered to be a match when the field:022 ord:0 sf:a or the source instance is identical the field:022 ord:0 sf:a of the target record.
[x] 3. The srs_marctab table should be used for this comparison
LCCN Match
[x] 1. The LCCN's do not need to be normalized
[x] 2. The LCCN is considered to be a match if the source LCCN is identical to the target LCCN
[x] 3. The instance_identifiers table should be used for this comparison
Call Number Match
[ ] 1. The call numbers do not need to be normalized
[ ] 2. An source instance is considered to match a target instance if any of the source instance's holding's call numbers match any of the target instance's holding's call numbers.
[ ] 3. This comparison should be done using the call_number, call_number_prefix, call_number_suffix columns of the holdings_ext table in folio reporting.
Caveats:
There are far too many call number matches to include in this report
ISBN require additional normalization to handle semi-colon and dollar sign
ISSN only had content for ord:1
Here is the instance duplicate report query excluding matches on call number:
WITH oclc_with_title AS (
SELECT ie.instance_hrid, LTRIM(REGEXP_REPLACE(SUBSTRING(sm.content FROM 8), '[^0-9]', '', 'g'), '0') AS oclc, ie.title
FROM folio_reporting.instance_ext ie
JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
WHERE sm.field = '035' AND sm.ord = 1 AND sm.sf IN ('a', 'z') AND sm.content LIKE '(OCoLC)%'
),
oclc AS (
SELECT oclcwt.instance_hrid, oclcwt.oclc, oclcwt.title, sm.content AS author
FROM oclc_with_title oclcwt
JOIN public.srs_marctab sm ON oclcwt.instance_hrid = sm.instance_hrid
WHERE sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
),
oclc_matches AS (
SELECT l.instance_hrid AS hrid, r.instance_hrid AS hrid2, l.title AS title, r.title AS title2, l.author AS author, r.author AS author2, 'oclc' AS match_type
FROM oclc l
JOIN oclc r ON l.oclc = r.oclc AND l.instance_hrid < r.instance_hrid
),
isbn AS (
SELECT ie.instance_hrid, NULLIF(LEFT(RIGHT(REGEXP_REPLACE(ii.identifier, ' .*', ''), 10), 9), ':') AS isbn, ie.title, sm.content AS author
FROM folio_reporting.instance_ext ie
JOIN folio_reporting.instance_identifiers ii ON ie.instance_hrid = ii.instance_hrid
JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
WHERE ii.identifier_type_name = 'ISBN' AND ii.identifier NOT SIMILAR TO '(:|$)%'
AND sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
),
isbn_matches AS (
SELECT l.instance_hrid AS hrid, r.instance_hrid AS hrid2, l.title AS title, r.title AS title2, l.author AS author, r.author AS author2, 'isbn' AS match_type
FROM isbn l
JOIN isbn r ON l.isbn = r.isbn AND l.instance_hrid < r.instance_hrid
),
lccn AS (
SELECT ie.instance_hrid, ii.identifier AS lccn, ie.title, sm.content AS author
FROM folio_reporting.instance_ext ie
JOIN folio_reporting.instance_identifiers ii ON ie.instance_hrid = ii.instance_hrid
JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
WHERE ii.identifier_type_name = 'LCCN'
AND sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
),
lccn_matches AS (
SELECT l.instance_hrid AS hrid, r.instance_hrid AS hrid2, l.title AS title, r.title AS title2, l.author AS author, r.author AS author2, 'lccn' AS match_type
FROM lccn l
JOIN lccn r ON l.lccn = r.lccn AND l.instance_hrid < r.instance_hrid
),
issn_with_title AS (
SELECT ie.instance_hrid, sm.content AS issn, ie.title
FROM folio_reporting.instance_ext ie
JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
WHERE field = '022' AND ord = 1 AND sf = 'a'
),
issn AS (
SELECT issnwt.instance_hrid, issnwt.issn, issnwt.title, sm.content AS author
FROM issn_with_title issnwt
JOIN public.srs_marctab sm ON issnwt.instance_hrid = sm.instance_hrid
WHERE sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
),
issn_matches AS (
SELECT l.instance_hrid AS hrid, r.instance_hrid AS hrid2, l.title AS title, r.title AS title2, l.author AS author, r.author AS author2, 'issn' AS match_type
FROM issn l
JOIN issn r ON l.issn = r.issn AND l.instance_hrid < r.instance_hrid
),
all_matches AS (
SELECT hrid, hrid2, title, title2, author, author2, match_type FROM oclc_matches
UNION ALL SELECT hrid, hrid2, title, title2, author, author2, match_type FROM isbn_matches
UNION ALL SELECT hrid, hrid2, title, title2, author, author2, match_type FROM lccn_matches
UNION ALL SELECT hrid, hrid2, title, title2, author, author2, match_type FROM issn_matches
)
SELECT
hrid AS HRID,
hrid2 AS HRID2,
MAX(CASE WHEN match_type = 'oclc' THEN 'T' END) AS OCLC,
MAX(CASE WHEN match_type = 'isbn' THEN 'T' END) AS ISBN,
MAX(CASE WHEN match_type = 'issn' THEN 'T' END) AS ISSN,
MAX(CASE WHEN match_type = 'lccn' THEN 'T' END) AS LCCN,
'"' || REPLACE(title, '"', '""') || '"' AS TITLE,
'"' || REPLACE(title2, '"', '""') || '"' AS TITLE2,
'"' || REPLACE(author, '"', '""') || '"' AS AUTHOR,
'"' || REPLACE(author2, '"', '""') || '"' AS AUTHOR2
FROM all_matches
GROUP BY hrid, hrid2, title, title2, author, author2;
Here is the complete instance duplication report query with call number matches:
very slowvery sparse report
WITH filtered_holdings AS (
SELECT instance_id, call_number, call_number_prefix, call_number_suffix
FROM folio_reporting.holdings_ext
WHERE he.call_number IS NOT NULL AND he.call_number !~ '^\s*$' AND he.call_number_type_id IS NOT NULL
),
filtered_marctab AS (
SELECT instance_hrid, content
FROM public.srs_marctab
WHERE field = '100' AND ord = 1 AND sf = 'a'
),
call_number AS (
SELECT ie.instance_hrid, TRIM(CONCAT_WS(' ', he.call_number_prefix, he.call_number, he.call_number_suffix)) AS full_call_number, ie.title, sm.content AS author
FROM folio_reporting.instance_ext ie
JOIN filtered_holdings he ON ie.instance_id = he.instance_id
JOIN filtered_marctab sm ON ie.instance_hrid = sm.instance_hrid
),
call_number_matches AS (
SELECT l.instance_hrid AS hrid, r.instance_hrid AS hrid2, l.title AS title, r.title AS title2, l.author AS author, r.author AS author2, 'call_number' AS match_type
FROM call_number l
JOIN call_number r ON l.full_call_number = r.full_call_number AND l.instance_hrid < r.instance_hrid
),
oclc_with_title AS (
SELECT ie.instance_hrid, LTRIM(REGEXP_REPLACE(SUBSTRING(sm.content FROM 8), '[^0-9]', '', 'g'), '0') AS oclc, ie.title
FROM folio_reporting.instance_ext ie
JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
WHERE sm.field = '035' AND sm.ord = 1 AND sm.sf IN ('a', 'z') AND sm.content LIKE '(OCoLC)%'
),
oclc AS (
SELECT oclcwt.instance_hrid, oclcwt.oclc, oclcwt.title, sm.content AS author
FROM oclc_with_title oclcwt
JOIN public.srs_marctab sm ON oclcwt.instance_hrid = sm.instance_hrid
WHERE sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
),
oclc_matches AS (
SELECT l.instance_hrid AS hrid, r.instance_hrid AS hrid2, l.title AS title, r.title AS title2, l.author AS author, r.author AS author2, 'oclc' AS match_type
FROM oclc l
JOIN oclc r ON l.oclc = r.oclc AND l.instance_hrid < r.instance_hrid
),
isbn AS (
SELECT ie.instance_hrid, NULLIF(LEFT(RIGHT(REGEXP_REPLACE(ii.identifier, ' .*', ''), 10), 9), ':') AS isbn, ie.title, sm.content AS author
FROM folio_reporting.instance_ext ie
JOIN folio_reporting.instance_identifiers ii ON ie.instance_hrid = ii.instance_hrid
JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
WHERE ii.identifier_type_name = 'ISBN' AND ii.identifier NOT SIMILAR TO '(:|$)%'
AND sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
),
isbn_matches AS (
SELECT l.instance_hrid AS hrid, r.instance_hrid AS hrid2, l.title AS title, r.title AS title2, l.author AS author, r.author AS author2, 'isbn' AS match_type
FROM isbn l
JOIN isbn r ON l.isbn = r.isbn AND l.instance_hrid < r.instance_hrid
),
lccn AS (
SELECT ie.instance_hrid, ii.identifier AS lccn, ie.title, sm.content AS author
FROM folio_reporting.instance_ext ie
JOIN folio_reporting.instance_identifiers ii ON ie.instance_hrid = ii.instance_hrid
JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
WHERE ii.identifier_type_name = 'LCCN'
AND sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
),
lccn_matches AS (
SELECT l.instance_hrid AS hrid, r.instance_hrid AS hrid2, l.title AS title, r.title AS title2, l.author AS author, r.author AS author2, 'lccn' AS match_type
FROM lccn l
JOIN lccn r ON l.lccn = r.lccn AND l.instance_hrid < r.instance_hrid
),
issn_with_title AS (
SELECT ie.instance_hrid, sm.content AS issn, ie.title
FROM folio_reporting.instance_ext ie
JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
WHERE field = '022' AND ord = 1 AND sf = 'a'
),
issn AS (
SELECT issnwt.instance_hrid, issnwt.issn, issnwt.title, sm.content AS author
FROM issn_with_title issnwt
JOIN public.srs_marctab sm ON issnwt.instance_hrid = sm.instance_hrid
WHERE sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
),
issn_matches AS (
SELECT l.instance_hrid AS hrid, r.instance_hrid AS hrid2, l.title AS title, r.title AS title2, l.author AS author, r.author AS author2, 'issn' AS match_type
FROM issn l
JOIN issn r ON l.issn = r.issn AND l.instance_hrid < r.instance_hrid
),
all_matches AS (
SELECT hrid, hrid2, title, title2, author, author2, match_type FROM call_number_matches
UNION ALL SELECT hrid, hrid2, title, title2, author, author2, match_type FROM oclc_matches
UNION ALL SELECT hrid, hrid2, title, title2, author, author2, match_type FROM isbn_matches
UNION ALL SELECT hrid, hrid2, title, title2, author, author2, match_type FROM lccn_matches
UNION ALL SELECT hrid, hrid2, title, title2, author, author2, match_type FROM issn_matches
)
SELECT
hrid AS HRID,
hrid2 AS HRID2,
MAX(CASE WHEN match_type = 'oclc' THEN 'T' END) AS OCLC,
MAX(CASE WHEN match_type = 'isbn' THEN 'T' END) AS ISBN,
MAX(CASE WHEN match_type = 'issn' THEN 'T' END) AS ISSN,
MAX(CASE WHEN match_type = 'call_number' THEN 'T' END) AS CALL_NUMBER,
MAX(CASE WHEN match_type = 'lccn' THEN 'T' END) AS LCCN,
'"' || REPLACE(title, '"', '""') || '"' AS TITLE,
'"' || REPLACE(title2, '"', '""') || '"' AS TITLE2,
'"' || REPLACE(author, '"', '""') || '"' AS AUTHOR,
'"' || REPLACE(author2, '"', '""') || '"' AS AUTHOR2
FROM all_matches
GROUP BY hrid, hrid2, title, title2, author, author2;
Here are the individual identifier match queries:
Call Number
WITH call_number AS (
SELECT ie.instance_hrid, he.call_number_type_id, he.call_number_type_name, he.call_number, TRIM(CONCAT_WS(' ', he.call_number_prefix, he.call_number, he.call_number_suffix)) AS full_call_number, ie.title, sm.content AS author
FROM folio_reporting.instance_ext ie
JOIN folio_reporting.holdings_ext he ON ie.instance_id = he.instance_id
JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
WHERE he.call_number IS NOT NULL AND he.call_number !~ '^\s*$' AND he.call_number_type_id IS NOT NULL
AND sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
)
SELECT
l.instance_hrid AS hrid,
r.instance_hrid AS hrid2,
l.call_number AS call_number,
r.call_number AS call_number2,
l.full_call_number,
'"' || REPLACE(l.title, '"', '""') || '"' AS title,
'"' || REPLACE(r.title, '"', '""') || '"' AS title2,
'"' || REPLACE(l.author, '"', '""') || '"' AS author,
'"' || REPLACE(r.author, '"', '""') || '"' AS author2
FROM call_number l
JOIN call_number r ON l.full_call_number = r.full_call_number AND l.instance_hrid < r.instance_hrid;
ISBN
WITH isbn AS (
SELECT ie.instance_hrid, ii.identifier, NULLIF(LEFT(RIGHT(REGEXP_REPLACE(ii.identifier, ' .*', ''), 10), 9), ':') AS isbn, ie.title, sm.content AS author
FROM folio_reporting.instance_ext ie
JOIN folio_reporting.instance_identifiers ii ON ie.instance_hrid = ii.instance_hrid
JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
WHERE ii.identifier_type_name = 'ISBN' AND ii.identifier NOT SIMILAR TO '(:|$)%'
AND sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
)
SELECT
l.instance_hrid AS hrid,
r.instance_hrid AS hrid2,
l.identifier,
l.isbn,
'"' || REPLACE(l.title, '"', '""') || '"' AS title,
'"' || REPLACE(r.title, '"', '""') || '"' AS title2,
'"' || REPLACE(l.author, '"', '""') || '"' AS author,
'"' || REPLACE(r.author, '"', '""') || '"' AS author2
FROM isbn l
JOIN isbn r ON l.isbn = r.isbn AND l.instance_hrid < r.instance_hrid;
LCCN
WITH lccn AS (
SELECT ie.instance_hrid, ii.identifier AS lccn, ie.title, sm.content AS author
FROM folio_reporting.instance_ext ie
JOIN folio_reporting.instance_identifiers ii ON ie.instance_hrid = ii.instance_hrid
JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
WHERE ii.identifier_type_name = 'LCCN'
AND sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
)
SELECT
l.instance_hrid AS hrid,
r.instance_hrid AS hrid2,
l.lccn,
'"' || REPLACE(l.title, '"', '""') || '"' AS title,
'"' || REPLACE(r.title, '"', '""') || '"' AS title2,
'"' || REPLACE(l.author, '"', '""') || '"' AS author,
'"' || REPLACE(r.author, '"', '""') || '"' AS author2
FROM lccn l
JOIN lccn r ON l.lccn = r.lccn AND l.instance_hrid < r.instance_hrid;
ISSN
WITH issn_with_title AS (
SELECT ie.instance_hrid, sm.content AS issn, ie.title
FROM folio_reporting.instance_ext ie
JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
WHERE field = '022' AND ord = 1 AND sf = 'a'
),
issn AS (
SELECT issnwt.instance_hrid, issnwt.issn, issnwt.title, sm.content AS author
FROM issn_with_title issnwt
JOIN public.srs_marctab sm ON issnwt.instance_hrid = sm.instance_hrid
WHERE sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
)
SELECT
l.instance_hrid AS hrid,
r.instance_hrid AS hrid2,
l.issn,
'"' || REPLACE(l.title, '"', '""') || '"' AS title,
'"' || REPLACE(r.title, '"', '""') || '"' AS title2,
'"' || REPLACE(l.author, '"', '""') || '"' AS author,
'"' || REPLACE(r.author, '"', '""') || '"' AS author2
FROM issn l
JOIN issn r ON l.issn = r.issn AND l.instance_hrid < r.instance_hrid;
OCLC
WITH oclc_with_title AS (
SELECT ie.instance_hrid, LTRIM(REGEXP_REPLACE(SUBSTRING(sm.content FROM 8), '[^0-9]', '', 'g'), '0') AS oclc, ie.title
FROM folio_reporting.instance_ext ie
JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
WHERE sm.field = '035' AND sm.ord = 1 AND sm.sf IN ('a', 'z') AND sm.content LIKE '(OCoLC)%'
),
oclc AS (
SELECT oclcwt.instance_hrid, oclcwt.oclc, oclcwt.title, sm.content AS author
FROM oclc_with_title oclcwt
JOIN public.srs_marctab sm ON oclcwt.instance_hrid = sm.instance_hrid
WHERE sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
)
SELECT
l.instance_hrid AS hrid,
r.instance_hrid AS hrid2,
l.oclc,
'"' || REPLACE(l.title, '"', '""') || '"' AS title,
'"' || REPLACE(r.title, '"', '""') || '"' AS title2,
'"' || REPLACE(l.author, '"', '""') || '"' AS author,
'"' || REPLACE(r.author, '"', '""') || '"' AS author2
FROM oclc l
JOIN oclc r ON l.oclc = r.oclc AND l.instance_hrid < r.instance_hrid;
This workflow should identify duplicate instances based on the following comparisons:
The output of this workflow should be a report in the following format:
OCLC Match
(OCLoC)
prefix as an OCLC number. Ignore all fields lacking this prefix.ISBN Match
9780134553351 (foo) : $16.50
->9780134553351
->~978~ 013455335 ~1~
->013455335
ISSN Match
LCCN Match
Call Number Match
Caveats:
Here is the instance duplicate report query excluding matches on call number:
Here is the complete instance duplication report query with call number matches:
very slow very sparse report
Here are the individual identifier match queries:
Call Number
ISBN
LCCN
ISSN
OCLC