Center-for-Research-Libraries / crl-serials-validator

Validate bibliographic and holdings data for shared print.
GNU General Public License v3.0

flag duplicate fields in output so they can be removed in 2_Preparation #34

Closed: AndyElliottCRL closed this issue 2 years ago

AndyElliottCRL commented 2 years ago

@awood on Teams 2021-11-08: There were something like 800+ duplicate fields in the report. I suspect they are in the incoming data? Most of the duplicate fields were 583$a completeness reviewed, but condition reviewed had some too. I'm not sure if committed to retain did. It was extremely difficult, using Excel, to pinpoint fully duplicate lines -- where every single field was the same -- because there are sometimes legitimate reasons to have multiple 583s for these actions. Can we implement checks to find duplicate fields, call them out in the validation reports, and delete them so we ingest a clean record?
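The exact-duplicate check described above can be sketched with the standard library alone. This is a hypothetical illustration, not the Validator's actual code: fields are modeled as simple `(tag, value)` tuples, and a field counts as a duplicate only when every character matches.

```python
from collections import Counter

def find_exact_duplicate_fields(fields):
    """Return fields that appear more than once in a record, with counts.

    `fields` is a list of (tag, value) tuples; two fields are duplicates
    only if the whole tuple matches exactly.
    """
    counts = Counter(fields)
    return {field: n for field, n in counts.items() if n > 1}

# Example: two identical 583s plus a distinct one.
record_fields = [
    ("583", "$a completeness reviewed $c 2021"),
    ("583", "$a completeness reviewed $c 2021"),  # exact duplicate
    ("583", "$a condition reviewed $c 2021"),      # legitimately different
]
print(find_exact_duplicate_fields(record_fields))
```

Because the match is on the whole tuple, records with multiple distinct 583s (the legitimate case mentioned above) are left alone; only character-for-character repeats are flagged.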

AndyElliottCRL commented 2 years ago

Paraphrasing @nflorin on Teams: We'd have to define what we mean by a duplicate -- deleting exact duplicates will be a lot easier than deleting almost-exact duplicates. The main operational question is where this should happen. We could use the Validator to flag issues and then add a small layer to a later step in the process to make the specific changes.

To avoid function creep, the Validator will flag duplicates, however defined, in its output, working with @tmoss-crl in the context of https://github.com/Center-for-Research-Libraries/2_Preparation/issues/5; that issue will handle actually removing or skipping duplicates, while the Validator's functionality remains strictly validation and reporting.

Discussion item for 2021-11-09T14:00 PAPR

AndyElliottCRL commented 2 years ago

Linking a semi-related closed issue about duplicate records as defined in the preprocessor: https://github.com/Center-for-Research-Libraries/PAPR-papr-working/issues/7 In the code as received from CDL, duplicate records are whole-string matches.
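Whole-string matching, as described for the CDL code, treats a record as a duplicate only when its entire serialized form repeats. A minimal sketch of that idea (the function name and record representation are assumptions for illustration):

```python
def dedupe_whole_records(raw_records):
    """Keep the first copy of each record; drop later whole-string matches.

    `raw_records` is a list of records serialized as strings. Records that
    differ by even one character are kept as distinct.
    """
    seen = set()
    unique = []
    for rec in raw_records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique
```

This is the cheapest possible definition of "duplicate": it catches verbatim repeats but misses near-duplicates, which is exactly the exact-vs-almost-exact distinction raised earlier in the thread.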

nflorin commented 2 years ago

I've created a PAPR output flag that makes it so all of the files other than the main review spreadsheet (good and bad MARC, the LHR worksheet) are only output if the user asks for them as a group. I'll create another file that lists duplicates only, and we'll see if that handles the issue. If not, I'll figure out a way to insert this into the review worksheet.

nflorin commented 2 years ago

This doesn't seem to be working. Right now I think we're only flagging duplicates on holdings IDs, which aren't even required. There is space in the main data structure to track duplicated bib IDs, local OCLCs, and WorldCat OCLCs. I'll make sure those are at least checked for and added as optional errors (warnings by default). That won't eliminate duplication, but it will allow the user to at least learn about it.

nflorin commented 2 years ago

I think I fixed this. We now have duplicate checks for local OCLC, WorldCat OCLC, bib ID, and holdings ID. Checks are scoped by institution and location, when we have a location. This means a local OCLC found at both location A and location B would not count as a duplicate.
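The per-location scoping described above can be sketched as keying each ID by `(institution, location, id)` before counting. This is a hedged illustration, not the Validator's actual implementation; the row layout and function name are assumptions:

```python
from collections import defaultdict

def find_duplicate_ids(rows, id_field):
    """Flag an ID as a duplicate only when it repeats within the same
    institution and location.

    `rows` are dicts with "institution", optional "location", and the ID
    field named by `id_field` (e.g. a local OCLC number).
    """
    seen = defaultdict(int)
    dupes = []
    for row in rows:
        key = (row["institution"], row.get("location"), row[id_field])
        seen[key] += 1
        if seen[key] > 1:
            dupes.append(key)
    return dupes

# The same OCLC at locations A and B is fine; a repeat at A is flagged.
rows = [
    {"institution": "CRL", "location": "A", "oclc": "111"},
    {"institution": "CRL", "location": "B", "oclc": "111"},  # different location: OK
    {"institution": "CRL", "location": "A", "oclc": "111"},  # same location: duplicate
]
print(find_duplicate_ids(rows, "oclc"))
```

Rows without a location fall back to `None` via `row.get("location")`, so for institutions that don't supply locations the check effectively runs institution-wide, matching the "if we have a location" caveat above.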