NatLibFi / RecordManager

A metadata record management system written in PHP, intended to be used in conjunction with VuFind or another Solr-based discovery interface. Provides import, export, harvesting (OAI-PMH), normalization, deduplication and Solr index update functionality with support for multiple metadata formats. Also includes an OAI-PMH provider that can be used to access the data stored in RecordManager database. Functionality driven by simple command line programs for easy automation.
GNU General Public License v2.0

Check that a candidate for deduplication is in a source that is configured for deduplication #156

Closed jschultze closed 7 months ago

jschultze commented 7 months ago

We experienced the following behaviour with deduplication:

Sources configured:

When running deduplication (with or without explicitly stating the sources to be deduplicated with --source), records from sources 1 to 3 were deduplicated not only within this group, but also against source 4. We were expecting only records from sources that are configured for deduplication to be deduplicated.

RecordManager seems to fetch deduplication candidates from the whole database. The additional code checks whether the source of each deduplication candidate is configured for deduplication.
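To illustrate the idea only (this is not the actual patch, which lives in RecordManager's PHP code), here is a minimal Python sketch of such a check. The `Record` class, the `filter_dedup_candidates` helper, the source names and the shape of the settings dictionary are all hypothetical and chosen just for this example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for a stored record."""
    id: str
    source_id: str


def filter_dedup_candidates(candidates, source_settings):
    """Keep only candidates whose source has dedup enabled.

    Sources missing from the settings are treated as dedup-disabled.
    """
    return [
        c for c in candidates
        if source_settings.get(c.source_id, {}).get("dedup", False)
    ]


# Assumed settings mirroring the situation described above:
# three sources with dedup enabled, one without.
SOURCE_SETTINGS = {
    "source1": {"dedup": True},
    "source2": {"dedup": True},
    "source3": {"dedup": True},
    "source4": {"dedup": False},
}

candidates = [
    Record("a", "source1"),
    Record("b", "source4"),  # would be matched without the check
    Record("c", "source3"),
    Record("d", "unknown"),  # source not configured at all
]

kept = filter_dedup_candidates(candidates, SOURCE_SETTINGS)
print([r.id for r in kept])
```

With a check like this, a candidate from source 4 is skipped even if it is still returned by the candidate lookup.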

EreMaijala commented 7 months ago

@jschultze Has source 4 had dedup = true at some point? What the docs fail to explain properly is that if you turn dedup on or off, you need to run renormalize on the source to update the dedup keys. Sources that have dedup disabled shouldn't have dedup keys, so the records should not be found in deduplication. Regardless, the check here makes sense, but I just wanted to get to the bottom of the issue.

EreMaijala commented 7 months ago

(Wiki updated with a note to run renormalize)

jschultze commented 7 months ago

@EreMaijala Thanks for the explanation! Yes, I think that source 4 had the dedup flag set to true at first and I have not run the renormalize command since, so that is probably the reason.

I will execute the renormalization to clean the database.

EreMaijala commented 7 months ago

Oops, there's a style problem. Can you fix that too?

jschultze commented 7 months ago

The whitespace has been removed.