NatLibFi / RecordManager

A metadata record management system written in PHP, intended to be used in conjunction with VuFind or another Solr-based discovery interface. Provides import, export, harvesting (OAI-PMH), normalization, deduplication and Solr index update functionality with support for multiple metadata formats. Also includes an OAI-PMH provider that can be used to access the data stored in RecordManager database. Functionality driven by simple command line programs for easy automation.
GNU General Public License v2.0

Records in dedup will not be removed and deduplication fails #148

Closed. SeedDMS closed this issue 11 months ago.

SeedDMS commented 11 months ago

We had some problems with deduplication after updating a data source: those records simply were not deduplicated anymore. I tried to boil it down to a simple problem with two data sources.

./console records:import DS1 /tmp/DS1.xml
./console records:import DS2 /tmp/DS2.xml
./console records:deduplicate

At this point everything is fine. The table dedup contains 5336 records. Next I run

 ./console records:mark-deleted --source=DS1

This marks all records in record and dedup as deleted. The ids field in the dedup record is also cleared, and there is no reference to the dedup record in record.dedup_id anymore. That looks OK as well. Then I try to actually purge the deleted records.

 ./console records:purge-deleted --source=DS1

which doesn't do anything. I would expect the records previously marked as deleted to be removed, but both tables, dedup and record, remain unchanged. That doesn't appear to be a real problem, so I import DS1 again.

./console records:import DS1 /tmp/DS1.xml

and all the formerly marked as deleted records in table record aren't marked as deleted anymore. So I tried a

./console records:deduplicate

again, but that doesn't do anything. It doesn't even try to deduplicate. What went wrong? And secondly, why have those records in table dedup never been deleted? They are basically empty shells, marked as deleted and not referencing any record.

I'm using mysql.

EreMaijala commented 11 months ago

Thanks for the report! The empty deleted dedup records are expected. They ensure that when RecordManager is used to update a Solr index, any dedup records marked deleted are also deleted from the index. For the same reason deleted records are kept in RecordManager's database for a while. The default retention period is 14 days, but you can use the days-to-keep parameter with purge-deleted to control it.
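The retention rule may also explain why purge-deleted appeared to do nothing in the report above: the records had been marked deleted only moments earlier. A minimal sketch of such a rule, assuming the purge compares each record's deletion time against a cutoff. The function name is_purgeable and its epoch-second arguments are hypothetical and are not taken from RecordManager's code; whether the real comparison is strict is also an assumption.

```shell
# Hypothetical illustration of a retention check, not RecordManager's actual code.
# Arguments: deletion time (epoch seconds), current time (epoch seconds),
# and an optional retention period in days (defaults to 14, as in the thread).
# Returns success (0) only if the record is older than the retention period.
is_purgeable() {
  deleted_at=$1
  now=$2
  days_to_keep=${3:-14}
  # Anything deleted before this cutoff is considered old enough to purge.
  cutoff=$(( now - days_to_keep * 86400 ))
  [ "$deleted_at" -lt "$cutoff" ]
}

# Usage: a record marked deleted one minute ago is kept under the default
# retention, but would qualify with a retention of 0 days.
# is_purgeable "$(( $(date +%s) - 60 ))" "$(date +%s)"     # fails (kept)
# is_purgeable "$(( $(date +%s) - 60 ))" "$(date +%s)" 0   # succeeds (purgeable)
```

Under this assumption, a purge run right after mark-deleted finds nothing older than the cutoff, which would match the unchanged tables described in the report.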

Records not getting deduplicated when you re-imported DS1.xml is not expected, however, so I'll need to investigate that. I'll update when I have more information. In a pinch you could use ./console records:deduplicate --all --source=DS1 to force deduplication.

SeedDMS commented 11 months ago

Thanks for the ./console records:deduplicate --all --source=DS1 hint. That fixes the deduplication problem in my simplified example. It may even help with my original problem, where 20 data sources were imported. The initial import and deduplication always worked, but updating some of the sources led to more and more duplicates. So I'll try to force deduplication.
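With many data sources, the forced deduplication could be run once per source in a loop. A small sketch, assuming it is run from the RecordManager directory: the function name force_dedup and the CONSOLE variable are hypothetical conveniences, and only the records:deduplicate --all --source= invocation comes from this thread.

```shell
# Hypothetical wrapper: force deduplication for each data source id given.
# CONSOLE can be overridden (e.g. for a dry run); it defaults to ./console.
force_dedup() {
  for src in "$@"; do
    # Same command as suggested in the thread, applied to one source at a time.
    "${CONSOLE:-./console}" records:deduplicate --all --source="$src" || return 1
  done
}

# Usage (source ids are examples):
# force_dedup DS1 DS2
# Dry run that only prints the arguments instead of calling ./console:
# CONSOLE=echo; force_dedup DS1 DS2
```

The loop stops at the first failing source so that an error is not hidden by later runs.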

EreMaijala commented 11 months ago

I've committed a fix for the example case. If your update procedure involved marking all deleted and then loading a new file, this should fix the case as well. However, if you're seeing trouble getting newly added records to be deduplicated, there must be something else amiss.

SeedDMS commented 11 months ago

Just a final note: we are also running a version 1.9 RecordManager, and we discovered the same behaviour there. Backporting the commit into Base/Controller/StoreRecordTrait.php seems to have fixed it as well.