LibreCat / Dancer-Plugin-Catmandu-OAI

OAI-PMH provider backed by a searchable Catmandu::Store
http://librecat.org

Problem with datestamps and earliestDatestamp #12

Open snorri opened 7 years ago

snorri commented 7 years ago

According to the OAI specification: "A repository must update the datestamp of a record if a change occurs, the result of which would be a change to the metadata part of the XML-encoding of the record. Such changes include, but are not limited to, changes to the metadata of the record, changes to the metadata format of the record, introduction of a new metadata format, termination of support for a metadata format, etc."

So in practice the earliestDatestamp will often end up being the time at which one of the formats was last changed.

Suppose we change one of the formats of an OAI service on an existing database. As the OAI plugin is set up now, this requires re-indexing all the records in Elasticsearch, since the OAI datestamp field has to be updated to reflect the format change. The re-indexing has to be complete before the OAI service can serve the records in the modified format; otherwise selective harvesting (using from and until) will not work correctly. But there is a dilemma: we can't reliably set the date of the format change until we make the switch in the OAI service. So we would either have to take the OAI service offline during the indexing, or else risk breaking selective harvesting for harvesters that access the service during this time.

Solution

We can avoid re-indexing for format changes if we instead update the datestamp in the response, on the fly, for each record when needed: if the stored datestamp is earlier than the earliestDatestamp, it is set to earliestDatestamp in the OAI output. The query for selective harvesting also needs to be adjusted so that it still works correctly. The from condition needs to be removed from the CQL query if the from date is equal to or earlier than the earliestDatestamp. If the until date is earlier than the earliestDatestamp, no results should be returned.

This should be easy to implement, and it wouldn't break how things work now, assuming the record datestamps and earliestDatestamp are managed correctly.
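The two adjustments described above can be sketched roughly as follows. This is only an illustration in Python, not the plugin's actual code: the function names `effective_datestamp` and `build_cql` are hypothetical, the field name `datestamp` is assumed, and all dates are assumed to be UTC strings of the same granularity so that plain string comparison orders them correctly.

```python
from typing import Optional


def effective_datestamp(stored: str, earliest: str) -> str:
    """Datestamp to emit in the OAI header: never earlier than earliestDatestamp."""
    # Same-granularity ISO 8601 UTC strings sort chronologically as plain strings.
    return max(stored, earliest)


def build_cql(from_: Optional[str], until: Optional[str], earliest: str,
              field: str = "datestamp") -> Optional[str]:
    """Return a CQL date filter for selective harvesting.

    Returns None when no record can match, and an empty string when no
    date restriction is needed at all.
    """
    if until is not None and until < earliest:
        # Every emitted datestamp is >= earliest, so nothing can qualify.
        return None
    clauses = []
    # Keep the from condition only if it is later than earliestDatestamp.
    if from_ is not None and from_ > earliest:
        clauses.append(f'{field} >= "{from_}"')
    if until is not None:
        clauses.append(f'{field} <= "{until}"')
    return " AND ".join(clauses)
```

A caller would skip the search entirely when `build_cql` returns None, and would apply `effective_datestamp` to every record header before serializing it.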

Does this sound reasonable?

phochste commented 7 years ago

It doesn't fix the problem. To be very strict in OAI-PMH, you need to update all the datestamps in the database.

The earliestDatestamp trick will only work if from contains exactly the value of earliestDatestamp. If from is a date later than earliestDatestamp, then only recent records are returned, which is wrong. A format change affects all the records.

snorri commented 7 years ago

On 2017-03-10 08:46, Patrick Hochstenbach wrote:

> Does this solve the problem?

> As an example, you have records in your database with the dates

> 2015-01-01 2016-01-01 2017-01-01

> You set the earliestDatestamp to 2016-12-25. Now when querying:

> from=2016-12-25

> then all three records are returned (as you explained). The from condition is removed, so any datestamp will match.

> But, if one requests:

> from=2016-12-31

> then only the 2017-01-01 record would be returned. The from date is later than the earliestDatestamp, so the date will be used in querying the index, which gives the wrong results.

I'm not following you here. The result SHOULD BE just the 2017-01-01 record from the resulting query 'datestamp >= "2016-12-31"'. For the other records, the OAI datestamp would be equal to earliestDatestamp (2016-12-25), which is earlier than "2016-12-31", and therefore they should not be in the result.

If earliestDatestamp is set to the date of a format change (2016-12-25), then a record with the later datestamp '2017-01-01' has been changed AFTER the format change. The format change resulted in updated datestamps (set to earliestDatestamp) for all records at the time of the format change. You have to set the earliestDatestamp to the time the OAI-PMH service starts serving the format change, and at that point the OAI datestamps of all the records have to be updated.

> I think that, to be strictly correct in OAI-PMH, any metadata format change needs to trigger a re-indexing of the database.

As far as the input and output of the OAI-PMH service are concerned, it would be the same. Isn't how we INTERNALLY store and interpret datestamps in an underlying database just an implementation detail?
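One way to check this equivalence claim is to compare the two strategies on the three example records. The sketch below is hypothetical Python, not plugin code: `harvest_reindexed` simulates rewriting every old datestamp in the database, `harvest_on_the_fly` leaves the stored values alone and adjusts the query and the output, and the list of probe dates is an arbitrary assumption.

```python
from itertools import product

EARLIEST = "2016-12-25"  # date of the format change in the example
STORED = ["2015-01-01", "2016-01-01", "2017-01-01"]


def harvest_reindexed(stored, earliest, from_, until):
    """Strategy A: every datestamp older than `earliest` was rewritten in the store."""
    updated = [max(d, earliest) for d in stored]
    return sorted(d for d in updated if from_ <= d <= until)


def harvest_on_the_fly(stored, earliest, from_, until):
    """Strategy B: stored datestamps are untouched; query and output are adjusted."""
    if until < earliest:
        return []
    # Drop the from condition when it is not later than earliestDatestamp.
    lower = from_ if from_ > earliest else ""
    hits = [d for d in stored if lower <= d <= until]
    # Clamp the datestamps that are reported to the harvester.
    return sorted(max(d, earliest) for d in hits)


probe = ["2014-06-01", "2015-01-01", "2016-12-25",
         "2016-12-31", "2017-01-01", "2018-01-01"]
for f, u in product(probe, probe):
    if f <= u:
        assert harvest_reindexed(STORED, EARLIEST, f, u) == \
            harvest_on_the_fly(STORED, EARLIEST, f, u)
```

For every from/until pair tried here, both strategies report the same datestamps to the harvester, which is the sense in which the storage choice is internal.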

phochste commented 7 years ago

I guess if a format changes (e.g. going from MODS3 to MODS3.1), it affects all the records in your database, old, new, whatever, if these records are dynamically serialized. Datestamps are what you publicly say about your record. Think of it as a timestamp, a record, and a checksum. If the checksum a harvester calculates on the record changes, it must mean that the record was updated. If the server doesn't tell you that in a datestamp, then it is not strictly OAI-compliant.

nics commented 7 years ago

Hi @snorri, I think your solution could work. You don't handle the case when earliestDatestamp < date of the oldest record though. On a related note, the latest release will automatically set the earliestDatestamp to the date of the oldest record if no earliestDatestamp is given in the config.

snorri commented 7 years ago

Hi @nics. It also does work for the case when earliestDatestamp < date of oldest record.

In this case the datestamp is unchanged in each OAI record returned to the harvester (not affected), since no datestamp is earlier than the earliestDatestamp.

Then in the queries:

Examples (just years to keep it simple)

earliestDatestamp = 2004 (earlier than the oldest record, R1)

Datestamps of the example records: R1 = 2006, R2 = 2007, R3 = 2008

  1. from=2003, until=2005. CQL: datestamp <= 2005 (the from condition is removed). Result: no records match in the database.

  2. from=2002, until=2003. CQL: [not used or needed]. Result: no records returned (because until is < earliestDatestamp).

  3. from=2005, until=2007. CQL: datestamp >= 2005 AND datestamp <= 2007 (no modifications). Result: R1, R2.

So with the changes I propose, one can either:

  1. Use it as it is now, and for any format change update the datestamp of every record in the database. As long as you make sure the earliestDatestamp is correct (no record has a datestamp earlier than earliestDatestamp), the queries will also yield correct results.

  2. Use the earliestDatestamp as the date of the latest format change and not update all the datestamps in the database for format changes (which I find problematic), but instead update the datestamp of each OAI record on the fly if it is earlier than the earliestDatestamp.

The idea is simply to treat each record in the database that has a datestamp earlier than earliestDatestamp the same as if we had updated its datestamp field to the earliestDatestamp value. It is not that complicated to adjust the queries to work for both methods. The only requirement is that earliestDatestamp is correct, which should be even easier to manage with the change in the newest release.

For the OAI harvesters it will not make any difference.

phochste commented 7 years ago

Ok, this is a step forward, but it is non-trivial enough that the documentation needs to explain when an OAI implementer needs to touch this earliestDatestamp. What are the hints that this procedure is needed? I would like to make the "etc" part of the OAI spec quoted in the question explicit in the documentation.

This date change is needed when:

snorri commented 7 years ago

I think it should probably be updated for all changes, even if a change only affects a few records. It can be very difficult to determine which records are affected by each change, and updating too many datestamps is better than potentially missing a change to a record, right? If it is really important that an unchanged record never gets a new datestamp, then you could keep track of every combination of record and metadata format (with header), and use checksums to see whether they have changed and need an updated datestamp. But I think in most cases you don't change the format, etc., that often, so it wouldn't be worth the effort.
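For reference, the checksum bookkeeping mentioned above could look roughly like this. It is a minimal sketch in Python under several assumptions: the function name `touch_if_changed` and the in-memory `state` dictionary are hypothetical, SHA-256 is an arbitrary choice of checksum, and a real setup would persist the state alongside the records.

```python
import hashlib
from datetime import datetime, timezone


def touch_if_changed(state, record_id, metadata_prefix, serialized_xml, now=None):
    """Return the datestamp to publish for one (record, format) combination.

    `state` maps (record_id, metadata_prefix) to the last seen checksum and
    datestamp; it is updated in place when the serialized output has changed.
    """
    digest = hashlib.sha256(serialized_xml.encode("utf-8")).hexdigest()
    key = (record_id, metadata_prefix)
    previous = state.get(key)
    if previous is not None and previous["checksum"] == digest:
        # Unchanged output keeps its old datestamp.
        return previous["datestamp"]
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    state[key] = {"checksum": digest, "datestamp": stamp}
    return stamp
```

Running this over every record and format after each change would only bump the datestamps whose serialized output actually differs, at the cost of serializing and hashing everything.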