Open snorri opened 7 years ago
It doesn't fix the problem. To be very strict in OAI-PMH you need to update all the datestamps in the database.
The earliestDatestamp trick will only work if the from contains exactly the content of earliestDatestamp. But if the from has a date later than earliestDatestamp, then only recent records are returned, which is wrong. The effect of a format change is on all the records.
On 2017-03-10 08:46, Patrick Hochstenbach wrote:
Does this solve the problem?
As an example, you have records in your database with dates
|2015-01-01 2016-01-01 2017-01-01 |
You set the |earliestDatestamp| to |2016-12-25|. Now when querying:
|from=2016-12-25 |
then all three records are returned (as you explained). The |from| condition is removed, so any datestamp will match the condition.
But, if one requests:
|from=2016-12-31 |
then only the |2017-01-01| record would be returned. Because the |from| date is later than the |earliestDatestamp|, the date will be used in querying the index, which gives the wrong results.
I'm not following you here. The result SHOULD BE just the |2017-01-01| record from the resulting query 'datestamp >= "2016-12-31"'. For the other records the OAI datestamp would be equal to earliestDatestamp (2016-12-25), which is earlier than "2016-12-31", and therefore they should not be in the result.
If earliestDatestamp is set to the date of a format change (2016-12-25), then it indicates that a record with the later datestamp '2017-01-01' has been changed AFTER the format change. The format change resulted in updated datestamps (set to earliestDatestamp) for all records at the time of the format change. You have to set the earliestDatestamp to the time the OAI-PMH service starts serving the format change, and at that point all the records' OAI datestamps have to be updated.
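The example dates above can be worked through in a few lines. This is only a sketch of the interpretation described here, not the plugin's code: `effective_datestamp` and `harvest` are hypothetical helper names, and each record is represented by nothing more than its stored date.

```python
from datetime import date

# Date of the format change, taken from the example above.
EARLIEST_DATESTAMP = date(2016, 12, 25)

# Datestamps as stored in the database (from the example above).
stored = [date(2015, 1, 1), date(2016, 1, 1), date(2017, 1, 1)]


def effective_datestamp(stored_date, earliest=EARLIEST_DATESTAMP):
    """Datestamp the harvester sees: never earlier than earliestDatestamp."""
    return max(stored_date, earliest)


def harvest(from_date):
    """Effective datestamps of the records matching from=<from_date>."""
    return [d for d in map(effective_datestamp, stored) if d >= from_date]


print(harvest(date(2016, 12, 25)))  # all three records
print(harvest(date(2016, 12, 31)))  # only the 2017-01-01 record
```

Under this reading, the two older records carry the datestamp 2016-12-25, so a harvester asking for from=2016-12-31 has already seen them and should not get them again.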
I think to be very correct in OAI-PMH, any metadata format change needs to trigger a reindexing of the database.
As far as input and output of the OAI-PMH service are concerned, it would be the same. Isn't it just an implementation detail how we INTERNALLY store and interpret datestamps in an underlying database?
I guess if a format changes (e.g. going from MODS3 to MODS3.1) it affects all the records in your database, old, new, whatever, if these records are dynamically serialized. Datestamps are what you publicly say about your record. Think of it as a timestamp and a record and a checksum. If the checksum a harvester calculates on the record changes, it must mean that the record was updated. If the server doesn't tell you that in a datestamp, then it is not strictly OAI ok.
Hi @snorri, I think your solution could work. You don't handle the case when earliestDatestamp < date of the oldest record though. On a related note, the latest release will automatically set the earliestDatestamp to the date of the oldest record if no earliestDatestamp is given in the config.
Hi @nics. It also does work for the case when earliestDatestamp < date of oldest record.
In this case the datestamp is unchanged in each OAI record returned to the harvester (not affected), since no datestamp is earlier than the earliestDatestamp.
Then in the queries:
If the from date is earlier than or equal to earliestDatestamp, then the from condition is removed (no records precede the from date since the oldest record is later than earliestDatestamp, and therefore later than the from date).
If the until date is earlier than the earliestDatestamp, then no results should be returned (since the oldest record is later than the earliestDatestamp).
earliestDatestamp = 2004 (earlier than the oldest record R1)
Datestamps for example records: R1 = 2006, R2 = 2007, R3 = 2008
from=2003, until=2005
CQL: datestamp <= 2005
(the from condition is removed)
Result: No records match in the database
from=2002, until=2003
CQL: [not used or needed]
Result: No records returned (because until is < earliestDatestamp)
from=2005, until=2007
CQL: datestamp >=2005 AND datestamp <= 2007
(no modifications)
Result: R1,R2
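The two query rules and the three examples above can be sketched as a small function. This is an illustration, not the plugin's implementation: `build_datestamp_query` is a hypothetical name, years are plain integers to mirror the examples, and returning `None` for "skip the query" is an assumption of this sketch.

```python
def build_datestamp_query(earliest, from_=None, until=None):
    """Return a CQL string, "" for match-all, or None when nothing can match."""
    # until earlier than earliestDatestamp: no record can match, skip the query.
    if until is not None and until < earliest:
        return None
    clauses = []
    # from at or before earliestDatestamp: every record qualifies, drop the clause.
    if from_ is not None and from_ > earliest:
        clauses.append(f"datestamp >= {from_}")
    if until is not None:
        clauses.append(f"datestamp <= {until}")
    return " AND ".join(clauses)


# The three examples above, with earliestDatestamp = 2004.
print(build_datestamp_query(2004, from_=2003, until=2005))
print(build_datestamp_query(2004, from_=2002, until=2003))
print(build_datestamp_query(2004, from_=2005, until=2007))
```

Full ISO 8601 date strings in a single granularity would compare the same way, since they sort lexicographically.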
So with the changes I propose, one can either:
1) Use it as it is now and, for any format change, update the datestamp for every record in the database. As long as you make sure the earliestDatestamp is correct (no record has an earlier datestamp than earliestDatestamp), the queries will also yield correct results.
2) Use the earliestDatestamp as the date of the latest format change and not update all the datestamps in the database for format changes (which I find problematic), but instead update the datestamp for each OAI record on the fly if it is earlier than the earliestDatestamp.
The idea is simply to treat each record in the database that has an earlier datestamp than earliestDatestamp the same as if we had updated the datestamp field to that earliestDatestamp value. It is not that complicated to adjust the queries to work for both methods. The only requirement is actually that earliestDatestamp is correct, which should be even easier to manage with the change in the newest release.
For the OAI harvesters it will not make any difference.
Ok, this is a step forward but non-trivial enough to be explained in the documentation when an OAI implementer needs to touch this earliestDatestamp. What are the hints when this procedure is needed? I would like to make the "etc" part in the original OAI spec reference in the question explicit in the documentation.
This date change is needed when:
I think it probably should be updated for all changes, even if it only affects a few records. It can be very difficult to determine which records are affected by each change. It's better than potentially missing a change for a record, right? If it is really important that an unchanged record never gets a new datestamp then you could keep track of every combination of record and metadata format (with header), and use checksums to see if they have changed and need an updated datestamp. But I think in most cases you don't change the format etc that often so it wouldn't be worth the effort.
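For anyone who does want the checksum route, a minimal sketch of the bookkeeping could look like this. Everything here is assumed for illustration: `datestamp_for` and the in-memory `_seen` dictionary are hypothetical, and a real service would persist the checksums alongside the records.

```python
import hashlib

# (record_id, metadata_prefix) -> (checksum, datestamp); hypothetical in-memory store.
_seen = {}


def datestamp_for(record_id, metadata_prefix, serialized_xml, now):
    """Return the datestamp to publish, bumping it only if the output changed."""
    checksum = hashlib.sha256(serialized_xml.encode("utf-8")).hexdigest()
    key = (record_id, metadata_prefix)
    previous = _seen.get(key)
    if previous is not None and previous[0] == checksum:
        return previous[1]        # unchanged serialization keeps its datestamp
    _seen[key] = (checksum, now)  # new or changed serialization gets a new one
    return now
```

The cost is that every record has to be serialized in every format after each change to find out whether its checksum moved, which is the effort weighed above.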
According to the OAI specification: "A repository must update the datestamp of a record if a change occurs, the result of which would be a change to the metadata part of the XML-encoding of the record. Such changes include, but are not limited to, changes to the metadata of the record, changes to the metadata format of the record, introduction of a new metadata format, termination of support for a metadata format, etc."
So in practice it will often end up that the earliestDatestamp will be the time when one of the formats was last changed.
So let's say we change one of the formats for an OAI service of an existing database. As things are set up now in the OAI plugin, it would require re-indexing all the records in Elasticsearch, since the OAI datestamp field should be updated to correspond to the format change, and the re-indexing needs to be in place before the OAI service can serve the records in the modified format; otherwise the selective harvesting (using from and until) will not work correctly. But there is a dilemma, since we can't reliably set the date of the format change until we make the switch in the OAI service. So we would either have to take the OAI service offline during the indexing, or else risk screwing up the selective harvesting for harvesters accessing the service during this time.
Solution
We can avoid re-indexing for format changes if we instead just update the datestamp in the response, on the fly, for each record when needed. So if the datestamp is earlier than the earliestDatestamp, it is set to earliestDatestamp in the OAI output. The query for the selective harvesting also needs to be adjusted so it still works correctly. The from condition needs to be removed from the CQL query if the from date is equal to or earlier than the earliestDatestamp. If the until date is earlier than the earliestDatestamp, no results should be returned.
This should be easy to implement and also wouldn't break how it is working now, assuming the record datestamps and earliestDatestamp are managed correctly.
Does this sound reasonable?