calvez / xcoaitoolkit

Automatically exported from code.google.com/p/xcoaitoolkit

Resumption Token Idempotency Not Preserved #80

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
From: http://www.openarchives.org/OAI/openarchivesprotocol.html#Idempotency :

"When there are changes in the repository. There may be changes to the complete 
list returned by the list request sequence. These changes occur when the 
records disseminated in the list move in or out of the datestamp range of the 
request because of changes, modifications, or deletions in the repository. In 
this case, strict idempotency of the incomplete-list requests using 
resumptionToken values is not required. Instead, the incomplete list returned 
in response to a re-issued request must include all records with unchanged 
datestamps within the range of the initial list request. The incomplete list 
returned in response to a re-issued request may contain records with datestamps 
that either moved into or out of the range of the initial request. In cases 
where there are substantial changes to the repository, it may be appropriate 
for a repository to return a badResumptionToken error, signaling that the 
harvester should restart the list request sequence."

I found that when I added a new record into the repo *after* I had previously 
run a ListRecords request, the new record was added to the completed list of 
records (and the total count increased by one, too).  

There seem to be two options for handling this scenario (both mentioned in 
the above quote):

1. Do not include any records that were modified after the initial 
ListRecords request.

2. Issue a badResumptionToken error prompting the requester to re-issue the 
ListRecords request.

I'm leaning towards implementing 1) since internally we'll need to store the 
datestamp of the initial request anyway.  We might as well honor the rest of 
the request instead of throwing up an exception.  However, I do notice that the 
response on resumptionToken requests returns a datestamp of the current time.  
Shouldn't it instead return the time at which the ListRecords was first issued?
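
A minimal sketch of option 1, assuming the server stores the time of the initial ListRecords request alongside the resumption token. The `Record` type, the `visible_records` helper, and the sample data are hypothetical and are not taken from the toolkit's code:

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Record:
    oai_id: int
    datestamp: datetime  # last modification time, in UTC


def visible_records(records, initial_request_time):
    """Option 1: hide anything modified after the first ListRecords request."""
    return [r for r in records if r.datestamp <= initial_request_time]


# Time of the initial ListRecords request (assumed to be stored with the token).
t0 = datetime(2011, 4, 27, 12, 0, tzinfo=timezone.utc)

repo = [
    Record(1, datetime(2011, 4, 1, tzinfo=timezone.utc)),
    Record(2, datetime(2011, 4, 20, tzinfo=timezone.utc)),
    Record(3, datetime(2011, 4, 27, 12, 5, tzinfo=timezone.utc)),  # added after t0
]

snapshot_ids = [r.oai_id for r in visible_records(repo, t0)]
```

With this filter, record 3 is left out of every batch of the sequence and the total count stays at its original value.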

Until this issue is fixed, I would recommend that the OAI server be shut down 
during indexing.  AFAIK, that's the only way to guarantee idempotency.

Original issue reported on code.google.com by cede...@uillinois.edu on 27 Apr 2011 at 5:03

GoogleCodeExporter commented 9 years ago
[deleted comment]

GoogleCodeExporter commented 9 years ago
I'm now leaning towards option 2.

I now worry about the result order.  That’s because the following (taken from 
the MST implementation: 
http://code.google.com/p/xcmetadataservicestoolkit/wiki/OaiPmhImpl ) is not 
true in the case of the OAI toolkit: “If records are added to the provider 
repo before all the records are harvested, it doesn't matter because the list 
was generated when the provider received the request.”  The Lucene index 
implementation (which is recommended over the MySQL implementation) does not 
currently generate the list all at once; instead, it does so one response chunk 
at a time.  Since the result list generation is not atomic, I cannot guarantee 
that no records will get missed.  Therefore, given the current design, I think 
it’s best to implement option 2, which is to throw a badResumptionToken if the 
current result set becomes stale.  This option is (much) easier to implement 
and is also in theory more “idempotent” than option 1 (though I admit option 1 
seems like the more practical approach, since I’m guessing harvesters are 
mostly interested in simply being assured they are getting ALL requested 
information, i.e., no “holes” between subsequent requests).  If we go with 
option 2, harvesters will probably want to be strategic when scheduling their 
harvests (i.e., harvest when you believe nightly indexing is not in progress).
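
A minimal sketch of option 2, assuming the repository keeps a change counter that is bumped on every write and that the counter value is embedded in the token. The `Repo` class, the `version` counter, and the `version|offset` token layout are hypothetical illustrations, not the toolkit's actual design:

```python
class BadResumptionToken(Exception):
    """Stands in for the OAI-PMH badResumptionToken error condition."""


class Repo:
    def __init__(self):
        self.version = 0   # hypothetical change counter, bumped on every write
        self.records = []

    def add(self, record):
        self.records.append(record)
        self.version += 1


def issue_token(repo, offset):
    # Remember which repository state the result set was computed against.
    return f"{repo.version}|{offset}"


def resume(repo, token, batch_size=2):
    version, offset = (int(part) for part in token.split("|"))
    if version != repo.version:
        # The result set is stale: ask the harvester to restart the sequence.
        raise BadResumptionToken("repository changed; restart the list request")
    return repo.records[offset:offset + batch_size]


repo = Repo()
for rec in ["a", "b", "c", "d"]:
    repo.add(rec)

token = issue_token(repo, offset=2)
page = resume(repo, token)   # served normally: nothing changed since the token was issued
repo.add("e")                # indexing happens mid-harvest; the token is now stale
```

Calling `resume(repo, token)` again after the last line raises `BadResumptionToken`, which is the behaviour a harvester would have to handle by restarting.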

Original comment by cede...@uillinois.edu on 28 Apr 2011 at 8:44

GoogleCodeExporter commented 9 years ago
I think I have a solution that does not require substantial rework.

Currently: The OAI Toolkit executes the same query for each batch of records 
it serves up, and it REQUIRES/ASSUMES that the "found" set of records will 
never change.  It uses a resumption token that carries an offset counter (e.g. 
start at record 3000 of this "found" set of records).  This means that it 
needs to find the entire "found" set each time a batch is served.  If the 
"found" set of records were to change, which would happen as a result of 
updates and deletes, the offset counter would become meaningless and gaps or 
holes would be created.
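
A small illustration of the problem described above, using a plain list of record IDs in place of the real index. The `page_by_offset` helper is hypothetical:

```python
def page_by_offset(found_set, offset, batch_size):
    """Re-run the 'query' and slice it at the offset carried by the token."""
    return found_set[offset:offset + batch_size]


ids = [1, 2, 3, 4, 5, 6]

first = page_by_offset(ids, 0, 3)    # batch 1; the token now says "offset 3"
ids.remove(2)                        # a record is deleted between requests
second = page_by_offset(ids, 3, 3)   # batch 2 is sliced from the shifted list

served = first + second
missed = sorted(set(ids) - set(served))
```

After the deletion every later record shifts down by one position, so record 4 falls before the offset and is never served.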

Proposal: I would like to borrow concepts from the MST.  The MST creates a 
resumption token that has more/different parameters in it and it then uses 
these parameters to execute the matching process. The format is as follows:

from|until|set|metadataPrefix|startingId

Note:  startingId is defined to be the highest record ID in the batch of 
records that has just been served up.  The next batch of records would be 
generated by looking for matches starting at "startingId + 1" (the next 
record).  Note that you would have to assume default from and until dates even 
if none were provided.

Chris, if you were to change the resumption token format to this, and execute 
your match process based on this, I think this will get us where we need to be.  
So instead of finding the entire set of matches each time and then using the 
offset to find your starting point for your next batch of records to serve up, 
you would simply use the resumption token parameters to find the next batch of 
records that matched the parameters.

Note:  Records changed during the harvest, which therefore have a datestamp 
later than the "until" date, would not get served up in this harvest request, 
but would be in the next harvest.  This assumes that the requestor sends a 
subsequent request with a "from" date equal to the previous "until" date.  The 
MST does this.  The MST does some additional magic to address this use case 
(records being updated during a harvest), but this would require more work and 
I don't think it is necessary at this point.
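
A minimal sketch of the proposed token and matching process, assuming numeric record IDs and UTC datestamps kept as same-format ISO 8601 strings (so they compare correctly as text). The `Token` type and the `encode`, `decode`, and `next_batch` helpers are hypothetical; `set` and `metadataPrefix` are carried in the token but not used for filtering in this sketch:

```python
from typing import NamedTuple


class Token(NamedTuple):
    from_: str
    until: str
    set_spec: str
    metadata_prefix: str
    starting_id: int   # highest record ID already served


def encode(token):
    return "|".join(str(field) for field in token)


def decode(text):
    from_, until, set_spec, prefix, starting_id = text.split("|")
    return Token(from_, until, set_spec, prefix, int(starting_id))


def next_batch(records, token, batch_size):
    """records: iterable of (record_id, datestamp) tuples."""
    matches = sorted(
        (rid, ds) for rid, ds in records
        if rid > token.starting_id and token.from_ <= ds <= token.until
    )
    batch = matches[:batch_size]
    if len(matches) <= batch_size:
        return batch, None            # list complete: no further token
    return batch, token._replace(starting_id=batch[-1][0])


records = [(i, "2011-04-0%dT00:00:00Z" % i) for i in range(1, 6)]
token = Token("2011-01-01T00:00:00Z", "2011-04-30T23:59:59Z", "", "oai_dc", 0)

served = []
while token is not None:
    batch, token = next_batch(records, decode(encode(token)), 2)
    served.extend(rid for rid, _ in batch)
    records = [r for r in records if r[0] != 1]   # record 1 is deleted mid-harvest
```

Because each batch is located by record ID rather than by position, deleting an already-served record does not shift the remaining records, and no holes appear.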

Original comment by rc...@library.rochester.edu on 3 May 2011 at 7:22

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r145.

Original comment by cede...@uillinois.edu on 10 May 2011 at 1:39

GoogleCodeExporter commented 9 years ago
This issue was solved by following the proposal in Comment 3 with a few 
exceptions: 

1) We keep track of all modification times
2) The resumptionToken is tokenID|lastOaiIdRead|totalRecordCount (the from, 
until, set parameters are stored internally in mysql with tokenID as a handle)

FYI, the reason we pass the totalRecordCount in the resumptionToken is that 
this value never changes, so there is no reason to recalculate it for each 
batch (which is expensive).
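
A minimal sketch of this token layout, assuming numeric values for all three fields and using an in-memory dict in place of the mysql table. The function names and the dict-based store are hypothetical and are not taken from r145:

```python
import itertools

_token_ids = itertools.count(1)
token_store = {}   # stands in for the mysql table keyed by tokenID


def start_list(from_, until, set_spec):
    """Store the request parameters server-side and return a handle to them."""
    token_id = next(_token_ids)
    token_store[token_id] = {"from": from_, "until": until, "set": set_spec}
    return token_id


def make_token(token_id, last_oai_id_read, total_record_count):
    return f"{token_id}|{last_oai_id_read}|{total_record_count}"


def parse_token(token):
    token_id, last_id, total = (int(part) for part in token.split("|"))
    if token_id not in token_store:
        raise KeyError("unknown tokenID")
    # The total is read back from the token rather than recounted.
    return token_store[token_id], last_id, total


tid = start_list("2011-01-01T00:00:00Z", "2011-04-30T23:59:59Z", "")
tok = make_token(tid, last_oai_id_read=3000, total_record_count=125000)
params, last_id, total = parse_token(tok)
```

Keeping from, until, and set behind the tokenID keeps the token short, at the cost of needing the server-side entry to still exist when the harvester resumes.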

Original comment by cede...@uillinois.edu on 10 May 2011 at 1:48