Closed GoogleCodeExporter closed 9 years ago
[deleted comment]
I'm now leaning towards option 2.
I now worry about the result order. That’s because the following (taken from
the MST implementation:
http://code.google.com/p/xcmetadataservicestoolkit/wiki/OaiPmhImpl ) is not
true in the case of the OAI toolkit: “If records are added to the provider
repo before all the records are harvested, it doesn't matter because the list
was generated when the provider received the request.” The lucene index
implementation (which is recommended over the mysql implementation) does not
currently generate the list at once; instead, it does so one response chunk at
a time. Since the result list generation is not atomic, I cannot guarantee that
no records will get missed. Therefore, given the current design implications, I
think it’s best to implement option 2, which is to throw a badResumptionToken
error if the current result set becomes stale. This option is (much)
easier to implement and is also in theory more “idempotent” than option 1
(though I admit option 1 seems like the more practical approach, since I’m
guessing harvesters are mostly interested in simply being assured they are
getting ALL requested information, i.e., no “holes” between subsequent
requests). By going with option 2, harvesters will probably want to be
strategic when scheduling their harvests (i.e., harvest when you
believe nightly indexing is not in progress).
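To make option 2 concrete, here is a rough sketch in Python (not the toolkit's actual code). It assumes a hypothetical "index version" counter that is bumped whenever the index changes and that is remembered when a token is issued; the names BadResumptionToken and check_not_stale are made up for illustration.

```python
class BadResumptionToken(Exception):
    """Stands in for the OAI-PMH badResumptionToken error condition."""
    code = "badResumptionToken"


def check_not_stale(token_index_version, current_index_version):
    # Hypothetical rule: any index change since the token was issued
    # makes the result set stale, so the harvest must be restarted.
    if token_index_version != current_index_version:
        raise BadResumptionToken(
            "result set changed since the token was issued; restart the harvest"
        )


check_not_stale(7, 7)        # index unchanged: harvest may continue
try:
    check_not_stale(7, 8)    # index updated mid-harvest
    stale_detected = False
except BadResumptionToken:
    stale_detected = True
```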
Original comment by cede...@uillinois.edu
on 28 Apr 2011 at 8:44
I think I have a solution that does not require substantial rework.
Currently: the OAI Toolkit executes the same query for each batch of records
it serves up, and it REQUIRES/ASSUMES that the "found" set of records will
never change. It uses a resumption token that gives an "offset counter" (e.g.
start at record 3000 from this "found" set of records). This means that it
needs to find the entire "found" set each time a batch is served. If the
"found" set of records were to change, which would happen as a result of
updates and deletes, the offset counter would become meaningless and gaps or
holes would be created.
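A toy illustration of the hole, in Python (a simplified model, not the toolkit's query code). Record ids are assumed to be plain integers and serve_batch_by_offset is a made-up helper that re-runs the "query" and then skips to the offset.

```python
def serve_batch_by_offset(repo, offset, batch_size):
    """Re-run the query (here: sort all ids), then skip `offset` records."""
    found = sorted(repo)
    return found[offset:offset + batch_size]


repo = {1, 2, 3, 4, 5, 6}
first = serve_batch_by_offset(repo, 0, 3)    # token now says "offset 3"
repo.discard(2)                              # a record is deleted mid-harvest
second = serve_batch_by_offset(repo, 3, 3)   # offset 3 now points past record 4

harvested = first + second
missing = sorted(repo - set(harvested))      # live records never served
```

After the delete, the "found" set shifts left by one, so the harvester never sees record 4 even though it still exists.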
Proposal: I would like to borrow concepts from the MST. The MST creates a
resumption token that has more/different parameters in it and it then uses
these parameters to execute the matching process. The format is as follows:
from|until|set|metadataPrefix|startingId
Note: startingId is defined to be the highest record id in the batch of
records that has just been served up. The next batch of records would be
generated by looking for matches starting at "startingId + 1" (the next
record). Note that you would have to assume/imply from and until dates even if
none were provided.
Chris, if you were to change the resumption token format to this, and execute
your match process based on this, I think this will get us where we need to be.
So instead of finding the entire set of matches each time and then using the
offset to find your starting point for your next batch of records to serve up,
you would simply use the resumption token parameters to find the next batch of
records that matched the parameters.
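A minimal sketch of that idea in Python, under stated assumptions: records are plain dicts with integer ids, dates are ISO date strings of the same granularity (so string comparison works), no token field contains "|", and the names Token, encode_token, decode_token and next_batch are invented here. It is not the MST's or the toolkit's implementation.

```python
from typing import NamedTuple


class Token(NamedTuple):
    from_: str
    until: str
    set_spec: str
    metadata_prefix: str
    starting_id: int          # highest record id already served


def encode_token(t):
    return "|".join([t.from_, t.until, t.set_spec, t.metadata_prefix,
                     str(t.starting_id)])


def decode_token(s):
    from_, until, set_spec, prefix, starting_id = s.split("|")
    return Token(from_, until, set_spec, prefix, int(starting_id))


def next_batch(records, token, batch_size):
    """Return (batch, next_token); next_token is None when the list is done."""
    matches = [r for r in sorted(records, key=lambda r: r["id"])
               if r["id"] > token.starting_id            # i.e. startingId + 1 onward
               and token.from_ <= r["datestamp"] <= token.until
               and (not token.set_spec or r["set"] == token.set_spec)]
    batch = matches[:batch_size]
    more = len(matches) > batch_size
    return batch, (token._replace(starting_id=batch[-1]["id"]) if more else None)


records = [{"id": i, "datestamp": "2011-04-01", "set": "a"} for i in range(1, 7)]
start = Token("2011-01-01", "2011-04-30", "a", "oai_dc", 0)

batch1, token1 = next_batch(records, start, 3)
records = [r for r in records if r["id"] != 2]   # record deleted mid-harvest
batch2, token2 = next_batch(records, decode_token(encode_token(token1)), 3)
```

Because the second request filters on the id rather than counting an offset, the delete of record 2 does not shift the starting point and records 4 to 6 are still served.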
Note: Records changed during the harvest, which therefore carry a date newer
than the "until" date, would not get served up in this harvest request, but
would in the next harvest. This assumes that the requester sends a subsequent
request with a "from" date equal to the previous "until" date. The MST does
this. The MST does some additional magic so that it addresses this use case
(of records being updated during a harvest), but this would require more work
and I don't think it is necessary at this point.
Original comment by rc...@library.rochester.edu
on 3 May 2011 at 7:22
This issue was closed by revision r145.
Original comment by cede...@uillinois.edu
on 10 May 2011 at 1:39
This issue was solved by following the proposal in Comment 3 with a few
exceptions:
1) We keep track of all modification times
2) The resumptionToken is tokenID|lastOaiIdRead|totalRecordCount (the from,
until, set parameters are stored internally in mysql with tokenID as a handle)
FYI, the reason we pass the totalRecordCount in the resumptionToken is that
this value never changes, so there is no reason to recalculate it (which is
expensive).
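For readers who want the shape of that token, a rough Python sketch follows. It is not the code from r145: a dict stands in for the mysql table, ids are assumed to be integers, and issue_token / resolve_token are hypothetical names.

```python
import itertools

_token_ids = itertools.count(1)
stored_requests = {}   # stand-in for the mysql table keyed by tokenID


def issue_token(params, last_oai_id_read, total_record_count, token_id=None):
    """Build tokenID|lastOaiIdRead|totalRecordCount, storing params on first use."""
    if token_id is None:
        token_id = next(_token_ids)
        stored_requests[token_id] = params   # from, until, set stay server-side
    return f"{token_id}|{last_oai_id_read}|{total_record_count}"


def resolve_token(token):
    """Return (stored params, lastOaiIdRead, totalRecordCount) for a token."""
    token_id, last_id, total = (int(part) for part in token.split("|"))
    if token_id not in stored_requests:
        raise ValueError("badResumptionToken")
    return stored_requests[token_id], last_id, total


request = {"from": "2011-01-01", "until": "2011-04-30", "set": "a"}
token = issue_token(request, 3000, 12000)
params, last_id, total = resolve_token(token)
```

The total is simply carried along in the token and read back, so it is computed once per harvest rather than once per batch.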
Original comment by cede...@uillinois.edu
on 10 May 2011 at 1:48
Original issue reported on code.google.com by
cede...@uillinois.edu
on 27 Apr 2011 at 5:03