code4lib / ruby-oai

a Ruby library for building OAI-PMH clients and servers
MIT License
62 stars 43 forks source link

non-integer identifiers work in ResumptionToken #79

Closed jrochkind closed 4 years ago

jrochkind commented 4 years ago

…and with ActiveRecordWrapper resumption pagination

Context

The OAI-PMH spec doesn't say anything about the structure of a resumptionToken it's just an opaque identifier to the client.

This gem provides for a particular format of resumptionToken, where info is embedded in it to allow (in some cases) for stateless resumption/pagination. One component of this info, which happens to be the "last" item in the String, is something identifying the "last" record in the previous page, and is called by method/init arg last in the code.

It could be a pagination numerical offset value -- the built-in ActiveRecordCachingWrapper provider uses it thusly. (I'm not sure if anyone actually uses this provider, or if it currently works).

Or it could be an identifier/primary key -- the built-in ActiveRecordWrapper provider does it this way. (I'm not sure if anyone other than me is currently using this one). This relies on records being returned in sort-by-identifier/pk order, so the identifier/pk of the last record from the last page tells you where to start the next page.

Dependent code not in this gem but using it might do either of those things with the last component in a ResumptionToken. The blacklight_oai_provider seems to currently use it as a numeric offset, but might be able to have a more efficient implementation using it as a solr uniqueKey identifier (solr is known not that efficient at deep-offset pagination, as are most rdbms).

Problem

The existing code before this PR assumed this last component is an integer. The code for parsing it out of a serialized form assumed it was regexp /d+, and lots of other code just assumed it was an integer.

This is not a valid assumption when it is an identifier/pk. Even if strictly a PK with the ActiveRecordWrapper -- these days, we have UUID pks. The feature added in #76 -- where even with ActiveRecordWrapper you might be using a non-pk unique column for identifier -- increases the space where this assumption can fail. (I ran into this in our local app, we thought we had oai-pmh working, but it turns out pagination was not working, which is why I returned to it).

If blacklight_oai_provider wanted to use Solr UniqueKey's here instead of offsets -- it couldn't, as people's Solr UniqueKey's are not generally strictly integers.

So we want to make the last component be supported as any string not just an integer. That's what's done here, sort of.

The solution

First I just tried changing it so it didn't assume last was an integer. But that broke lots of tests, especially around the ActiveRecordCachingWrapper (which I'm not sure if it truly works anyway or if anyone is using it, but anyway).

And it was really hard to figure out what to change to get it working again, and it was lots of lines of changes -- this is very old, kind of creaky, not always well documented code, using ruby conventions from a decade ago (and the tests are also kind of hard to work with for me). And there are undocumented conventions, like passing 0 (or now "0" as the last component seems to have a special meaning internally maybe meaning last page?)

So I took another stab at it, where the #last is still an integer, but an additional last_str is available with a string form. And I fixed ActiveRecordWrapper so it pagination works again with non-numeric identifiers/pks.

I left copious comments in the code, hopefully making it less mysterious for whoever comes next, because this code is a bit difficult, and I realize my appraoch here is sort of hacky. But I think it's the best for backwards compatibility (including with any dependent code that may assume #last is an integer), and for the purpose of changing as few lines as possible when we have code that we don't understand well, is difficult, and doesn't have a lot of hours for maintenance dedicated to it.

jrochkind commented 4 years ago

Does anyone want to review this? @bess @cbeer @dkinzer @bibliotechy @jamesprior?

Otherewise I will rely on my same-employer colleague @eddierubeiz

eddierubeiz commented 4 years ago

I'm going to merge this, as we need this fix. PA Digital is going to be harvesting our metadata in a couple of weeks (see https://github.com/sciencehistory/scihist_digicoll/issues/572 ) and, well, our IDs aren't integers.