code4lib / ruby-oai

a Ruby library for building OAI-PMH clients and servers
MIT License

OAI::Client.list_records and resumption tokens #20

Closed atomotic closed 11 years ago

atomotic commented 11 years ago

https://github.com/code4lib/ruby-oai/blob/master/lib/oai/harvester/harvest.rb#L85

Any plans to have OAI::Client.list_records handle resumption tokens automatically?

tjdett commented 11 years ago

It certainly could be extended to handle resumption tokens, but if you're getting resumption tokens you're probably being returned enough records to worry about memory management.

While I don't use it, my understanding of that harvester section is that it reads each record and sends it straight to gzip, which is piped to a file. The response objects (each containing a page's worth of records) are garbage collected as it goes, so only one should ever exist at a time. Against a server exposing 50K+ records, this isn't a trivial optimization.
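
A rough sketch of that page-at-a-time pattern, for illustration only. This is not the harvester's actual code: `each_record_paged` and `harvest_to_gzip` are hypothetical helper names, and `record.to_s` is a stand-in for whatever serialization you need. It assumes only that the client's `list_records` accepts a `:resumption_token` option and that each response is enumerable and exposes `resumption_token`.

```ruby
require 'zlib'

# Hypothetical helper: walks every page of a ListRecords result and yields
# one record at a time. Only the current page's response is referenced, so
# earlier pages can be garbage collected.
def each_record_paged(client, opts = {})
  return enum_for(:each_record_paged, client, opts) unless block_given?

  response = client.list_records(opts)
  loop do
    response.each { |record| yield record }
    token = response.resumption_token
    # A missing or empty token means the server has no more pages.
    break if token.nil? || token.empty?
    response = client.list_records(resumption_token: token)
  end
end

# Hypothetical helper: streams each record straight into a gzip stream
# wrapped around +io+ and returns the number of records written.
def harvest_to_gzip(client, io, opts = {})
  count = 0
  gz = Zlib::GzipWriter.new(io)
  each_record_paged(client, opts) do |record|
    gz.puts(record.to_s) # assumption: to_s is good enough for your records
    count += 1
  end
  gz.finish # writes the gzip trailer without closing io
  count
end
```

With a real client this would be called along the lines of `File.open('records.gz', 'wb') { |f| harvest_to_gzip(client, f) }`, where the file name is just an example.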

If list_records.each() seamlessly handled resumption tokens, then response.doc would have to either:

On top of that, the harvester doesn't use list_records.each() anyway, because it wants the <record/> node rather than its children (as provided by OAI::Record).

It's past time there was an easier way to handle this, so let me have a look at it. (It probably won't change the harvester, though.)

tjdett commented 11 years ago

See #23, which lists records, identifiers and sets in full by following resumption tokens automatically, e.g.:

# Get the number of records in full
client.list_identifiers.full.count

# Get the number of sets in full
client.list_sets.full.count

client.list_records.full.each do |record|
  # Do something with records
end

# Get all the deleted records
client.list_records.full.select { |r| r.deleted? }

atomotic commented 11 years ago

Good job, thank you. I'll test it right now.