clarin-eric / oai-harvest-manager

A simple Java application for managing an OAI-PMH harvesting workflow

request multiple records from the provider #4

Closed menzowindhouwer closed 9 years ago

menzowindhouwer commented 9 years ago

The harvester currently uses GetRecord to request records one by one from the provider. Using ListRecords instead might reduce the network traffic with the provider.

kjvandelooij commented 9 years ago

Handling of resumption tokens is already in place, but it only applies to identifiers. A filter blocks records marked as 'deleted' by the provider: marked records are not added to the list of records to be retrieved. The ListRecords verb, by contrast, cannot filter out deleted records. If we used resumption with ListRecords instead of ListIdentifiers, we would effectively receive all the records in the repository that meet the requirements on date, set specification and metadata prefix, including the deleted ones. Before performing an action on a record, such as storing it or transforming it to another format, the harvest manager could check the status of the record and skip it if it is marked as 'deleted'.
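
A minimal sketch of the status check described above, assuming the ListRecords response is available as a string and is parsed with the JDK's DOM API. The class and method names (`DeletedRecordFilter`, `activeIdentifiers`) are hypothetical and not taken from the harvest manager's code; the `status="deleted"` header attribute and the namespace come from the OAI-PMH 2.0 protocol.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the harvest manager.
public class DeletedRecordFilter {
    static final String OAI_NS = "http://www.openarchives.org/OAI/2.0/";

    /** Returns the identifiers of all records whose header is not marked status="deleted". */
    public static List<String> activeIdentifiers(String responseXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(responseXml.getBytes(StandardCharsets.UTF_8)));

            NodeList records = doc.getElementsByTagNameNS(OAI_NS, "record");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < records.getLength(); i++) {
                Element record = (Element) records.item(i);
                // The header is the first child of a record, so it is the first match.
                Element header = (Element) record.getElementsByTagNameNS(OAI_NS, "header").item(0);
                if ("deleted".equals(header.getAttribute("status"))) {
                    continue; // deleted records carry a header only; skip them
                }
                Element id = (Element) header.getElementsByTagNameNS(OAI_NS, "identifier").item(0);
                result.add(id.getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            // Parsing problems are wrapped so that callers need no checked exceptions.
            throw new IllegalStateException("Could not parse ListRecords response", e);
        }
    }
}
```

In a real harvest the same check would run before storing or transforming each record, rather than only collecting identifiers.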

Note: since deleted records do not have a metadata element, the amount of data downloaded would be the same. In both cases we receive only the header for each deleted record: currently in the list of identifiers, and with ListRecords in the list of records.

Note: in general, asking for identifiers first is less efficient, not only because there is one request per record, but also because part of the record, the header, is transmitted twice: the first time as part of the list of identifiers, the second time in response to the GetRecord request.

Note: instead of keeping each record in an envelope, we would keep all the records received before the resumption token in one envelope. Stripping this envelope would create the separate records we are used to.
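
Stripping the envelope could look roughly like the sketch below. It assumes the whole ListRecords response is held in memory as a string and uses only the JDK's DOM and identity transformer; `EnvelopeSplitter` and `splitRecords` are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the harvest manager.
public class EnvelopeSplitter {
    static final String OAI_NS = "http://www.openarchives.org/OAI/2.0/";

    /** Serializes every record element of a ListRecords envelope to its own XML string. */
    public static List<String> splitRecords(String envelopeXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(envelopeXml)));

            // Identity transformer: copies a DOM subtree to text unchanged.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            NodeList records = doc.getElementsByTagNameNS(OAI_NS, "record");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < records.getLength(); i++) {
                StringWriter out = new StringWriter();
                serializer.transform(new DOMSource(records.item(i)), new StreamResult(out));
                result.add(out.toString());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not split ListRecords envelope", e);
        }
    }
}
```

For very large responses a streaming parser would avoid holding the full envelope in memory; the DOM version is used here only because it is the shortest to read.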

Note: the endpoint determines the number of records provided between resumption tokens. This means that it indirectly determines the number of records returned per HTTP request. The harvest manager cannot influence this. The Huygens endpoint, for example, only sends ten identifiers before sending a resumption token.
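
The resumption loop for ListRecords can be sketched as follows. This is an illustration under assumptions, not the harvest manager's implementation: `ListRecordsPager` and its methods are hypothetical, and the HTTP call is passed in as a `Function` from URL to response body so that the page size stays entirely with the endpoint. Per OAI-PMH, a follow-up request carries only the verb and the resumption token, and an empty or missing token ends the list.

```java
import java.io.StringReader;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the harvest manager.
public class ListRecordsPager {
    static final String OAI_NS = "http://www.openarchives.org/OAI/2.0/";

    /** First request: verb plus metadata prefix (date and set arguments omitted here). */
    public static String firstRequest(String baseUrl, String metadataPrefix) {
        return baseUrl + "?verb=ListRecords&metadataPrefix="
                + URLEncoder.encode(metadataPrefix, StandardCharsets.UTF_8);
    }

    /** Follow-up request: the resumption token is the only argument besides the verb. */
    public static String nextRequest(String baseUrl, String token) {
        return baseUrl + "?verb=ListRecords&resumptionToken="
                + URLEncoder.encode(token, StandardCharsets.UTF_8);
    }

    /** Returns the resumption token of a response, or null if it is absent or empty. */
    public static String resumptionToken(String responseXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(responseXml)));
            NodeList tokens = doc.getElementsByTagNameNS(OAI_NS, "resumptionToken");
            if (tokens.getLength() == 0) {
                return null;
            }
            String token = tokens.item(0).getTextContent().trim();
            return token.isEmpty() ? null : token;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse response", e);
        }
    }

    /** Fetches pages until the endpoint stops handing out resumption tokens. */
    public static List<String> harvest(String baseUrl, String metadataPrefix,
                                       Function<String, String> fetch) {
        List<String> pages = new ArrayList<>();
        String url = firstRequest(baseUrl, metadataPrefix);
        while (url != null) {
            String page = fetch.apply(url);
            pages.add(page);
            String token = resumptionToken(page);
            url = (token == null) ? null : nextRequest(baseUrl, token);
        }
        return pages;
    }
}
```

With this shape, the number of HTTP requests equals the number of pages the endpoint chooses to send, instead of one request per record.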

So, yes, we can support the ListRecords verb, but I am not sure whether it will be more efficient; we would need to try it out.

Suggestion: I could experiment with a version of the harvest manager supporting the ListRecords verb.

kjvandelooij commented 9 years ago

The next release of the harvest manager resolves this issue.