OpenWIS / openwis4


45. v4 SYS - MM2 Limit on OAI-PMH metadata harvest response #26

Closed rogers492 closed 6 years ago

rogers492 commented 7 years ago

This task was part of the OpenWISv4/GeoNetworksv3 work package assigned to GeoCat in 2015. The documentation produced as part of that work describes the new functionality delivered.

The outcome for this feature was:

2015 status: MO Test; Percent complete: 90; Priority: Normal; Questions outstanding: nil;

Extra info: nil.

rogers492 commented 7 years ago

Update - scrum - 2016-11-17: GT - Migrated into new v4.0 as part of Kanban 81. v4 - Refactoring to improve modularity.

tg4444 commented 7 years ago

The "Maximum records" parameter does exist in the Admin settings, but I'm not sure how to test if it's actually applied. Should I expect to see the harvesting stop when the "Maximum records" is reached? Or should I expect to see the record count increment in steps of "Maximum records"? Or something completely different than the above?

I have tested with values of "10" and "500", and got similar speed, and more than 10/500 entries in both cases. Also, there was nothing in the log/UI that would indicate that the entries were retrieved in batches of 10/500.

Any feedback would be appreciated.

woollattd commented 7 years ago

I think, looking at OpenWIS v3.14 as a reference, that harvesting is expected to be carried out in batches of a configurable size: each batch is harvested and then indexed, followed by the next batch, until the full number of records is reached. In the case of v3.14 on our system it's set at 100 records. Afraid I've no idea how it might be set and observed in GeoNetwork. In 3.14 the logs on the Admin portal clearly show harvesting occurring in batches of 100.

This may or may not be the best way for GeoNetwork 3 to handle metadata. I'm not sure when indexing occurs.

There is a little info on OAI-PMH groups of records here, under 'Resumption Token Timeout', mentioning 10 records. Maybe this has something to do with how GeoNetwork 3 handles this? http://geonetwork-opensource.org/manuals/trunk/eng/users/administrator-guide/configuring-the-catalog/system-configuration.html#open-archive-initiative-oai-pmh-provider

ywang-bom commented 7 years ago

As I understand it, the OAI-PMH "max records" setting is for the provider, not the client. GISC Melbourne has this set to 100, which means that any other centres trying to harvest metadata from us receive records in batches of 100. On the client side, I don't recall any setting that controls the batch size. I hope this is relevant to the discussion. Otherwise please disregard.

Regards, Yang


woollattd commented 7 years ago

Ahh, that could be what it is. Makes sense. I don't think it's used for speed, more to do with reliability during a harvest.

tg4444 commented 7 years ago

Below are my findings on how the harvesting process works, in terms of the OAI/PMH limit:

The process is performed in two steps. First, a list of headers is returned from the OAI/PMH server. Once the headers are returned, the actual metadata are fetched one-by-one, for each header, and the database entries are created/updated.

The limit itself is used in the first step only, and in the following manner. Suppose there are 100 result headers and our limit is 60. The OW4 server will correctly identify this and:

  1. Return the first 60 headers in the response
  2. Add a resumption token
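
The two steps above can be sketched roughly as follows. This is only an illustration of provider-side paging, not the OW4 or GeoNetwork code: the function name `list_identifiers_page` is hypothetical, and using a plain offset as the resumption token is an assumption made for brevity (real providers may use any opaque token).

```python
from typing import List, Optional, Tuple


def list_identifiers_page(
    headers: List[str],
    max_records: int,
    resumption_token: Optional[str] = None,
) -> Tuple[List[str], Optional[str]]:
    """Return one page of headers plus a token if more headers remain.

    Assumption: the token is simply the offset of the next header,
    encoded as a string. Real providers usually issue opaque tokens.
    """
    offset = int(resumption_token) if resumption_token else 0
    page = headers[offset:offset + max_records]
    next_offset = offset + len(page)
    # Only hand out a token when the list is incomplete.
    next_token = str(next_offset) if next_offset < len(headers) else None
    return page, next_token


# 100 result headers, limit of 60, as in the example above.
all_headers = [f"oai:example:{i}" for i in range(100)]
first_page, token = list_identifiers_page(all_headers, 60)
second_page, final_token = list_identifiers_page(all_headers, 60, token)
print(len(first_page), token, len(second_page), final_token)
```

With these inputs the first response carries 60 headers and a token, and the second carries the remaining 40 headers and no token.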

On the client side, the code iterates the results, and when the end is reached - if there is a resumption token - it will automatically perform another request to fetch the remaining headers.

As you can understand, this means that the limit is only applied to the initial part of the process, which usually lasts some seconds, or a couple of minutes at most. From then on, all headers have already been fetched, so the client iterates through them and retrieves the associated metadata from the OAI/PMH server one record at a time.
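
To make the client-side behaviour concrete, here is a minimal sketch of the two-step flow described above, run against an in-memory stand-in for the provider. All names (`harvest`, `fake_list_identifiers`, `fake_get_record`, `LIMIT`) are hypothetical and the offset-style token is an assumption; this is not the actual harvester code.

```python
from typing import Callable, Dict, List, Optional, Tuple

Page = Tuple[List[str], Optional[str]]


def harvest(
    list_identifiers: Callable[[Optional[str]], Page],
    get_record: Callable[[str], str],
) -> Dict[str, str]:
    # Step 1: collect every header, following resumption tokens.
    identifiers: List[str] = []
    token: Optional[str] = None
    while True:
        page, token = list_identifiers(token)
        identifiers.extend(page)
        if token is None:
            break
    # Step 2: fetch the metadata one by one; the limit plays no part here.
    return {ident: get_record(ident) for ident in identifiers}


# In-memory stand-in for the provider (assumption, for illustration only).
STORE = {f"oai:example:{i}": f"<metadata id='{i}'/>" for i in range(100)}
IDS = list(STORE)
LIMIT = 60  # the provider's "Maximum records" setting
calls = {"list": 0, "get": 0}


def fake_list_identifiers(token: Optional[str]) -> Page:
    calls["list"] += 1
    offset = int(token) if token else 0
    page = IDS[offset:offset + LIMIT]
    next_offset = offset + len(page)
    return page, (str(next_offset) if next_offset < len(IDS) else None)


def fake_get_record(ident: str) -> str:
    calls["get"] += 1
    return STORE[ident]


records = harvest(fake_list_identifiers, fake_get_record)
print(len(records), calls)
```

In this sketch, changing `LIMIT` only changes how many header requests are made in step 1. The number of per-record fetches in step 2 stays the same, which would be consistent with seeing similar speed and more than 10/500 entries for both settings.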

rogers492 commented 6 years ago

Development of OpenWISv4 is stopped.