Esri / geoportal-server-harvester

Metadata Harvester for Esri Geoportal Server
http://esri.github.io/geoportal-server/
Apache License 2.0

OAI-PMH: Too Many Requests #89

Closed valentinedwv closed 5 years ago

valentinedwv commented 5 years ago

Pondering how to handle HTTP 429 in the codebase: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429 A watched session will not trigger the rate limit.

The server's behavior is not quite what the OAI-PMH repository guidelines describe, so 503 might also need to be trapped: http://www.openarchives.org/OAI/2.0/guidelines-repository.htm#FlowControlAndLoadBalancing


endpoint: https://ws.pangaea.de/oai/provider

At about 60 records, we get a Too Many Requests response (HTTP 429):

19-Oct-2018 13:05:04.634 SEVERE [HARVESTING] com.esri.geoportal.harvester.support.ErrorLogger.logError Error processing task: PROCESS:: status: working, title: NAME: , PROCESSOR: DEFAULT[], SOURCE: OAI-PMH[oai-host-url=https://ws.pangaea.de/oai/provider, oai-prefix=iso19139, oai-set=], DESTINATIONS: [FOLDER-SPLIT[folder-root-folder=d:\metadata\, folder-split-folders=true, folder-split-size=1000, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true | Error reading data from: OAI [https://ws.pangaea.de/oai/provider]
 com.esri.geoportal.harvester.api.ex.DataInputException: Error reading data from: OAI [https://ws.pangaea.de/oai/provider]
    at com.esri.geoportal.harvester.oai.pmh.OaiBroker.readContent(OaiBroker.java:147)
    at com.esri.geoportal.harvester.oai.pmh.OaiBroker.access$200(OaiBroker.java:56)
    at com.esri.geoportal.harvester.oai.pmh.OaiBroker$OaiIterator.next(OaiBroker.java:206)
    at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$25(DefaultProcessor.java:154)
    at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess$$Lambda$129/22196721.run(Unknown Source)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.client.HttpResponseException: Too Many Requests
    at com.esri.geoportal.commons.oai.client.Client.readRecord(Client.java:154)
    at com.esri.geoportal.harvester.oai.pmh.OaiBroker.readContent(OaiBroker.java:140)
    ... 5 more

pandzel-zz commented 5 years ago

Pull request #101 attempts to address this issue. It reads the "Retry-After" response header and applies the given delay. That, of course, will make the harvester appear slow.
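
For illustration only, here is a minimal sketch of how a "Retry-After" value could be turned into a delay. It is not the code from the pull request; the class and method names (`RetryAfter`, `isThrottled`, `parseDelay`) and the fallback delay are hypothetical. It assumes the header carries either a number of seconds or an HTTP date, and it treats both 429 and 503 as throttling responses, as discussed above.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not part of geoportal-server-harvester.
final class RetryAfter {

    /** True for the two status codes discussed in this issue. */
    static boolean isThrottled(int status) {
        return status == 429 || status == 503;
    }

    /**
     * Converts a Retry-After header value into a wait time.
     * Returns the fallback when the header is missing or unparseable.
     */
    static Duration parseDelay(String headerValue, Instant now, Duration fallback) {
        if (headerValue == null || headerValue.trim().isEmpty()) {
            return fallback;
        }
        String value = headerValue.trim();
        try {
            // Form 1: delay in seconds, e.g. "120".
            long seconds = Long.parseLong(value);
            return Duration.ofSeconds(Math.max(0L, seconds));
        } catch (NumberFormatException notSeconds) {
            // Not a plain number; try the date form next.
        }
        try {
            // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
            Instant when = ZonedDateTime
                    .parse(value, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant();
            Duration delay = Duration.between(now, when);
            // A date in the past means no waiting is needed.
            return delay.isNegative() ? Duration.ZERO : delay;
        } catch (DateTimeParseException notDate) {
            return fallback;
        }
    }
}
```

A caller would check `isThrottled(status)`, sleep for `parseDelay(...).toMillis()` milliseconds, and then repeat the same request, ideally with a cap on the number of retries so that a misbehaving server cannot stall the harvest indefinitely.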

I think the 429 status code was invented to let servers protect themselves against DDoS-type attacks (or unwanted web crawlers). Hence, the desired solution would be a "white list" mechanism, where the harvester's IP is whitelisted on the server side as part of some agreement or partnership and is allowed unlimited access to the resources, while everyone else would only get a sneak peek of the content.