Esri / geoportal-server-harvester

Metadata Harvester for Esri Geoportal Server
http://esri.github.io/geoportal-server/
Apache License 2.0

Slow harvest from CSW #95

Closed by valentinedwv 5 years ago

valentinedwv commented 5 years ago

https://data.ioos.us/csw

The harvest runs at about 6 records a minute, yet requesting 10 records at a time directly comes back in about 3 seconds (tried at start positions 1 and 1001).

Saw this before with http://search.geothermaldata.org/csw

and attributed it to some server/pyCSW issue.

Now it's feeling like it might be something on the harvester side.

IOOS had an issue and has now fixed it (it turned out to be a Python 3 string issue).

pandzel-zz commented 5 years ago

David,

I tried it myself by harvesting the IOOS site into a local folder and into an instance of the geoportal catalog. I got a solid 300 records/min for the folder and 200 records/min for the catalog. Not bad.

At this moment, without solid evidence and without spending some time profiling this endpoint, I can only conclude that the problem is NOT on the harvester side.

valentinedwv commented 5 years ago

If you leave "ignore robots.txt" unchecked, then you get a 10-second delay in CrawlLocker.

mhogeweg commented 5 years ago

yep, check out: http://search.geothermaldata.org/robots.txt https://data.ioos.us/robots.txt

it lists the crawl delay. We respect the robots.txt settings by default (geoportal is a good bot).
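For anyone who wants to check a server before harvesting, here is a minimal Python sketch of how a crawl delay caps throughput. It is not the harvester's own code: the sample robots.txt content, the helper names, and the one-request-per-record assumption are all illustrative, and the real files on the two servers above may differ.

```python
from urllib.robotparser import RobotFileParser

# Assumed sample robots.txt for illustration; not copied from either server.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
"""


def crawl_delay_seconds(robots_txt, user_agent="*"):
    """Return the Crawl-delay that applies to user_agent, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


def max_requests_per_minute(delay_seconds):
    """Upper bound on requests per minute for a polite crawler.

    Returns None when no delay is declared, meaning robots.txt sets no cap.
    """
    if not delay_seconds:
        return None
    return 60 / delay_seconds


if __name__ == "__main__":
    delay = crawl_delay_seconds(SAMPLE_ROBOTS_TXT)
    print("crawl delay (s):", delay)
    print("max requests/min:", max_requests_per_minute(delay))
```

With a 10-second delay the cap is 6 requests a minute, which matches the 6 records a minute reported above if each record costs one request (an assumption, not something confirmed in this thread).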