Closed — iannesbitt closed this issue 1 year ago
Are we just using the defaults for CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP in Scrapy? We should probably set a reasonable value there, or possibly use the AutoThrottle extension -- I'm not sure how well that works.
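For reference, a minimal sketch of the settings involved (the values are illustrative placeholders, not tested recommendations):

```python
# settings.py -- illustrative values only, not tuned recommendations.

# Cap concurrent requests to any single domain/IP (Scrapy defaults: 8 and 0).
CONCURRENT_REQUESTS_PER_DOMAIN = 2
CONCURRENT_REQUESTS_PER_IP = 2

# Or let AutoThrottle adjust download delays based on observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 60.0          # maximum delay under heavy latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site
```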
From the scrapy log it looks like sonormal is also making a bunch of calls to http://schema.org:80 "GET /docs/jsonldcontext.jsonld HTTP/1.1" (two for each record lookup, each of which redirects to HTTPS). Perhaps we could fetch the context once at the start of the process and cache it, which would significantly reduce our footprint on remote services and speed up the crawl.
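Something along these lines might work -- a minimal sketch assuming sonormal resolves contexts through pyld (the caching wrapper here is hypothetical, not existing sonormal code):

```python
from pyld import jsonld

# In-memory cache keyed by context URL, so the schema.org context is only
# fetched over the network once per process.
_context_cache = {}
_requests_loader = jsonld.requests_document_loader()

def caching_document_loader(url, *args, **kwargs):
    # Serve repeat lookups (e.g. the schema.org context) from the cache.
    if url not in _context_cache:
        _context_cache[url] = _requests_loader(url, *args, **kwargs)
    return _context_cache[url]

# Install the caching loader for all subsequent JSON-LD operations.
jsonld.set_document_loader(caching_document_loader)
```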
This is working, but one side effect is that large repositories such as Dryad take much longer to scrape. I had the staging server crontab set to hourly Dryad scans and had to kill a number of minimally responsive threads that had run over time and piled up. After killing the threads I changed the crontab entry to scan only every other hour.
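The new schedule looks roughly like the entry below (the command and paths are illustrative, not the actual staging-server crontab):

```
# Run the Dryad scan every other hour instead of hourly (illustrative command).
0 */2 * * * /usr/local/bin/run-dryad-scrape.sh >> /var/log/dryad-scrape.log 2>&1
```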
All issues holding this one open have been resolved 🎉 Harvard Dataverse is being harvested now.
I tried harvesting the Harvard Dataverse repository (info url, sitemap.xml, DataONEorg/member-repos#52). I had to stop the process at the request of their technical contact because the crawler was bogging down their services. He reported that the crawler was not requesting JSON-LD as we promised it would. It seems we need to address this efficiency issue before we begin harvesting metadata from them.
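If the problem is content negotiation, something like the following might help -- a sketch assuming the crawler should advertise JSON-LD via the Accept header (DEFAULT_REQUEST_HEADERS is standard Scrapy; whether this is where our crawler sets its headers is an assumption):

```python
# settings.py -- ask servers for JSON-LD (or JSON) rather than HTML by default.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "application/ld+json, application/json;q=0.9, text/html;q=0.5",
}
```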
Below is the head of the scrapy log and the first record from the Dataverse crawl.