DataONEorg / mnlite

Light weight read-only DataONE member node in Python Flask
Apache License 2.0

Scraping Harvard Dataverse HTML rather than requesting JSON-LD directly? #23

Closed iannesbitt closed 10 months ago

iannesbitt commented 1 year ago

I tried harvesting the Harvard Dataverse repository (info url, sitemap.xml, DataONEorg/member-repos#52). I had to stop the process at the request of their technical contact because the crawler was bogging down their services. He reported that the crawler was not requesting JSON-LD as we had promised it would. It seems we need to address crawler efficiency before we begin harvesting metadata from them.
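For reference, here is a minimal sketch of what "requesting JSON-LD directly" might look like. It assumes the Dataverse native API export endpoint (`/api/datasets/export` with `exporter=schema.org`) is available on the target installation; the helper name `export_url` is hypothetical and is not part of soscan.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def export_url(dataset_page_url: str) -> str:
    """Map a dataset landing page URL to a JSON-LD export URL.

    Assumption: the installation exposes the Dataverse native API
    export endpoint and its schema.org exporter.
    """
    parts = urlparse(dataset_page_url)
    # The sitemap entries carry the DOI in the persistentId query parameter.
    pid = parse_qs(parts.query)["persistentId"][0]
    query = urlencode({"exporter": "schema.org", "persistentId": pid})
    return f"{parts.scheme}://{parts.netloc}/api/datasets/export?{query}"


page = "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1ZHNO1"
print(export_url(page))
```

Whether this is lighter on their servers than rendering `dataset.xhtml` is something to confirm with the Dataverse team rather than assume.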

Below is the head of the scrapy log and the first record from the Dataverse crawl.

2023-03-27 20:54:08 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: soscan)
2023-03-27 20:54:08 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.8.0 (default, Dec  9 2021, 17:53:27) - [GCC 8.4.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Linux-5.4.0-81-generic-x86_64-with-glibc2.27
2023-03-27 20:54:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2023-03-27 20:54:08 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'soscan',
 'LOG_FILE': '/var/log/mnlite/mnTestDATAVERSE-crawl-2023-03.log',
 'NEWSPIDER_MODULE': 'soscan.spiders',
 'REACTOR_THREADPOOL_MAXSIZE': 8,
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['soscan.spiders'],
 'USER_AGENT': 'soscan (+https://dataone.org/)'}
2023-03-27 20:54:08 [scrapy.extensions.telnet] INFO: Telnet Password: 36e3f625bf5785d6
2023-03-27 20:54:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2023-03-27 20:54:08 [JsonldSpider] DEBUG: ALT_RULES = None
2023-03-27 20:54:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'soscan.middlewares.SoscanDownloaderMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-03-27 20:54:08 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'soscan.middlewares.SoscanSpiderMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-03-27 20:54:08 [scrapy.middleware] INFO: Enabled item pipelines:
['soscan.sonormalizepipeline.SoscanNormalizePipeline',
 'soscan.opersistpipeline.OPersistPipeline']
2023-03-27 20:54:08 [scrapy.core.engine] INFO: Spider opened
2023-03-27 20:54:08 [OPersistPipeline] DEBUG: open_spider
2023-03-27 20:54:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-03-27 20:54:08 [JsonldSpider] INFO: Spider opened: JsonldSpider
2023-03-27 20:54:08 [JsonldSpider] INFO: Spider opened: JsonldSpider
2023-03-27 20:54:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-03-27 20:54:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dataverse.harvard.edu/robots.txt> (referer: None)
2023-03-27 20:54:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dataverse.harvard.edu/sitemap/sitemap.xml> (referer: None)
2023-03-27 20:55:09 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
2023-03-27 20:56:09 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-03-27 20:56:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1ZHNO1> [False] (referer: https://dataverse.harvard.edu/sitemap/sitemap.xml)
2023-03-27 20:56:43 [JsonldSpider] DEBUG: ITEM without jsonld: {'status': 200,
 'time_loc': datetime.datetime(2023, 3, 26, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>),
 'time_modified': None,
 'time_retrieved': datetime.datetime(2023, 3, 27, 20, 56, 43, 439537, tzinfo=datetime.timezone.utc),
 'url': 'https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1ZHNO1'}
2023-03-27 20:56:43 [SoscanNormalize] DEBUG: process_item: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1ZHNO1
2023-03-27 20:56:43 [sonormal] DEBUG: Framing
2023-03-27 20:56:43 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): schema.org:443
2023-03-27 20:56:43 [urllib3.connectionpool] DEBUG: https://schema.org:443 "GET / HTTP/1.1" 200 2222
2023-03-27 20:56:43 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): schema.org:443
2023-03-27 20:56:43 [urllib3.connectionpool] DEBUG: https://schema.org:443 "GET /docs/jsonldcontext.jsonld HTTP/1.1" 200 172009
2023-03-27 20:56:43 [sonormal] INFO: fdoc OK
2023-03-27 20:56:43 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): schema.org:443
2023-03-27 20:56:43 [urllib3.connectionpool] DEBUG: https://schema.org:443 "GET / HTTP/1.1" 200 2222
2023-03-27 20:56:43 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): schema.org:443
2023-03-27 20:56:43 [urllib3.connectionpool] DEBUG: https://schema.org:443 "GET /docs/jsonldcontext.jsonld HTTP/1.1" 200 172009
2023-03-27 20:56:43 [OPersistPipeline] INFO: Persisting sha256:e3aa01a94b32d22d3cb4d6ff921a320b4330dea7b563a3f810b2158552d4ad85
2023-03-27 20:56:43 [OPersist] INFO: Persisting sha256:e3aa01a94b32d22d3cb4d6ff921a320b4330dea7b563a3f810b2158552d4ad85
2023-03-27 20:56:43 [OPersist] INFO: Path = /tmp/tmptvfdqfg5
2023-03-27 20:56:43 [FLOB] DEBUG: wrote 2199 bytes to /home/vieglais/WORK/mnlite/instance/nodes/mnTestDATAVERSE/data/e/3/a/e3aa01a94b32d22d3cb4d6ff921a320b4330dea7b563a3f810b2158552d4ad85.bin
2023-03-27 20:56:43 [OPersist] INFO: Adding database entry...
2023-03-27 20:56:43 [OPersist] WARNING: Requested subject not found: http://orcid.org/0000-0001-5828-6070
2023-03-27 20:56:43 [OPersist] DEBUG: Using submitter: None
2023-03-27 20:56:43 [OPersist] WARNING: Requested subject not found: http://orcid.org/0000-0001-5828-6070
2023-03-27 20:56:43 [OPersist] DEBUG: Using rights_holder: None
2023-03-27 20:56:43 [OPersist] DEBUG: {
  "identifier": "sha256:e3aa01a94b32d22d3cb4d6ff921a320b4330dea7b563a3f810b2158552d4ad85",
  "series_id": "https://doi.org/10.7910/DVN/1ZHNO1",
  "size_bytes": 2199,
  "checksum_sha256": "e3aa01a94b32d22d3cb4d6ff921a320b4330dea7b563a3f810b2158552d4ad85",
  "checksum_sha1": "7803e8d33fa237d5b89ccc7abe2b0eb53d221bd6",
  "checksum_md5": "66e774c434286ee21d195f47c4b20efa",
  "identifiers": [],
  "t_added": null,
  "t_content_modified": null,
  "content": "e/3/a/e3aa01a94b32d22d3cb4d6ff921a320b4330dea7b563a3f810b2158552d4ad85.bin",
  "media_type_name": "application/ld+json",
  "file_name": "tmptvfdqfg5",
  "source": "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1ZHNO1",
  "format_id": "science-on-schema.org/Dataset;ld+json",
  "date_modified": null,
  "date_uploaded": "2023-03-26T00:00:00+0000",
  "serial_version": null,
  "replication_allowed": null,
  "number_replicas": null,
  "replication_preferred": [],
  "replication_blocked": [],
  "archived": null,
  "authoritative_member_node": null,
  "origin_member_node": null,
  "obsoletes": null,
  "obsoleted_by": null,
  "submitter": null,
  "rights_holder": null,
  "access_policy": [
    {
      "id": 1,
      "permission": "read",
      "t": "2023-02-16T19:42:22Z",
      "t_mod": "2023-02-16T19:42:22Z",
      "subjects": [
        {
          "subject": "public",
          "name": "Anonymous user",
          "t": "2023-02-16T19:42:22Z",
          "t_mod": "2023-02-16T19:42:22Z"
        }
      ]
    }
  ]
}
2023-03-27 20:56:43 [root] DEBUG: At doThingChecks: {
  "identifier": "sha256:e3aa01a94b32d22d3cb4d6ff921a320b4330dea7b563a3f810b2158552d4ad85",
  "series_id": "https://doi.org/10.7910/DVN/1ZHNO1",
  "size_bytes": 2199,
  "checksum_sha256": "e3aa01a94b32d22d3cb4d6ff921a320b4330dea7b563a3f810b2158552d4ad85",
  "checksum_sha1": "7803e8d33fa237d5b89ccc7abe2b0eb53d221bd6",
  "checksum_md5": "66e774c434286ee21d195f47c4b20efa",
  "identifiers": [],
  "t_added": null,
  "t_content_modified": null,
  "content": "e/3/a/e3aa01a94b32d22d3cb4d6ff921a320b4330dea7b563a3f810b2158552d4ad85.bin",
  "media_type_name": "application/ld+json",
  "file_name": "tmptvfdqfg5",
  "source": "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1ZHNO1",
  "format_id": "science-on-schema.org/Dataset;ld+json",
  "date_modified": null,
  "date_uploaded": "2023-03-26T00:00:00+0000",
  "serial_version": null,
  "replication_allowed": null,
  "number_replicas": null,
  "replication_preferred": [],
  "replication_blocked": [],
  "archived": null,
  "authoritative_member_node": null,
  "origin_member_node": null,
  "obsoletes": null,
  "obsoleted_by": null,
  "submitter": null,
  "rights_holder": null,
  "access_policy": [
    {
      "id": 1,
      "permission": "read",
      "t": "2023-02-16T19:42:22Z",
      "t_mod": "2023-02-16T19:42:22Z",
      "subjects": [
        {
          "subject": "public",
          "name": "Anonymous user",
          "t": "2023-02-16T19:42:22Z",
          "t_mod": "2023-02-16T19:42:22Z"
        }
      ]
    }
  ]
}
2023-03-27 20:56:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1ZHNO1>
{'alt_identifiers': None,
 'format_id': 'science-on-schema.org/Dataset;ld+json',
 'identifier': None,
 'jsonld': {'@context': 'http://schema.org',
            '@id': 'https://doi.org/10.7910/DVN/1ZHNO1',
            '@type': 'Dataset',
            'author': [{'@type': 'Person',
                        'affiliation': {'@type': 'Organization',
                                        'name': 'University Medical Center '
                                                'Hamburg'},
                        'familyName': 'Randhawa',
                        'givenName': 'Aman',
                        'name': 'Randhawa, Aman'}],
            'creator': [{'@type': 'Person',
                         'affiliation': {'@type': 'Organization',
                                         'name': 'University Medical Center '
                                                 'Hamburg'},
                         'familyName': 'Randhawa',
                         'givenName': 'Aman',
                         'name': 'Randhawa, Aman'}],
            'dateModified': '2023-03-26',
            'datePublished': '2023-03-26',
            'description': 'The datasets to our study with the title: "The '
                           'Effects of Assessing Character Strengths vs. '
                           'Psychopathology on Mood, Hope, Perceived Stigma '
                           'and Cognitive Performance in Individuals with '
                           'Psychosis". One dataset is the original german '
                           'version. The other one is a version translated '
                           'into english.',
            'distribution': [{'@type': 'DataDownload',
                              'contentSize': 161141,
                              'contentUrl': 'https://dataverse.harvard.edu/api/access/datafile/6990812',
                              'description': 'original german version',
                              'encodingFormat': 'text/tab-separated-values',
                              'name': 'Character_strengths_study_data_original_german_version.tab'},
                             {'@type': 'DataDownload',
                              'contentSize': 159160,
                              'contentUrl': 'https://dataverse.harvard.edu/api/access/datafile/6990811',
                              'encodingFormat': 'text/tab-separated-values',
                              'name': 'Character_strengths_study_data_translated_english_version.tab'}],
            'identifier': 'https://doi.org/10.7910/DVN/1ZHNO1',
            'includedInDataCatalog': {'@type': 'DataCatalog',
                                      'name': 'Harvard Dataverse',
                                      'url': 'https://dataverse.harvard.edu'},
            'keywords': ['Medicine, Health and Life Sciences'],
            'license': 'http://creativecommons.org/publicdomain/zero/1.0',
            'name': 'Character strengths study data',
            'provider': {'@type': 'Organization', 'name': 'Harvard Dataverse'},
            'publisher': {'@type': 'Organization', 'name': 'Harvard Dataverse'},
            'version': '1'},
 'normalized': [{'@id': 'https://doi.org/10.7910/DVN/1ZHNO1',
                 '@type': ['http://schema.org/Dataset'],
                 'http://schema.org/author': [{'@type': ['http://schema.org/Person'],
                                               'http://schema.org/affiliation': [{'@type': ['http://schema.org/Organization'],
                                                                                  'http://schema.org/name': [{'@value': 'University '
                                                                                                                        'Medical '
                                                                                                                        'Center '
                                                                                                                        'Hamburg'}]}],
                                               'http://schema.org/familyName': [{'@value': 'Randhawa'}],
                                               'http://schema.org/givenName': [{'@value': 'Aman'}],
                                               'http://schema.org/name': [{'@value': 'Randhawa, '
                                                                                     'Aman'}]}],
                 'http://schema.org/creator': [{'@list': [{'@type': ['http://schema.org/Person'],
                                                           'http://schema.org/affiliation': [{'@type': ['http://schema.org/Organization'],
                                                                                              'http://schema.org/name': [{'@value': 'University '
                                                                                                                                    'Medical '
                                                                                                                                    'Center '
                                                                                                                                    'Hamburg'}]}],
                                                           'http://schema.org/familyName': [{'@value': 'Randhawa'}],
                                                           'http://schema.org/givenName': [{'@value': 'Aman'}],
                                                           'http://schema.org/name': [{'@value': 'Randhawa, '
                                                                                                 'Aman'}]}]}],
                 'http://schema.org/dateModified': [{'@type': 'http://schema.org/Date',
                                                     '@value': '2023-03-26'}],
                 'http://schema.org/datePublished': [{'@type': 'http://schema.org/Date',
                                                      '@value': '2023-03-26'}],
                 'http://schema.org/description': [{'@value': 'The datasets to '
                                                              'our study with '
                                                              'the title: "The '
                                                              'Effects of '
                                                              'Assessing '
                                                              'Character '
                                                              'Strengths vs. '
                                                              'Psychopathology '
                                                              'on Mood, Hope, '
                                                              'Perceived '
                                                              'Stigma and '
                                                              'Cognitive '
                                                              'Performance in '
                                                              'Individuals '
                                                              'with '
                                                              'Psychosis". One '
                                                              'dataset is the '
                                                              'original german '
                                                              'version. The '
                                                              'other one is a '
                                                              'version '
                                                              'translated into '
                                                              'english.'}],
                 'http://schema.org/distribution': [{'@type': ['http://schema.org/DataDownload'],
                                                     'http://schema.org/contentSize': [{'@value': 161141}],
                                                     'http://schema.org/contentUrl': [{'@id': 'https://dataverse.harvard.edu/api/access/datafile/6990812'}],
                                                     'http://schema.org/description': [{'@value': 'original '
                                                                                                  'german '
                                                                                                  'version'}],
                                                     'http://schema.org/encodingFormat': [{'@value': 'text/tab-separated-values'}],
                                                     'http://schema.org/name': [{'@value': 'Character_strengths_study_data_original_german_version.tab'}]},
                                                    {'@type': ['http://schema.org/DataDownload'],
                                                     'http://schema.org/contentSize': [{'@value': 159160}],
                                                     'http://schema.org/contentUrl': [{'@id': 'https://dataverse.harvard.edu/api/access/datafile/6990811'}],
                                                     'http://schema.org/encodingFormat': [{'@value': 'text/tab-separated-values'}],
                                                     'http://schema.org/name': [{'@value': 'Character_strengths_study_data_translated_english_version.tab'}]}],
                 'http://schema.org/identifier': [{'@list': [{'@value': 'https://doi.org/10.7910/DVN/1ZHNO1'}]}],
                 'http://schema.org/includedInDataCatalog': [{'@type': ['http://schema.org/DataCatalog'],
                                                              'http://schema.org/name': [{'@value': 'Harvard '
                                                                                                    'Dataverse'}],
                                                              'http://schema.org/url': [{'@id': 'https://dataverse.harvard.edu'}]}],
                 'http://schema.org/keywords': [{'@value': 'Medicine, Health '
                                                           'and Life '
                                                           'Sciences'}],
                 'http://schema.org/license': [{'@id': 'http://creativecommons.org/publicdomain/zero/1.0'}],
                 'http://schema.org/name': [{'@value': 'Character strengths '
                                                       'study data'}],
                 'http://schema.org/provider': [{'@type': ['http://schema.org/Organization'],
                                                 'http://schema.org/name': [{'@value': 'Harvard '
                                                                                       'Dataverse'}]}],
                 'http://schema.org/publisher': [{'@type': ['http://schema.org/Organization'],
                                                  'http://schema.org/name': [{'@value': 'Harvard '
                                                                                        'Dataverse'}]}],
                 'http://schema.org/version': [{'@value': '1'}]}],
 'series_id': 'https://doi.org/10.7910/DVN/1ZHNO1',
 'status': 200,
 'time_loc': datetime.datetime(2023, 3, 26, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>),
 'time_modified': None,
 'time_retrieved': datetime.datetime(2023, 3, 27, 20, 56, 43, 439537, tzinfo=datetime.timezone.utc),
 'url': 'https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1ZHNO1'}
2023-03-27 20:56:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VSFKOI> [False] (referer: https://dataverse.harvard.edu/sitemap/sitemap.xml)
2023-03-27 20:56:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/RNCWXF> [False] (referer: https://dataverse.harvard.edu/sitemap/sitemap.xml)
2023-03-27 20:56:43 [JsonldSpider] DEBUG: ITEM without jsonld: {'status': 200,
 'time_loc': datetime.datetime(2023, 3, 26, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>),
 'time_modified': None,
 'time_retrieved': datetime.datetime(2023, 3, 27, 20, 56, 43, 952477, tzinfo=datetime.timezone.utc),
 'url': 'https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VSFKOI'}
2023-03-27 20:56:43 [SoscanNormalize] DEBUG: process_item: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VSFKOI
2023-03-27 20:56:44 [sonormal] DEBUG: Framing
2023-03-27 20:56:44 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): schema.org:443
2023-03-27 20:56:44 [urllib3.connectionpool] DEBUG: https://schema.org:443 "GET / HTTP/1.1" 200 2222
2023-03-27 20:56:44 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): schema.org:443
2023-03-27 20:56:44 [urllib3.connectionpool] DEBUG: https://schema.org:443 "GET /docs/jsonldcontext.jsonld HTTP/1.1" 200 172009
2023-03-27 20:56:44 [sonormal] INFO: fdoc OK
2023-03-27 20:56:44 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): schema.org:443
2023-03-27 20:56:44 [urllib3.connectionpool] DEBUG: https://schema.org:443 "GET / HTTP/1.1" 200 2222
2023-03-27 20:56:44 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): schema.org:443
2023-03-27 20:56:44 [urllib3.connectionpool] DEBUG: https://schema.org:443 "GET /docs/jsonldcontext.jsonld HTTP/1.1" 200 172009
2023-03-27 20:56:44 [OPersistPipeline] INFO: Persisting sha256:f48ffe2e0b2035c470f931ecf2b798dabe5130ce83d50e479551c4c96377ae70
2023-03-27 20:56:44 [OPersist] INFO: Persisting sha256:f48ffe2e0b2035c470f931ecf2b798dabe5130ce83d50e479551c4c96377ae70
2023-03-27 20:56:44 [OPersist] INFO: Path = /tmp/tmpjqugoxdt
2023-03-27 20:56:44 [FLOB] DEBUG: wrote 3947 bytes to /home/vieglais/WORK/mnlite/instance/nodes/mnTestDATAVERSE/data/f/4/8/f48ffe2e0b2035c470f931ecf2b798dabe5130ce83d50e479551c4c96377ae70.bin
2023-03-27 20:56:44 [OPersist] INFO: Adding database entry...
2023-03-27 20:56:44 [OPersist] WARNING: Requested subject not found: http://orcid.org/0000-0001-5828-6070
2023-03-27 20:56:44 [OPersist] DEBUG: Using submitter: None
2023-03-27 20:56:44 [OPersist] WARNING: Requested subject not found: http://orcid.org/0000-0001-5828-6070
2023-03-27 20:56:44 [OPersist] DEBUG: Using rights_holder: None

mbjones commented 1 year ago

Are we just using the Scrapy defaults for CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP? We should probably set a reasonable default there, or possibly use the AutoThrottle extension, though I'm not sure how well that works.
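A rough sketch of what more conservative settings could look like. The setting names are standard Scrapy settings, but the values and the `POLITE_SETTINGS` name are illustrative assumptions, not tuned or agreed numbers. The "Overridden settings" list in the log above does not include any of them, which suggests the defaults were in effect.

```python
# Hypothetical throttling overrides; values are illustrative, not tuned.
POLITE_SETTINGS = {
    # Cap parallel requests to a single host.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    # Base delay in seconds between requests to the same host.
    "DOWNLOAD_DELAY": 1.0,
    # Let Scrapy adjust the delay based on observed response latency.
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 5.0,
    "AUTOTHROTTLE_MAX_DELAY": 60.0,
    # Average number of requests sent in parallel to each remote server.
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
}

for name, value in sorted(POLITE_SETTINGS.items()):
    print(f"{name} = {value!r}")
```

These could go in the project's `settings.py` or in a spider's `custom_settings` attribute, depending on whether the limits should apply to every repository or only to selected ones.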

iannesbitt commented 1 year ago

From the scrapy log it looks like sonormal is also making a bunch of calls to http://schema.org:80 "GET /docs/jsonldcontext.jsonld HTTP/1.1" (two for each record lookup, each of which redirects to HTTPS). Perhaps we could fetch the context once at the start of the process and cache it, which would significantly reduce the footprint and speed up the process.
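The caching idea could be sketched roughly as below. This is not sonormal's actual code: `caching_loader` and `fake_remote_loader` are hypothetical names, and the loader signature `(url, options)` is assumed to match the document-loader convention used by PyLD, so that the wrapped loader could be registered with `jsonld.set_document_loader` if sonormal relies on PyLD.

```python
from typing import Any, Callable, Dict, Optional

Loader = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def caching_loader(loader: Loader) -> Callable[..., Dict[str, Any]]:
    """Wrap a document loader so each URL is fetched at most once."""
    cache: Dict[str, Dict[str, Any]] = {}

    def load(url: str, options: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        if url not in cache:
            # Only the first request for a URL reaches the network.
            cache[url] = loader(url, options or {})
        return cache[url]

    return load


# Stand-in for a real network loader, so the sketch runs offline.
calls = []


def fake_remote_loader(url: str, options: Dict[str, Any]) -> Dict[str, Any]:
    calls.append(url)
    return {"contextUrl": None, "documentUrl": url, "document": {"@context": {}}}


cached = caching_loader(fake_remote_loader)
for _ in range(4):
    cached("https://schema.org/docs/jsonldcontext.jsonld")
print(len(calls))
```

Note that the cache is keyed on the exact URL string, so `http://schema.org` and `https://schema.org` would be cached separately unless the URL is normalized first.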

iannesbitt commented 1 year ago

This is working, but one side effect is that large repositories such as Dryad take much longer to scrape. I had the staging server crontab set to hourly Dryad scans and had to kill a number of minimally responsive threads that had run over time and piled up. After killing the threads, I changed the crontab entry to scan only every other hour.
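One possible guard against scans piling up, separate from the crontab change, is to have each run take a non-blocking file lock and exit if a previous run still holds it. The sketch below is an assumption about how that might be done on a Unix host; `try_lock` and the lock file name are hypothetical, and none of this is part of mnlite.

```python
import fcntl
import os
import tempfile


def try_lock(path: str):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


# Hypothetical lock file, placed in a fresh temp directory for the demo.
lock_path = os.path.join(tempfile.mkdtemp(), "dryad-scan.lock")

first = try_lock(lock_path)   # simulates the scan that is still running
second = try_lock(lock_path)  # simulates the next cron invocation

if second is None:
    print("previous scan still running; skipping this run")

# Closing the file releases the lock.
first.close()
```

A cron wrapper would call something like `try_lock` once at startup and exit early on `None`, so a slow Dryad scan delays the next run instead of stacking a second one on top of it.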

iannesbitt commented 10 months ago

All issues holding this one open have been resolved 🎉 Harvard Dataverse is being harvested now.