DataONEorg / mnlite

Light weight read-only DataONE member node in Python Flask
Apache License 2.0

Restart a large harvest at a settable number of records through a sitemap #45

Closed iannesbitt closed 10 months ago

iannesbitt commented 10 months ago

Running a large harvest such as Harvard Dataverse often takes days. If the harvest needs to be restarted, it would be much cheaper and easier, both on our end and for the server being harvested, to start a set number of records into the sitemap.

For example, we notice issues with Harvard Dataverse (HD) at record 10,000. Rather than spending more than a day scraping records 1-10,000, which already exist in mnlite, it would be nice to set a start point so that the scraper could begin right at the 10,000th record. This seems like it would be a drop-in addition to settings.json that could then be easily implemented in soscan.spiders.jsonldspider.JsonldSpider.sitemap_filter.

The setting could be called "start_point" or something similar.
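
A rough sketch of the skipping logic, in case it helps the discussion. This is not the existing mnlite code: the `StartPointFilter` class is hypothetical, and how "start_point" would be read from settings.json is left out. It assumes entries arrive as an iterable (as they do in Scrapy's `sitemap_filter(self, entries)` hook) and that the count should carry over when the sitemap is split across several files.

```python
class StartPointFilter:
    """Hypothetical helper: drop sitemap entries that come before start_point.

    start_point is 1-based, so start_point=10000 skips records 1-9999
    and yields the 10,000th record onward. A value of 0 or 1 skips nothing.
    """

    def __init__(self, start_point=0):
        self.start_point = max(0, int(start_point))
        # Running count of entries seen so far, kept across calls so that
        # a sitemap split over several files is counted as one sequence.
        self.seen = 0

    def __call__(self, entries):
        for entry in entries:
            self.seen += 1
            if self.seen < self.start_point:
                continue  # already harvested, skip without fetching
            yield entry


# Example: five fake entries, restarting at the 3rd record.
skip = StartPointFilter(start_point=3)
entries = [{"loc": f"https://example.org/record/{i}"} for i in range(1, 6)]
kept = [entry["loc"] for entry in skip(entries)]
```

Inside the spider, `sitemap_filter` could then do something like `yield from self._skip(entries)`. Two caveats: this sketch counts every entry it is handed, so it should only be applied to record entries and not to the sub-sitemap entries of a sitemap index; and it assumes the sitemap lists records in the same order on each run, which may not hold for every server.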