Running a large harvest such as Harvard Dataverse often takes days. If the harvest needs to be restarted, it would be much cheaper and easier, both for us and for the server being harvested, to start a given number of records down the sitemap rather than from the beginning.
For example, say we notice issues with HD at record 10,000. Rather than spending more than a day re-scraping records 1-10,000, which already exist in mnlite, it would be nice to set a start point so that the scraper could start right at the 10,000th record. It seems like this would be a drop-in addition to `settings.json` that could then be easily implemented in `soscan.spiders.jsonldspider.JsonldSpider.sitemap_filter`.
Setting could be called "start_point" or something similar.
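
A rough sketch of how the skipping logic might look, written as a standalone helper so it can be tried without Scrapy installed. The class name `StartPointFilter` and its attributes are hypothetical, and `start_point` is treated as 1-based (so `10000` means the 10,000th entry is the first one kept). It assumes the sitemap order is stable between runs, and it keeps its counter across calls because a large sitemap may be split over several files, with the filter called once per file.

```python
from typing import Dict, Iterable, Iterator


class StartPointFilter:
    """Hypothetical helper: drop sitemap entries before a 1-based start point."""

    def __init__(self, start_point: int = 1) -> None:
        # Values below 1 are treated as "start from the first entry".
        self.start_point = max(1, start_point)
        # Entries seen so far, counted across every sitemap file processed.
        self.seen = 0

    def filter(self, entries: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
        for entry in entries:
            self.seen += 1
            if self.seen >= self.start_point:
                yield entry


# Small demonstration with made-up URLs.
demo_entries = [{"loc": f"https://example.org/record/{i}"} for i in range(1, 6)]
demo_filter = StartPointFilter(start_point=4)
kept = [entry["loc"] for entry in demo_filter.filter(demo_entries)]
```

Inside `sitemap_filter`, the spider could then delegate to something like `yield from self._start_filter.filter(entries)`, with `start_point` read from `settings.json` (how that setting reaches the spider is left open here). One thing to check: if `sitemap_filter` is also called for sitemap index files, those entries would need to be passed through without being counted.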