Closed by iannesbitt 1 year ago
This was trickier than I thought, because the documentation for the Sitemap object calls it an iterator when it's actually a generator. So the object returned by `Sitemap.__iter__()` has to be converted to a list before calling `reversed()`. It seems to be working well now.
45 doesn't have the intended effect for Harvard Dataverse, in that it does not skip the first `n` entries in its sitemap, perhaps due to the way entries are registered from compound sitemaps. Since soscan has to get through ~45,000 items to get to the items it hasn't already stored in the database, perhaps it would be more efficient to simply yield items from the sitemap iterator in reverse order, so that the spider can encounter the batch of new items at the end of the list first. This would just involve running Python's built-in `reversed()` on the `entries` iterator in `JsonldSpider.sitemap_filter()`.
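For anyone landing here later, this is a rough sketch of the idea, not the actual soscan code. It assumes `sitemap_filter()` receives an iterable of entry dicts and yields the ones to crawl, as in Scrapy's `SitemapSpider.sitemap_filter(entries)` hook. `fake_sitemap_entries`, `reversed_entries` and `JsonldSpiderSketch` are hypothetical names made up so the example runs without Scrapy. Because `reversed()` only accepts a sequence (or an object that defines `__reversed__`), the entries are copied into a list first.

```python
from typing import Any, Dict, Iterable, Iterator

Entry = Dict[str, Any]


def fake_sitemap_entries() -> Iterator[Entry]:
    """Hypothetical stand-in for the sitemap entries object: a one-pass generator."""
    for i in range(1, 6):
        yield {"loc": f"https://example.org/dataset/{i}", "lastmod": f"2021-01-0{i}"}


def reversed_entries(entries: Iterable[Entry]) -> Iterator[Entry]:
    """Yield sitemap entries last-to-first.

    reversed() cannot consume a generator directly, so the entries are
    materialised into a list first. This holds every entry in memory at once.
    """
    yield from reversed(list(entries))


class JsonldSpiderSketch:
    """Hypothetical, Scrapy-free stand-in for JsonldSpider."""

    def sitemap_filter(self, entries: Iterable[Entry]) -> Iterator[Entry]:
        # Newest entries are assumed to sit at the end of the sitemap,
        # so walking it backwards reaches them first.
        for entry in reversed_entries(entries):
            yield entry


if __name__ == "__main__":
    # Calling reversed() straight on the generator raises TypeError.
    try:
        reversed(fake_sitemap_entries())
    except TypeError as exc:
        print(f"reversed() on a generator fails: {exc}")

    spider = JsonldSpiderSketch()
    for entry in spider.sitemap_filter(fake_sitemap_entries()):
        print(entry["loc"])
```

The trade-off is that `list(entries)` loads the whole sitemap into memory before anything is yielded. That should be fine at ~45,000 entries, but it does mean the spider no longer streams entries as they are parsed.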