DataONEorg / mnlite

Light weight read-only DataONE member node in Python Flask

Apache License 2.0


Add option to read sitemap entry iterator in reverse order #51

Closed by iannesbitt 10 months ago

iannesbitt commented 10 months ago

#45 doesn't have the intended effect for Harvard Dataverse: it does not skip the first `n` entries in the sitemap, perhaps because of the way entries are registered from compound sitemaps. Since soscan has to get through ~45,000 items before reaching the ones it hasn't already stored in the database, it may be more efficient to yield items from the sitemap iterator in reverse order, so that the spider encounters the batch of new items at the end of the list first. This would just involve running Python's built-in `reversed()` on the `entries` iterator in `JsonldSpider.sitemap_filter()`.

iannesbitt commented 10 months ago

This was trickier than I thought. The documentation for the Sitemap object calls it an iterator, but `Sitemap.__iter__()` is actually a generator, and `reversed()` only accepts sequences (or objects that define `__reversed__()`). So the entries have to be converted to a list before calling `reversed()`. It seems to be working well now.
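
For anyone following along, here is a minimal sketch of the idea, not the actual mnlite code. The `fake_sitemap_entries()` generator and the `reverse` parameter are hypothetical stand-ins for the real sitemap entries and for whatever option the spider exposes; the function is written as a plain function rather than a spider method so it runs without Scrapy installed.

```python
def fake_sitemap_entries():
    """Hypothetical stand-in for the sitemap entries: a generator of dicts."""
    for i in range(1, 6):
        yield {"loc": f"https://example.org/dataset/{i}"}


def sitemap_filter(entries, reverse=False):
    """Yield sitemap entries, optionally starting from the end of the sitemap.

    `reverse` is an assumed option name used only for this illustration.
    """
    if reverse:
        # reversed() needs a sequence, so the generator is materialized first.
        entries = reversed(list(entries))
    for entry in entries:
        yield entry


# Calling reversed() directly on the generator raises TypeError.
raised = False
try:
    reversed(fake_sitemap_entries())
except TypeError:
    raised = True

# With the list conversion, the last sitemap entry is yielded first.
locs = [e["loc"] for e in sitemap_filter(fake_sitemap_entries(), reverse=True)]
```

The trade-off is that `list()` holds every entry in memory at once, so the cost grows with the size of the sitemap; the first new item is only yielded after the whole sitemap has been read.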