DataONEorg / mnlite

Light weight read-only DataONE member node in Python Flask
Apache License 2.0

Add option to read sitemap entry iterator in reverse order #51

Closed iannesbitt closed 1 year ago

iannesbitt commented 1 year ago

#45 doesn't have the intended effect for Harvard Dataverse: it does not skip the first n entries in the sitemap, perhaps because of the way entries are registered from compound sitemaps. Since soscan has to get through ~45,000 items before it reaches the ones it hasn't already stored in the database, it may be more efficient to simply yield items from the sitemap iterator in reverse order, so that the spider encounters the batch of new items at the end of the list first. This would just involve running Python's built-in reversed() on the entries iterator in JsonldSpider.sitemap_filter().

iannesbitt commented 1 year ago

This was trickier than I thought, because the documentation for the Sitemap object calls it an iterator when it's actually a generator, and reversed() only accepts sequences (or objects that define __reversed__()). So the output of Sitemap.__iter__() has to be converted to a list before calling reversed(). It seems to be working well now.
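
For anyone reading later, here is a minimal sketch of the idea, not the actual mnlite/soscan code. It assumes sitemap_filter() receives the entries as an iterable of dicts with a "loc" key, which is how Scrapy's SitemapSpider.sitemap_filter(entries) hook is documented. The names fake_sitemap and reversed_entries are hypothetical stand-ins made up for illustration.

```python
from typing import Dict, Iterable, Iterator


def fake_sitemap(n: int) -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for the sitemap object: a generator of entry dicts."""
    for i in range(n):
        yield {"loc": f"https://example.org/dataset/{i}"}


def reversed_entries(entries: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield sitemap entries last-to-first.

    reversed() needs a sequence, so the iterable is materialized into a
    list first. Note that this holds every entry in memory at once.
    """
    yield from reversed(list(entries))


# Calling reversed() directly on a generator raises TypeError.
try:
    reversed(fake_sitemap(3))
    direct_call_failed = False
except TypeError:
    direct_call_failed = True

# With the list conversion, the newest (last) entries come out first.
locs = [entry["loc"] for entry in reversed_entries(fake_sitemap(3))]
```

Inside a spider, the body of sitemap_filter() could then be something like `yield from reversed(list(entries))`. The trade-off is memory: with ~45,000 entries the list is small, but a much larger sitemap would be fully loaded before the first entry is yielded.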