Sibusten / derpibooru-downloader

A downloader for imageboards running Philomena, such as Derpibooru.
MIT License

Deep pagination #24

Closed · liamwhite closed this issue 5 years ago

liamwhite commented 5 years ago

Hi, I run the site this downloader is meant for.

As far as I can tell, your downloader fetches image API metadata by paging into the result set with the page parameter. https://github.com/Sibusten/derpibooru-downloader/blob/b89b8e6b422062272dd76fba6302d4048a4a7ad1/DerpibooruDownloader/downloadmanager.cpp#L176-L179 https://github.com/Sibusten/derpibooru-downloader/blob/b89b8e6b422062272dd76fba6302d4048a4a7ad1/DerpibooruDownloader/downloadmanager.cpp#L309-L310

This is an extremely inefficient way to get results from the API, because it requires the search engine to load `page * per_page` results into memory and sort all of them there. I regularly see requests at pages far beyond what should be requested in the course of normal operation.

Please change this program to page by specifying either the maximum image ID or maximum `created_at` date for each request, using a query string such as `id.lt:1234` or `created_at.lt:2019-06-18T00:00Z`. You can find the maximum ID or `created_at` value in the last image in the sort order of each request.

This improved method only requires loading `per_page` results into memory and sorting them, leading to faster response times for you and less server load for me.
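
As a rough sketch of the suggested paging loop (hypothetical C++, not the downloader's actual Qt code; the `Image` struct, `fetchPage` callback, and fake backend are invented for illustration, and only the `id.lt:` query term comes from the suggestion above):

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

// Minimal stand-ins: the real downloader keeps full image metadata and
// fetches pages over HTTP with Qt's network classes.
struct Image { std::int64_t id; };
using FetchPage = std::function<std::vector<Image>(const std::string&)>;

// Keyset-style paging: instead of a page number, append an id.lt:
// constraint built from the last image of the previous response, so the
// server only ever has to sort per_page documents per request.
std::vector<Image> fetchAll(const std::string& baseQuery, const FetchPage& fetchPage) {
    std::vector<Image> all;
    std::optional<std::int64_t> maxId;  // unset on the first request
    for (;;) {
        std::string query = baseQuery;
        if (maxId) query += ",id.lt:" + std::to_string(*maxId);
        std::vector<Image> page = fetchPage(query);
        if (page.empty()) break;  // past the end of the results
        maxId = page.back().id;   // boundary for the next request
        all.insert(all.end(), page.begin(), page.end());
    }
    return all;
}

int main() {
    // Fake backend serving ids 10..1, newest first, 3 per "page".
    auto fake = [](const std::string& query) {
        std::int64_t upper = 11;  // exclusive upper bound on returned ids
        if (auto pos = query.find("id.lt:"); pos != std::string::npos)
            upper = std::stoll(query.substr(pos + 6));
        std::vector<Image> page;
        for (std::int64_t id = upper - 1; id > 0 && page.size() < 3; --id)
            page.push_back({id});
        return page;
    };
    for (const Image& img : fetchAll("safe", fake))
        std::cout << img.id << ' ';  // prints 10 9 8 7 6 5 4 3 2 1
}
```

Each request derives its `id.lt:` bound from the last image of the previous response, so the server never has to skip past earlier pages.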

Sibusten commented 5 years ago

Hi Liam, thank you for explaining why paging is expensive for your server. I didn't realize this and thought that accessing pages was the proper way to do it. I will work on changing how metadata is fetched to use this method.

Does this mean that there is no efficient way to use sorting methods other than upload date?

liamwhite commented 5 years ago

It isn't quite right to say that there is no efficient way to sort, because Elasticsearch is actually extremely efficient at sorting. Let me try to explain.

When you request page 1 of a search, Elasticsearch will be doing a few different things.

First, it will need to actually run your query in the query execution phase. Most things you can access in the site search grammar directly map to Elasticsearch features (though we have a compiler that stands in the way). This is the fastest part. On each Elasticsearch node, Lucene will intersect/union bitmaps for each of the specific terms you requested, and will generate bitmaps in compressed format on the fly from a K-D tree for range queries. This results in a final bitmap describing the resultset, with an accurate count of all matching documents.
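
As a toy illustration of that query phase (none of this is Lucene's actual code; the tags, document count, and bitmap contents are invented, with `std::bitset` standing in for Lucene's compressed postings):

```cpp
#include <bitset>
#include <iostream>

int main() {
    // Toy version of the query phase: one bit per document, one bitmap
    // per term. Lucene's real postings are compressed, but the idea is
    // the same.
    constexpr std::size_t kDocs = 16;
    std::bitset<kDocs> safe   ("1011011111001101");  // docs tagged "safe"
    std::bitset<kDocs> pony   ("1110011010101101");  // docs tagged "pony"
    std::bitset<kDocs> scoreOk("0111111111111100");  // score range bitmap from the K-D tree

    // Intersecting the per-term bitmaps yields the final resultset
    // bitmap, and counting its bits gives the exact hit count for free.
    std::bitset<kDocs> result = safe & pony & scoreOk;
    std::cout << "matches: " << result.count() << '\n'
              << "bitmap:  " << result << '\n';
}
```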

Second, the results need to be sorted. This is done in Lucene via a partial sort. Consider that sorting is in general known to be O(n log n). When you only need the highest k documents of this set, this time complexity can be changed to O(n log k) via a partial sort, and this is what Lucene calls TopDocs. This is where you might have noticed the problem. When you make a search request for 25 images at page 44530, this TopDocs optimization basically goes out the window, because Lucene must now instead sort the top 1,113,250 documents and discard all but 25 of them.¹ This is the basic gist of the problem.
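
The same cost profile can be reproduced with `std::partial_sort`, which is an O(n log k) partial sort in the spirit of TopDocs (a toy sketch with invented scores and sizes, not Lucene's implementation):

```cpp
#include <algorithm>
#include <functional>
#include <iostream>
#include <random>
#include <vector>

int main() {
    // Toy stand-in for TopDocs: relevance scores for 1'000'000 documents.
    std::vector<int> scores(1'000'000);
    std::mt19937 rng(42);
    for (int& s : scores) s = static_cast<int>(rng());

    // To serve a page, the top page * per_page entries must be kept
    // sorted; std::partial_sort does this in O(n log k).
    const std::size_t perPage = 25, page = 1;
    const std::size_t k = std::min(perPage * page, scores.size());
    std::partial_sort(scores.begin(), scores.begin() + k, scores.end(),
                      std::greater<>());

    // The requested page is only the last perPage entries of the sorted
    // prefix; at page 44530, k grows to 1'113'250 and everything before
    // the final 25 was sorted just to be discarded.
    for (std::size_t i = k - perPage; i < k; ++i)
        std::cout << scores[i] << '\n';
}
```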

Finally, the documents are returned, and you get your search results.

I hope that explanation helps.


¹ There are more efficient ways to find documents at an arbitrary offset, but they are not that much faster in practice, and Elasticsearch in general cannot use these anyway except in single-node setups, which are uncommon.

Sibusten commented 5 years ago

That explanation helps a lot, thank you.

Sibusten commented 5 years ago

I've pushed a commit (1f695da54ffdab8493e18644e8e7ec25e0d17b59) that changes searches by upload date to use `id` constraints instead of pages. The change will be included in the next release.

Sibusten commented 5 years ago

Added to v1.4.5

liamwhite commented 5 years ago

Thanks :+1: