apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0
1.93k stars 979 forks source link

HTTP Storage Plugin INDEX pagination mode is field-order sensitive and should not be #2895

Open hyperbolix opened 6 months ago

hyperbolix commented 6 months ago

Describe the bug HTTP Storage Plugin INDEX pagination mode is field-order sensitive and should not be (JSON is supposed to be unordered). When the pagination fields 'has_next' and 'next_page' come after the data path field 'results', the presence of a next page is ignored. The next page fields are only honored when they come before the data path field.

To Reproduce Steps to reproduce the behavior:

  1. Start Drill.
  2. Start the Python3 HTTP server that demonstrates the issue: https://gist.github.com/hyperbolix/b66f07d5d007ca2df321550241d748ac
  3. Enable the http storage plugin and configure it using this configuration: https://gist.github.com/hyperbolix/7df108c8169cf54bcc72fdc499be1dce
  4. Open the Drill query interface and issue this query: select * from http.a_bug_test where total_results_val=300 and page_size_val=10 limit 15
  5. See that it returns 10 results, but it should return 15.
  6. Modify the Python3 HTTP server, moving lines 52-58 so they come before line 50, so that the 'has_next' and 'next_page' fields come before the 'results' field.
  7. Restart the Python3 HTTP server and issue the query again from step 4.
  8. Observe than now 15 results are returned.

Expected behavior The INDEX pagination mode page token fields should be honored regardless of field ordering in the JSON response. Or, at a minimum, the documentation should clearly indicate that these fields must come before the results field (which is not the order in the example in the HTTP pagination readme today).

Error detail, log output or screenshots When the 'has_next' and 'next_page' fields come after the data path 'results' field, they are ignored, meaning the result set is incorrectly trimmed short, as though there are not additional pages, yet there are additional pages.

Drill version Observed in 1.21.1 and also in the latest commit as of 2024-03-26: 749772cb0bd83c1a8fe455410ec80b1e5a9bf239

cgivre commented 6 months ago

@hyperbolix I've been reading this ticket and I'm a little confused here. The index pagination is intended to be used when the API in question returns an index or an index link to indicate the next page.

With that said, the parameters page_size and total_results suggest that the correct pagination method would be page pagination. (https://github.com/apache/drill/blob/master/contrib/storage-http/Pagination.md#page-pagination). Page pagination is when the API provides paging information such as the page size and result count.

If you use the page pagination with a limit, that will work and there are unit tests for that.

hyperbolix commented 6 months ago

This test rig doesn't exactly represent the system I've been experimenting with using Drill that led to the discovery of this bug. It was simply the most expedient way to demonstrate the issue. It helped to be able to tune total result size and max results per page to prove the issue would occur at various page sizes, and it helped to keep the 'next_page' token simple.

I appreciate your guidance in the matter; I can assure you the I am actually using page tokens in my experiments thereby necessitating the use of the 'index' pagination method. I can't recall a single cloud computing list API that doesn't use page tokens, and if I came across one, I'd be worried about using it.