Add a page_size method to container.values(), keys(), items()

cjtitus commented 10 months ago

Summary

We should add a page_size keyword to the values(), keys(), and items() methods of Tiled containers to allow for advanced control of the amount of data downloaded at one time.

Background

While testing python code to access Tiled repositories, it was noticed that catalog.values()[index] is much slower than catalog[uid] despite being the same basic sort of operation. Indeed, the Tiled documentation https://blueskyproject.io/tiled/reference/python-client.html even claims to support "efficient random access".

Digging into the Tiled requests, it seems that all calls to catalog.values() request the default page_size of 100. This means that 100 values will be fetched, even if we know we are only going to access one.

Proposal

I propose a two-fold change to the data access. First, catalog._keys_slice and catalog._items_slice (which actually fetch the data) would be updated so that they request a page_size of max(start - stop, DEFAULT_PAGE_SIZE). This would instantly make all calls to values(), items(), and keys() more efficient.

Second, for advanced use cases, it would be easy to add a page_size keyword to the values(), items(), and keys() functions that could override the default page size. This would be especially useful when populating a table in a GUI with search results from a catalog, where the correct page_size should really be equal to the number of rows that the GUI requests to update.

Comments welcome.

danielballan commented 10 months ago

Thanks for writing this up.

I have no notes on the first part of the proposal (max). That should be done; it was probably an oversight not to do in the first place.

I would suggest a minor change to the second part. The methods keys(), values(), and items() are defined by the collections.abc.Mapping interface. Extending them by adding optional kwargs is certainly possible but a little invasive of a well-established interface of the language. I suspect it could also be hard to discover the option.

What about adding a method to ValuesView which returns another ValuesView? The usage would be c.values().page_size(…). Likewise for keys and items.

This extends our own iterables, which already have custom behavior in []. And it would be discoverable with tab-complete.

danielballan commented 10 months ago

padraic-shafer commented 10 months ago

Something to be mindful of with paging is whether the results could change from call to call.

For a fixed set of results, will they always be returned in the same order? If not, maybe it's simply the client's responsibility to first sort as needed before paginating.

For a writeable catalog, how does an added/deleted item affect the "pages" that get returned? I think I've seen schemes that include an item id in the pagination request. As in, give me the 10 items before uid-xxxx, or the next 20 items after uid-yyyy. Others use a database-like cursor id to track the position in the results stream.

cjtitus commented 10 months ago

@padraic-shafer These are good questions, but I want to be clear that I'm not implementing paging myself. There is already a page_size parameter in the HTTP interface to Tiled that takes care of the paging, and presumably already has an opinion about both of your questions.

What this issue proposes is just to have the Python API start using the page_size parameter in a more dynamic way, rather than have page_size always be set to 100.

@danielballan Adding a page_size method seems fine, if you don't want to mess too much with the dictview format. It fits decently well with the paradigm of adding ever more functions to narrow a catalog down. I.e, for item in catalog.search(...).search(...).search(...).items() is something I already do, so tacking a .page_size(10) on the end certainly fits the pattern. Whether this is really more "discoverable", who knows. I'll try to add it to the documentation. Though building docs is often more of a pain than fixing the code...

danielballan commented 10 months ago

I copied @padraic-shafer's comment (a good comment!) to https://github.com/bluesky/tiled/issues/647 and responded there. We can keep this issue focused on exposing, via the Python client, the existing pagination options supported by the server.

For now, just added a line to the reference docs would be fine, @cjtitus:

https://github.com/bluesky/tiled/blob/c76d1b3bf0468df8497568dfd9d6580207479a40/docs/source/reference/python-client.md?plain=1#L32-L33

Later we can include this in a new how-to guide on performance-tuning for metadata requests.

danielballan commented 1 week ago

Change 1: Add `page_size()` method

The goal is to support selecting a page size, rather than accepting the default. Usage like:

for x in c.values().page_size(5):
    ...

for x in c.items().page_size(5):
    ...

for x in c.keys().page_size(5):
    ...

The __init__ on all classes in https://github.com/bluesky/tiled/blob/main/tiled/iterviews.py should grow a new optional keyword argument page_size=None, new instance state self._page_size, and new public method

def page_size(self, n):
    "Set page size, to tune performance."
    # Return a new instance of this object with all the same state
    # except page_size set to n.
    return type(self)(..., page_size=n)

They should then pass this new self._page_size setting into their calls to _keys_view and _items_view.

Over in tiled.client.container, we need to add an optional parameter page_size to the methods _keys_view and _items_view and pass this page_size into params in the requests issued there. Some refactoring will be required because this code currently uses the next page URL provided by the server as-is. If a page_size is set, we'll need to alter that URL (or construct it from scratch) to incorporate a custom page_size.

https://github.com/bluesky/tiled/blob/0143d81401cced37a774970a03f5ae064642957a/tiled/client/container.py#L346-L450

Use

from tiled.client import show_logs
show_logs()

to see the effect on the HTTP requests.

Change 2: Never request more than we'll use

For use cases like c.values()[:5] we request a full DEFAULT_PAGE_SIZE and then throw away most of the results. If the user doesn't set a page_size (so, page_size is None and we have a bounded start and stop item, we should request a page just large enough to fit them.

bluesky / tiled