dandi / dandi-cli

DANDI command line client to facilitate common operations
https://dandi.readthedocs.io/
Apache License 2.0
22 stars 25 forks source link

multithread get_assets (or may be other endpoints) as well pagination #1019

Closed yarikoptic closed 2 years ago

yarikoptic commented 2 years ago

eyeballing the output of snippet in https://github.com/dandi/dandi-cli/issues/1018#issuecomment-1137227203 shows wait times after each page. That could be avoided. Pagination output returns count in the initial response. our pagination is simple as in specify the page number (no "next" token). So it should be trivial to

pagination is similar for other end points (e.g. /dandisets/{versions__dandiset__pk}/versions/{versions__version}/assets/paths/) and I think overall generic. So I think we can implement it generally for any pagination requests to submit multiple pages requests in parallel and speed up interactions with the archive this way.

jwodder commented 2 years ago

@yarikoptic Exactly which endpoints do you want this implemented for, and should those endpoints always use this? Currently, the client paginates the following:

yarikoptic commented 2 years ago

Let's do it at the generic level of pagination handling in paginate. Threading is fast, so let's do threaded fetching of pages >=2 if overall there is more than 1 page. Please add add a check that the next for the first returned result is in the expected form ({base_url}?page=2), and if not -- issue a WARNING and if not os.environ.get("DANDI_PAGINATION_DISABLE_FALLBACK", False) -- continue serially with current implementation, and otherwise

raise AssertionError(f'API changed the pagination strategy. "next" URL {fetched_next_url} is different from expected {next_url}')

or alike.

Then in CI we could export DANDI_PAGINATION_DISABLE_FALLBACK in one of the matrix runs so we detect when/if API changes how pagination is done, and clients would still perform reliably even if pagination changes.

jwodder commented 2 years ago

@yarikoptic I implemented this, and then I wrote a test that creates a Dandiset with a bunch of assets and paginates through them. In order to ensure that there would be more than five pages even after https://github.com/dandi/dandi-archive/pull/1107 is merged, I set the number of assets to 617. Unfortunately, uploading that many assets (even when they're tiny) takes too long for the test suite; the shortest amount of time I could get this down to was about 2 and a half minutes, and that was by foregoing upload() in favor of just calling LocalAsset.upload() directly (and parallelizing the uploads in a ThreadPoolExecutor only shaved off a couple seconds).

(Threadedly paginating through the assets afterwards takes about 15 seconds, and that was using an earlier number of 1234 assets.)

yarikoptic commented 2 years ago

can you upload 5 assets, and set page_size to 1 to get 5 pages?

jwodder commented 2 years ago

@yarikoptic get_assets() doesn't have a page_size option; only paginate() does. Should all the methods that call paginate() gain this option?

jwodder commented 2 years ago

@yarikoptic Alternatively, per_page (and lookahead?) could be instance attributes of the client instead of method arguments.

yarikoptic commented 2 years ago

Since it is such a low level detail, I am leaning toward client instance attribute. May be even not bother interfacing it in its constructor, just have it defined at the level of a class and be allowed to be overloaded for an instance. But at large -- up to you.

github-actions[bot] commented 2 years ago

:rocket: Issue was released in 0.40.0 :rocket: