4dn-dcic/utils

various util modules shared amongst several projects in our organization
MIT License

Asynchronous versions of API wrappers #278

Open · maxwibert opened 1 year ago

maxwibert commented 1 year ago

It would be wonderful if we had asynchronous versions of the API call wrapper functions in this library. For example, if we had an asynchronous function dcicutils.ff_utils.afaceted_search, then it would be "awaitable", meaning users could run the following code:

async def myFunction(key, query_kwargs):
  ...
  batch = await ff_utils.afaceted_search(key=key, **query_kwargs)
  # do something with batch
  ...

This would ensure the library isn't holding the client's global interpreter lock while the server is putting together the metadata for possibly thousands of records.

willronchetti commented 1 year ago

Thanks for the feature request! We will definitely consider it.

What use case are you looking at in this scenario, though? Ideally you would use the generator capability in search_metadata to automatically paginate results so you are not hanging on very long requests. We generally recommend the generator/pagination approach for more efficient retrieval of results, especially if you're looking to go through a large set of data.

For example, you can use the code below to automatically paginate through all files without hanging on a single request. The generator paginates the requests automatically as it is iterated through.

search_gen = ff_utils.search_metadata(url + 'search/?type=File',
                                      key=integrated_ff['ff_key'],
                                      is_generator=True)
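
You then just iterate over the generator like any other iterable; each new page is fetched lazily as the previous one is exhausted, along these lines (handle is only a placeholder for whatever you do with each record):

for item in search_gen:
    # each 'item' is a single result dict; the next page is requested
    # behind the scenes once the current one runs out
    handle(item)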

By default you will get 50 results per page, but you can tune this by passing the page_limit parameter directly to the function. You can also limit the total number of results with the limit URL parameter, i.e.:

search_gen = ff_utils.search_metadata(url + 'search/?type=File&limit=1000',
                                      key=integrated_ff['ff_key'], page_limit=100,
                                      is_generator=True)
maxwibert commented 1 year ago

My lab is building a project that will hopefully index all the datasets from NIH-supported scientific data repositories. Researchers could, for example, start their search from our site with a general idea of what they're looking for, and we would send them to the 4DN page for a dataset if it was a good match. So my use case is that we need to semi-regularly ingest large quantities of metadata from 134 data repositories such as 4DN. Pagination is a fair solution for most use cases, but I think it is likely to increase my global lock time in O(1/n), where n is the number of results per page.

netsettler commented 1 year ago

We're guessing it may suffice for your purposes to use dcicutils.task_utils.pmap for the tasks you contemplate. It offers a mechanism for achieving some degree of parallelism in your functions, if they'll be doing a lot of network waits, without us reprogramming a lot of individual interfaces.
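
For example, something along these lines might work for a batch of searches (an untested sketch that reuses the url and integrated_ff placeholders from the snippets above, and assumes pmap can be used like the builtin map; check task_utils for the exact keyword options if you want to tune chunking):

from dcicutils import ff_utils
from dcicutils.task_utils import pmap

queries = ['search/?type=File&limit=1000',
           'search/?type=ExperimentSet&limit=1000']   # illustrative only

def fetch_one(query):
    # each call blocks in its own worker, so the network waits overlap
    return ff_utils.search_metadata(url + query, key=integrated_ff['ff_key'])

results = list(pmap(fetch_one, queries))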

As a career language designer myself, I tend toward wanting our tools to offer a consistent theory of how to achieve certain kinds of goals. While I don't doubt that the async paradigm would be convenient if we'd started from it, our system isn't really designed around it, and I'd want to think it through more carefully before dropping that kind of thing in for just a few functions. Since we think we already have adequate functional capability, that's probably good enough for now.

(If I had more time, the issues I'd want to consider relate to an overall sense of resource use. The async marker is very flexible, but also underconstrained: it has no theory about things like chunk size and only a minimal, do-it-yourself sense of error handling. Our pmap tool bundles at least some thought and packaging in those areas. Its chunk size options let you think about how many requests it's both efficient and sociable to have in flight at any given time, in case unexpected remote responsiveness or caching issues arise, and it bundles error handling in a useful way that you'd otherwise have to build in less packaged form with some kind of gathering, etc.)
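
As a rough illustration of that do-it-yourself form: assuming an awaitable afetch wrapper existed (it doesn't today), the caller would be left to assemble something like the following, where asyncio.gather's return_exceptions flag is about all the error bundling the framework provides out of the box:

import asyncio

async def ingest_all(queries):
    # 'afetch' is hypothetical; dcicutils has no such coroutine today
    tasks = [afetch(q) for q in queries]
    # chunking, retries, and error policy are entirely up to the caller
    return await asyncio.gather(*tasks, return_exceptions=True)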