edgi-govdata-archiving / wayback

A Python API to the Internet Archive Wayback Machine
https://wayback.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
61 stars 12 forks source link

Async support #80

Open luckydonald opened 3 years ago

luckydonald commented 3 years ago

When dealing with web requests libraries can gain a lot from being able to making other stuff while waiting for it.

Basically the easiest approach would be to split the actual logic and the requests in separate functions, and move the logic to a base class. From that we would have a subclass each for sync and one for async operation. They would mostly differ in the fact that the async one has async and await in front of definitions and calls respectively.

To give a implementation idea:

class WaybackClientLogic(_utils.DepthCountedContext):
  def _calcualte_final_query(…):
     ...

  def _postprocess_response(…):
     ...
class WaybackSyncClient(WaybackClientLogic):
  def search(…):
    final_query = self._calcualte_final_query(…)
    response = self.session.request('GET', CDX_SEARCH_URL, params=final_query)
    self._postprocess_response(response)

WaybackClient = WaybackSyncClient  # to keep compatibility with old imports
class WaybackAsyncClient(WaybackClientLogic):
  async def search(…):
    final_query = self._calcualte_final_query(…)
    response = await self.session.request('GET', CDX_SEARCH_URL, params=final_query)
    self._postprocess_response(response)
danielballan commented 3 years ago

This makes sense to me. In a similar situation in https://github.com/caproto/caproto I came to prefer composition with rather than inheritance from the logic. Among other things it makes it easier to test.

Are you interested in working on this?

Mr0grog commented 3 years ago

FWIW, I think it’s kind of tough to separate out that way, because of how involved some of the logic is across many HTTP requests, e.g. for get_memento(): https://github.com/edgi-govdata-archiving/wayback/blob/024a80ec0eae44fcf64e3d834d5f43c78937b8fb/wayback/_client.py#L726-L831

Not to say it can’t be done! (That function could probably also be cleaned up a bit.) But I think you’re going to have to get significantly more creative to make this work well in practice. It might be easier to rewrite the whole thing async, and provide a sync wrapper that starts an event loop and runs it on a thread.