edgi-govdata-archiving / wayback

A Python API to the Internet Archive Wayback Machine
https://wayback.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
61 stars 12 forks source link

Sessions should have a default timeout #66

Closed Mr0grog closed 3 years ago

Mr0grog commented 3 years ago

In https://github.com/edgi-govdata-archiving/web-monitoring-processing, we got through a lot of work to carefully configure sessions for downloading Mementos, but have a bug I’m not fairly sure is caused by not having timeouts on searches (it is possible for for the CDX API to hang for… a long time).

I think we need to pick some reasonable default timeout on WaybackSession to help avoid this issue. Not yet sure what good values would be.

lion-sz commented 3 years ago

Setting a timeout in the Session did not solve the problem, since (from what I understand) it is never applied.

WaybackSession.send sets the timeout that is passed on to the send call to the session's timeout, only if both the session's timeout is not None and the kwargs do not already contain a timeout. But when no timeout is specified in the original call, the kwargs still contains a timeout key with None as value, therefore the sessions timeout does not overwrite this.

This snippet shows the problem:

import requests
class CustomSend(requests.Session):

    def __init__(self):
        super().__init__()

    def send(self, *args, **kwargs):
        print(kwargs)
        super().send(*args, **kwargs)

session = CustomSend()
res = session.request('GET', 'https://duckduckgo.com')

{'timeout': None, 'allow_redirects': True, 'verify': True, 'proxies': OrderedDict(), 'stream': False, 'cert': None}

I've fixed the issue here: https://github.com/LionSzl/wayback/commit/46ee8c60b20d269c33b4ad0f7b7498f77fb580e9 Should I make a pull request?

Mr0grog commented 3 years ago

Whoa, good catch!

A PR would be great, BUT I’m not sure about that particular solution, since it prevents someone from sending a request with no timeout if for some reason they need to. I can think of two approaches:

  1. Allow timeout=0 or timeout=(x, y) (where x or y could be 0) and change values <= 0 to None.

  2. Also override WaybackSession.request() and add the logic there, too.

Neither seems great (departing in a weird way from Requests’ API vs. doubled-up, redundant logic), but both solve the issue. I'd lean towards (2) since it doesn't change how the API is supposed to work. If you have another, better idea, feel free to use it in your PR.

I also still think it makes sense to set a default.

lion-sz commented 3 years ago

I think of these two overriding WaybackSession.requests would be the better way, especially since it does not require much additional code. However, for now there is no way to actually provide a timeout value to the requests function without hacking the code. I've also set a default timeout of 60 seconds, I guess this should be a reasonable upper limit for most use cases.