aio-libs / aiohttp

Asynchronous HTTP client/server framework for asyncio and Python
https://docs.aiohttp.org

asyncio.TimeoutError downloading large files #2249

Closed djyotta closed 7 years ago

djyotta commented 7 years ago

Long story short

Long downloads (>5 GB) time out with asyncio.TimeoutError (concurrent.futures._base.TimeoutError).

Expected behaviour

Downloading arbitrarily large files should work.

Actual behaviour

Fails ~5 GB in with asyncio.TimeoutError.

Steps to reproduce

Try downloading a large file (>5 GB):

import asyncio

import aiohttp

@asyncio.coroutine
def download_file(session, url, buffer_size=16144):
    # session is an aiohttp.ClientSession created elsewhere
    resp = yield from session.get(url)
    while True:
        # Read the body in chunks; an empty chunk means EOF.
        chunk = yield from resp.content.read(buffer_size)
        if not chunk:
            break
    resp.close()

Your environment

shell> pip3 list
aioes (0.7.2)
aiohttp (2.2.5)
asn1crypto (0.22.0)
async-timeout (1.3.0)
asyncio (3.4.3)
asyncssh (1.8.1)
basicauth (0.4.1)
cffi (1.10.0)
chardet (3.0.4)
cryptography (2.0.2)
idna (2.0)
multidict (3.1.3)
pexpect (4.0.1)
pip (8.1.1)
ply (3.10)
ptyprocess (0.5.1)
py (1.4.34)
pyasn1 (0.2.3)
pycparser (2.18)
pycryptodome (3.4.6)
pysmi (0.1.3)
pysnmp (4.3.9)
pytest (3.2.1)
pytest-asyncio (0.6.0)
pytest-repeat (0.3.0)
python-dateutil (2.4.2)
requests (2.10.0)
setuptools (20.7.0)
six (1.10.0)
snimpy (0.8.10)
wheel (0.29.0)
yarl (0.12.0)

Workaround

Use the requests library instead

import asyncio

import requests

@asyncio.coroutine
def download_file(url, buffer_size=16144):
    # Note: requests is blocking, so this stalls the event loop while it runs.
    resp = requests.get(url, stream=True)
    for chunk in resp.iter_content(chunk_size=buffer_size):
        continue
    resp.close()
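
Since requests is a blocking library, calling it directly inside a coroutine stalls the event loop for the whole download. A minimal sketch of offloading the blocking call to a worker thread with run_in_executor (the helper name blocking_download is illustrative, not part of the original report):

import asyncio
import functools

import requests

def blocking_download(url, buffer_size=16144):
    # Plain blocking download; runs in a worker thread below.
    resp = requests.get(url, stream=True)
    for chunk in resp.iter_content(chunk_size=buffer_size):
        continue
    resp.close()

@asyncio.coroutine
def download_file(url, buffer_size=16144):
    loop = asyncio.get_event_loop()
    # None selects the default ThreadPoolExecutor, keeping the loop responsive.
    yield from loop.run_in_executor(
        None, functools.partial(blocking_download, url, buffer_size))
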
hellysmile commented 7 years ago

Can you try:

resp = yield from session.get(url, timeout=None)
samuelcolvin commented 7 years ago

As @hellysmile says, this is simply a case of changing the timeout (the default is 5 minutes).

For reference, here's an example which happily downloads a 10 GB file:

import asyncio
from time import time

import aiohttp

async def main():
    total_size = 0
    start = time()
    async with aiohttp.ClientSession() as session:
        # timeout=None disables the default 5-minute total timeout
        async with session.get('http://speedtest.tele2.net/10GB.zip', timeout=None) as r:
            while True:
                chunk = await r.content.read(16144)
                if not chunk:
                    break
                total_size += len(chunk)
                print(f'{time() - start:0.2f}s, downloaded: {total_size / (1024 * 1024):0.0f}MB')

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
djyotta commented 7 years ago

Thanks. Looks like I didn't RTFM hard enough.

danielrobbins commented 6 years ago

Timeout is not mentioned on this page: http://aiohttp.readthedocs.io/en/v3.0.1/client_advanced.html#client-session . read_timeout is mentioned here, but it's pretty hard to spot: http://aiohttp.readthedocs.io/en/v3.0.1/client_reference.html#aiohttp.ClientSession

I don't see timeout documented at all. This timeout default needs to be covered in the introductory ClientSession docs, as it's annoying to have to Google and dig into the source code to learn that things are dying because of a built-in 5-minute default.
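
For readers landing here later: newer aiohttp releases (3.3+) expose these defaults through aiohttp.ClientTimeout, which can be configured once per session. A sketch, where the total=None / sock_read=30 values are illustrative choices, not recommendations from this thread:

import aiohttp

# No cap on total transfer time, but abort if a single read stalls for 30s.
timeout = aiohttp.ClientTimeout(total=None, sock_read=30)

async def fetch(url):
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as resp:
            return await resp.read()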

danielrobbins commented 6 years ago

And yes, I spent the better part of a day trying to figure out why my aiohttp-based spider was dying on a handful of large files. Not fun.

danielrobbins commented 6 years ago

I would actually recommend turning off transfer timeouts entirely by default. HTTP is used for so many things that it is not a great idea to assume a 5-minute default timeout is actually helpful.

asvetlov commented 6 years ago

Would you create a Pull Request for documentation improvement?

aiohttp should have a reasonable default timeout precisely because HTTP is used for so many things. Hanging for hours without any response to the user is even worse than an explicit timeout error.

danielrobbins commented 6 years ago

Yes, I can do a pull request. I disagree with your perspective. Consider that a download of a 1 GB file can easily take more than 5 minutes. Having aiohttp abort these HTTP transfers in its default configuration is problematic. Clients are requesting files from servers, and servers generally do a good job of ending connections. Clients should continue to communicate with the server as long as it is active.

If there is going to be a default timeout, it should not be based on the total connection time but on the time since the last packet was transmitted. Using total connection time is ineffective. Aborting after 30 seconds with no data transferred in either direction is a reasonable default; the current approach is not.
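
A per-read idle timeout along these lines can be approximated in user code without any changes to aiohttp, by wrapping each chunk read in asyncio.wait_for. A sketch; the function name and the 30-second figure are illustrative:

import asyncio

import aiohttp

async def download_with_idle_timeout(url, idle_timeout=30, buffer_size=16384):
    async with aiohttp.ClientSession() as session:
        # Disable the total timeout; enforce idleness per chunk instead.
        async with session.get(url, timeout=None) as resp:
            while True:
                # The clock restarts on every read, so only a stalled
                # connection (no data for idle_timeout seconds) raises.
                chunk = await asyncio.wait_for(
                    resp.content.read(buffer_size), idle_timeout)
                if not chunk:
                    break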

samuelcolvin commented 6 years ago

Completely disagree: 99.9% of requests that exceed 30s have gone wrong.

Better documentation and a descriptive error would be good, but the default shouldn't change.
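
On the descriptive-error point: until something lands upstream, a caller can at least surface a clearer error by catching the timeout around the request and re-raising with a message that names the likely culprit. A sketch; the wording of the message is illustrative:

import asyncio

import aiohttp

async def fetch(session, url):
    try:
        async with session.get(url) as resp:
            return await resp.read()
    except asyncio.TimeoutError:
        # Point at the session-wide default instead of a bare TimeoutError.
        raise RuntimeError(
            f'GET {url} exceeded the session timeout '
            '(aiohttp defaults to 5 minutes total)') from None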

danielrobbins commented 6 years ago

Samuel, this isn't even a matter of opinion. Killing a completely valid, RFC-compliant HTTP connection by default just because you think it has been running too long, so it's "probably wrong", is idiotic. Who are you to make this determination about someone else's HTTP request? I mean no offense to you personally, but the opinion you express is frankly worthy of ridicule.

fafhrd91 commented 6 years ago

@danielrobbins aiohttp has enough flexibility regarding timeouts. Did you have a chance to check the documentation, or are you just complaining? In any case, your tone is not very appropriate.

danielrobbins commented 6 years ago

@fafhrd91 My intent is to be respectful but in very strong disagreement with the decision, which I believe to be foolish, and I can mock the decision without mocking the person. There is no way for you to determine "tone" through electronic communication. Regarding the technical decision: shall we have two versions of aiohttp, an "idiot's edition" and a "professional edition"? In the idiot's edition, let's assume that the user is a moron and kill their HTTP requests after 5 minutes as a service to them, because they are probably just being stupid, and because most people are idiots. Then in the professional edition, we can assume that the user actually wants the HTTP request to complete successfully. This may sound like a moronic idea, but it's essentially what upstream is doing. The requests module allows large transfers to complete successfully because it makes the smart decision to be RFC-compliant and not to try to "protect" the user from imaginary error conditions.

fafhrd91 commented 6 years ago

I don't have any intention to continue this communication. This is open source software; fork and change it however you want.

danielrobbins commented 6 years ago

Sure, I'm happy to fork and create patches for Gentoo and Funtoo. And again, no disrespect intended. If you are insulted by having someone think that your idea is bad, sorry, but you'll need to get used to it. People expect things to work in an RFC-compliant manner by default, particularly in Python modules that claim to implement a standard protocol.

danielrobbins commented 6 years ago

Consider this critical but very real customer feedback. I am using aiohttp for non-trivial projects and do not like this default behavior at all. It caused me much frustration.

samuelcolvin commented 6 years ago

I get your point of view, but it's just a default. Easy enough to change.

Library decisions that constrain you and can't be overridden are very annoying, but it's hard to see why this default upsets you so much.

danielrobbins commented 6 years ago

@samuelcolvin Because, as I mentioned, it caused me to waste most of a day trying to track it down. That is frustrating, and my frustration is expressed in my very strong negative reaction to this decision. I think it is a poor default, and I am sure others will continue to run into this issue. While improved documentation can be considered a workaround, I just don't see the point of artificially terminating HTTP connections that you deem to have existed for too long. I don't buy the argument that most HTTP connections lasting over 5 minutes are invalid; it just seems extremely silly to me. Anyway, I have made my point, and now it's time to end this conversation.

danielrobbins commented 6 years ago

Please note that this bug report may also be related to this unfortunate default: https://github.com/aio-libs/aiohttp/issues/2849

There are potentially other bug reports that could be related as well. While not confirmed, I would expect a steady stream of new bug reports ultimately caused by this particular problem, since it's neither RFC-compliant nor expected behavior.

asvetlov commented 6 years ago

@danielrobbins please keep calm. Your messages are very emotional and not respectful at all. "Smart", "foolish", and "idiotic" are not appropriate terminology. If you want to vent your negative emotions, please do it outside of the tracker; this place is not a substitute for free psychoanalysis.

I don't want to continue this useless and emotionally heated conversation. If you want to discuss a technical question, please create a new issue.

asvetlov commented 6 years ago

@aio-libs/aiohttp-committers please don't hesitate to press the "Lock conversation" button if needed. A frozen thread can be unlocked easily, but there is no way to take back offensive words once they have been written and read.