aio-libs / aiohttp

Asynchronous HTTP client/server framework for asyncio and Python
https://docs.aiohttp.org
Other
15.14k stars 2.02k forks source link

Redirect not been followed #2502

Open delijati opened 7 years ago

delijati commented 7 years ago

Long story short

Redirect not been followed. I always get a 302 back.

Expected behaviour

Get a 200 back with text filled.

Steps to reproduce

import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url, allow_redirects=True) as response:
        return await response.text(), response.status

async def fetch_all(session, urls, loop):
    results = await asyncio.gather(
        *[fetch(session, url) for url in urls],
    )

    for ret in results:
        print("status: %s len: %s" % (ret[1], len(ret[0])))
    return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    urls = ['https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python']
    with aiohttp.ClientSession(loop=loop) as session:
        loop.run_until_complete(fetch_all(session, urls, loop))
kxepal commented 7 years ago

Try curl -v -L 'https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python' - it produce redirection loop. It looks like not a aiohttp, but server bug.

delijati commented 7 years ago

with requests

In [1]: import requests
In [2]: res = requests.get("https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python")
In [3]: res.status_code
Out[3]: 200
In [4]: len(res.text)
Out[4]: 171410
kxepal commented 7 years ago

Not sure what requests does, but even browsers reports about improper redirection caused by this URL. I'm sure we can investigate what the request does to handle it - it's very interesting how -, but I guess it's not our problem if reference HTTP client and browsers tells that there is a redirection issue.

delijati commented 7 years ago

Narrowed it down to setting cookies. http.cookies.__parse_string aka http.cookies.SimpleCookie.load can't parse this cookie [1] str. requests is using http.client.parse_headers.

Possible solutions:

[1] JSESSION_ID_PROFILE_APP=A3C47AF767CCC88FA36B8292A6FB94A3; Path=/; HttpOnly, gp_intern=(null);path=/;Secure, gp_intern=(null);path=/;Secure

kxepal commented 7 years ago

Nice found! So the problem looks like that aiohttp doesn't set / preserve cookies during redirects and that site sets them though thme (horrible solution). But at least now it's clear why happens what we see with curl. Would you like to submit PR?

asvetlov commented 7 years ago

Not sure if we should fix it. gp_intern is not standard property for cookies, http.cookies raises an exception for it. Actually the property name is HttpOnly, gp_intern and value is (null). Another property is Secure, gp_intern with (null) value. For me it looks like a garbage. Multiple Path properties are not supported by standard too.

kxepal commented 7 years ago

@asvetlov For me I see it as a buggy server which forces to set cookies via redirect to pass you through. Honestly, it's horrible solution, but technically it may be correct.

asvetlov commented 7 years ago

The site is protected from usage by correctly implemented client libraries. Well, @delijati could pass allow_redirects=False to session.get() and process redirections on his own.

Honestly I have no idea where and when we will stop If we try process every buggy quirk out of the box.

delijati commented 7 years ago

@asvetlov I'm kind of unsure what would be right solution. Fixing it feels like a ie6 css quirks but any browser I tested sets the the cookie.

Tested with Node:

var request = require('request');

const url = 'https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python';

request(url, function(error, response, body) {
    console.log('error:', error); // Print the error if one occurred
    console.log('statusCode:', response && response.statusCode); // Print the response status code if a response was received
    console.log('body:', body); // Print the HTML for the Google homepage.
});

Error:

(node:22299) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 pipe listeners added. Use emitter.setMaxListeners() to increase limit
error: Error: Exceeded maxRedirects. Probably stuck in a redirect loop https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python
    at Redirect.onResponse (/opt/develop/privat/mobile/death/node_modules/request/lib/redirect.js:98:27)
    at Request.onRequestResponse (/opt/develop/privat/mobile/death/node_modules/request/request.js:996:22)
    at emitOne (events.js:96:13)
    at ClientRequest.emit (events.js:188:7)
    at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:473:21)
    at HTTPParser.parserOnHeadersComplete (_http_common.js:99:23)
    at TLSSocket.socketOnData (_http_client.js:362:20)
    at emitOne (events.js:96:13)
    at TLSSocket.emit (events.js:188:7)
    at readableAddChunk (_stream_readable.js:176:18)
statusCode: undefined
body: undefined
delijati commented 7 years ago

I went a bit deeper the rabbit hole. aiohttp uses SimpleCookie which implements the spec rfc2109. requests uses Cookie the newer rfc2965 witch obsolets spec rfc2109.

Firefox on the other side implements something custom RFC2109/2616

So long story short aiohttp should use the newer Cookie implementation.

asvetlov commented 7 years ago

Thanks for shading a light on the problem.

aiohttp cannot just use http.cookiejar.Cookie -- it has incompatible interface with http.cookies.SimpleCookie. rfc2965 is obsoleted by RFC6265 BTW. The RFC explicitly enumerates all allowed cookie properties (and header name is Set-Cookie, Set-Cookie2 is obsolete). The list of properties is fixed. For SameSite where is a new RFC draft but it is not accepted yet.

asvetlov commented 7 years ago

Sorry, the RFC specifies extension-av rule as extension-av = <any CHAR except CTLs or ";">.

Ideas?

blueyed commented 5 years ago

SimpleCookie.__parsestring is just too strict: it also fails to handle expiration dates incorrectly (parses them based on the sane-cookie-date only), see https://bugs.python.org/issue35824#msg334392. (this means that aiohttp is affected by this (https://github.com/aio-libs/aiohttp/blob/9504fe2affaaff673fa4f3754c1c44221f8ba47d/aiohttp/cookiejar.py#L175), i.e. it cannot handle the following cookie from GitHub: Set-Cookie: has_recent_activity=1; path=/; expires=Mon, 22 Apr 2019 23:27:18 -0000 (due to "-0000" instead of "GMT" at the end).

The interface for

aiohttp cannot just use http.cookiejar.Cookie -- it has incompatible interface with http.cookies.SimpleCookie.

But couldn't it be changed to be based on http.cookies.CookieJar (similar to how requests does it)?

Another/simpler solution might be to pass an adjusted patt to __parsestring (using it directly).

Currently aiohttp fails to handle 3 out of 4 cookies sent by github.com:

   5     async def main():
   6         cjar = aiohttp.CookieJar()
   7         async with aiohttp.ClientSession(cookie_jar=cjar) as session:
   8             async with session.get('https://github.com') as response:
   9                 pass
  10  ->     __import__('pdb').set_trace()
 return None
(Pdb++) pp response.headers.getall("set-cookie")
['has_recent_activity=1; path=/; expires=Fri, 03 May 2019 00:10:35 -0000',
 '_octo=GH1.1.393119700.1556838635; domain=.github.com; path=/; expires=Sun, 02 May 2021 23:10:35 -0000',
 'logged_in=no; domain=.github.com; path=/; expires=Mon, 02 May 2039 23:10:35 -0000; secure; HttpOnly',
 '_gh_sess=NXXXf; path=/; secure; HttpOnly']
(Pdb++) pp len(response.headers.getall("set-cookie"))
4
(Pdb++) pp list(cjar)
[{'comment': '',
  'domain': 'github.com',
  'expires': '',
  'httponly': True,
  'max-age': '',
  'path': '/',
  'samesite': '',
  'secure': True,
  'version': ''}]
(Pdb++) pp len(list(cjar))
1
danilobellini commented 5 years ago

@asvetlov For me I see it as a buggy server which forces to set cookies via redirect to pass you through. Honestly, it's horrible solution, but technically it may be correct.

From Section 3 in RFC6265:

[...] User agents MAY ignore Set-Cookie headers contained in responses with 100-level status codes but MUST process Set-Cookie headers contained in other responses [...]

I'm not sure if that server is/was buggy, but the client implementation is broken w.r.t. that RFC when it ignores cookies set in HTTP 3xx responses. On the other hand, the aiohttp documentation states:

Response cookies contain only values, that were in Set-Cookie headers of the last request in redirection chain.

I found it surprising that cookies set on redirection (HTTP 302 where I've tried) are discarded by aiohttp, though I'm always using a ClientSession instance. I still don't know the reason but now the allow_redirects need to be always False for a scraper I'm working in. For now I'm applying the consecutive requests one by one as a workaround (I'm using aiohttp 3.5.4, Python 3.7.3).

asvetlov commented 5 years ago

If somebody wants to propose a pull request with the fix -- I'll be happy to review it.