Open delijati opened 7 years ago
Try curl -v -L 'https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python'
- it produce redirection loop. It looks like not a aiohttp, but server bug.
with requests
In [1]: import requests
In [2]: res = requests.get("https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python")
In [3]: res.status_code
Out[3]: 200
In [4]: len(res.text)
Out[4]: 171410
Not sure what requests does, but even browsers reports about improper redirection caused by this URL. I'm sure we can investigate what the request does to handle it - it's very interesting how -, but I guess it's not our problem if reference HTTP client and browsers tells that there is a redirection issue.
Narrowed it down to setting cookies. http.cookies.__parse_string
aka http.cookies.SimpleCookie.load
can't parse this cookie [1] str. requests
is using http.client.parse_headers
.
Possible solutions:
http.cookies.__parse_string
tested to be broken in 3.5
, 3.6
SimpleCookie
with a patched versionhttp.client.parse_headers
[1] JSESSION_ID_PROFILE_APP=A3C47AF767CCC88FA36B8292A6FB94A3; Path=/; HttpOnly, gp_intern=(null);path=/;Secure, gp_intern=(null);path=/;Secure
Nice found! So the problem looks like that aiohttp doesn't set / preserve cookies during redirects and that site sets them though thme (horrible solution). But at least now it's clear why happens what we see with curl. Would you like to submit PR?
Not sure if we should fix it.
gp_intern
is not standard property for cookies, http.cookies
raises an exception for it.
Actually the property name is HttpOnly, gp_intern
and value is (null)
. Another property is Secure, gp_intern
with (null)
value.
For me it looks like a garbage. Multiple Path
properties are not supported by standard too.
@asvetlov For me I see it as a buggy server which forces to set cookies via redirect to pass you through. Honestly, it's horrible solution, but technically it may be correct.
The site is protected from usage by correctly implemented client libraries.
Well, @delijati could pass allow_redirects=False
to session.get()
and process redirections on his own.
Honestly I have no idea where and when we will stop If we try process every buggy quirk out of the box.
@asvetlov I'm kind of unsure what would be right solution. Fixing it feels like a ie6
css quirks but any browser I tested sets the the cookie.
Tested with Node:
var request = require('request');
const url = 'https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python';
request(url, function(error, response, body) {
console.log('error:', error); // Print the error if one occurred
console.log('statusCode:', response && response.statusCode); // Print the response status code if a response was received
console.log('body:', body); // Print the HTML for the Google homepage.
});
Error:
(node:22299) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 pipe listeners added. Use emitter.setMaxListeners() to increase limit
error: Error: Exceeded maxRedirects. Probably stuck in a redirect loop https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python
at Redirect.onResponse (/opt/develop/privat/mobile/death/node_modules/request/lib/redirect.js:98:27)
at Request.onRequestResponse (/opt/develop/privat/mobile/death/node_modules/request/request.js:996:22)
at emitOne (events.js:96:13)
at ClientRequest.emit (events.js:188:7)
at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:473:21)
at HTTPParser.parserOnHeadersComplete (_http_common.js:99:23)
at TLSSocket.socketOnData (_http_client.js:362:20)
at emitOne (events.js:96:13)
at TLSSocket.emit (events.js:188:7)
at readableAddChunk (_stream_readable.js:176:18)
statusCode: undefined
body: undefined
I went a bit deeper the rabbit hole. aiohttp
uses SimpleCookie which implements the spec rfc2109
. requests
uses Cookie the newer rfc2965 witch obsolets spec rfc2109
.
Firefox on the other side implements something custom RFC2109/2616
So long story short aiohttp
should use the newer Cookie implementation.
Thanks for shading a light on the problem.
aiohttp cannot just use http.cookiejar.Cookie
-- it has incompatible interface with http.cookies.SimpleCookie
.
rfc2965
is obsoleted by RFC6265 BTW.
The RFC explicitly enumerates all allowed cookie properties (and header name is Set-Cookie
, Set-Cookie2
is obsolete).
The list of properties is fixed. For SameSite
where is a new RFC draft but it is not accepted yet.
Sorry, the RFC specifies extension-av
rule as extension-av = <any CHAR except CTLs or ";">
.
Ideas?
SimpleCookie.__parsestring is just too strict: it also fails to handle expiration dates incorrectly (parses them based on the sane-cookie-date only), see https://bugs.python.org/issue35824#msg334392.
(this means that aiohttp is affected by this (https://github.com/aio-libs/aiohttp/blob/9504fe2affaaff673fa4f3754c1c44221f8ba47d/aiohttp/cookiejar.py#L175), i.e. it cannot handle the following cookie from GitHub: Set-Cookie: has_recent_activity=1; path=/; expires=Mon, 22 Apr 2019 23:27:18 -0000
(due to "-0000" instead of "GMT" at the end).
The interface for
aiohttp cannot just use
http.cookiejar.Cookie
-- it has incompatible interface withhttp.cookies.SimpleCookie
.
But couldn't it be changed to be based on http.cookies.CookieJar
(similar to how requests does it)?
Another/simpler solution might be to pass an adjusted patt
to __parsestring
(using it directly).
Currently aiohttp fails to handle 3 out of 4 cookies sent by github.com:
5 async def main():
6 cjar = aiohttp.CookieJar()
7 async with aiohttp.ClientSession(cookie_jar=cjar) as session:
8 async with session.get('https://github.com') as response:
9 pass
10 -> __import__('pdb').set_trace()
return None
(Pdb++) pp response.headers.getall("set-cookie")
['has_recent_activity=1; path=/; expires=Fri, 03 May 2019 00:10:35 -0000',
'_octo=GH1.1.393119700.1556838635; domain=.github.com; path=/; expires=Sun, 02 May 2021 23:10:35 -0000',
'logged_in=no; domain=.github.com; path=/; expires=Mon, 02 May 2039 23:10:35 -0000; secure; HttpOnly',
'_gh_sess=NXXXf; path=/; secure; HttpOnly']
(Pdb++) pp len(response.headers.getall("set-cookie"))
4
(Pdb++) pp list(cjar)
[{'comment': '',
'domain': 'github.com',
'expires': '',
'httponly': True,
'max-age': '',
'path': '/',
'samesite': '',
'secure': True,
'version': ''}]
(Pdb++) pp len(list(cjar))
1
@asvetlov For me I see it as a buggy server which forces to set cookies via redirect to pass you through. Honestly, it's horrible solution, but technically it may be correct.
From Section 3 in RFC6265:
[...] User agents MAY ignore Set-Cookie headers contained in responses with 100-level status codes but MUST process Set-Cookie headers contained in other responses [...]
I'm not sure if that server is/was buggy, but the client implementation is broken w.r.t. that RFC when it ignores cookies set in HTTP 3xx responses. On the other hand, the aiohttp documentation states:
Response cookies contain only values, that were in Set-Cookie
headers of the last request in redirection chain.
I found it surprising that cookies set on redirection (HTTP 302 where I've tried) are discarded by aiohttp
, though I'm always using a ClientSession
instance. I still don't know the reason but now the allow_redirects
need to be always False
for a scraper I'm working in. For now I'm applying the consecutive requests one by one as a workaround (I'm using aiohttp 3.5.4, Python 3.7.3).
If somebody wants to propose a pull request with the fix -- I'll be happy to review it.
Long story short
Redirect not been followed. I always get a
302
back.Expected behaviour
Get a
200
back with text filled.Steps to reproduce