aio-libs / aiohttp

Asynchronous HTTP client/server framework for asyncio and Python
https://docs.aiohttp.org

Wrong URL parsing in redirect #6626

Open ldebortoli opened 2 years ago

ldebortoli commented 2 years ago

Describe the bug

I'm scraping the Marvel Fandom site using asyncio tasks. Every time I try to get the HTML document from a URL that contains the '&' symbol, I run into the same issue:

For example with: https://marvel.fandom.com/wiki/Amazing_Spider-Man_&_Silk:_The_Spider(fly)_Effect_Vol_1

This link redirects to another link where the & symbol is replaced with its percent-encoded form (& = %26). So the actual URL of the page is https://marvel.fandom.com/wiki/Amazing_Spider-Man_%26_Silk:_The_Spider(fly)_Effect_Vol_1

Then I run the following code and I get an error

[screenshot: the request code and the exception traceback]

To catch the exception I made the following:

[screenshot: the same request wrapped in a try/except, printing the redirect responses]

And as you can see, the code redirects many times until the maximum number of redirects is reached. The problem is that the URL obtained in the redirection ('Location': 'https://marvel.fandom.com/wiki/Amazing_Spider-Man_%26_Silk:_The_Spider(fly)_Effect_Vol_1') is apparently re-quoted, and %26 is replaced by & again.

The same error occurs if I use https://marvel.fandom.com/wiki/Amazing_Spider-Man_%26_Silk:_The_Spider(fly)_Effect_Vol_1 the first time: %26 is replaced by & and it redirects until the limit is reached. To make it work I had to replace & with %%26 so that it parses to a literal "%26" in the URL.

To Reproduce

[screenshot: minimal reproduction code]
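Since the snippets above are only screenshots, here is a minimal reproduction sketch, assuming a plain ClientSession.get() call against the URL from the report (the exact original code is not visible):

```python
import asyncio

import aiohttp

URL = "https://marvel.fandom.com/wiki/Amazing_Spider-Man_&_Silk:_The_Spider(fly)_Effect_Vol_1"


async def main():
    async with aiohttp.ClientSession() as session:
        # The server redirects to the %26-encoded URL, but the default
        # re-quoting turns %26 back into "&", so the same redirect is
        # followed again until TooManyRedirects is raised.
        async with session.get(URL) as resp:
            print(resp.status, resp.url)


asyncio.run(main())
```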

Expected behavior

Redirects correctly

Logs/tracebacks

Not applicable

Python Version

$ python --version

Python 3.10.1

aiohttp Version

$ python -m pip show aiohttp

Name: aiohttp
Version: 3.8.1
Summary: Async http client/server framework (asyncio)
Home-page: https://github.com/aio-libs/aiohttp
Author: None
Author-email: None
License: Apache 2

multidict Version

$ python -m pip show multidict

Name: multidict
Version: 6.0.2
Summary: multidict implementation
Home-page: https://github.com/aio-libs/multidict
Author: Andrew Svetlov
Author-email: andrew.svetlov@gmail.com
License: Apache 2

yarl Version

$ python -m pip show yarl

Name: yarl
Version: 1.7.2
Summary: Yet another URL library
Home-page: https://github.com/aio-libs/yarl/
Author: Andrew Svetlov
Author-email: andrew.svetlov@gmail.com
License: Apache 2

OS

Manjaro, Linux

Related component

Client

Additional context

No response


NewGlad commented 11 months ago

Can be fixed by using ClientSession(requote_redirect_url=False)
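For reference, a quick sketch of that workaround (requote_redirect_url is an existing ClientSession parameter; the URL is just the one from this report):

```python
import asyncio

import aiohttp


async def fetch(url: str) -> str:
    # Pass the redirect Location header through as-is instead of re-quoting it.
    async with aiohttp.ClientSession(requote_redirect_url=False) as session:
        async with session.get(url) as resp:
            return await resp.text()


html = asyncio.run(
    fetch("https://marvel.fandom.com/wiki/Amazing_Spider-Man_%26_Silk:_The_Spider(fly)_Effect_Vol_1")
)
```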

I don't know what would be a better solution here; interested to hear the core devs' opinion.

Dreamsorcerer commented 11 months ago

Well, that certainly looks like the fix.

I guess the question is whether it makes sense that it requotes by default... Is there a security concern here? @webknjaz

Dreamsorcerer commented 11 months ago

For reference, the parameter was added in response to #1474. It appears there was no discussion/consideration of changing the default behaviour.

webknjaz commented 11 months ago

Not sure, I wouldn't change this — there's always going to be somebody for whom the current default is better and vice versa.

Dreamsorcerer commented 11 months ago

I'm just thinking that the average user expects the redirect to just work correctly. But requoting in some cases breaks the redirection, and the user doesn't know why and reports a bug like this one. If there's no security issue here, then I think it makes more sense to default to the approach that's expected to work (and must be what requests and all browsers do).

webknjaz commented 11 months ago

@Dreamsorcerer there might be a potential security issue with the Location header, but I don't remember exactly off the top of my head. It's probably dangerous for apps that re-expose/reinterpret the resulting URL, maybe proxies (request smuggling et al.).

Dreamsorcerer commented 11 months ago

Yeah, I'm just struggling to think how at the moment. If we are going to accept an arbitrary URL and follow the redirection, I'm just not sure what difference it can make by quoting the arguments. It can send us to an attacker URL or whatever regardless...

webknjaz commented 11 months ago

One of the past vulnerabilities allowed piggybacking an additional HTTP request through pipelining, with an app being behind a proxy that interpreted the request differently and split it into two. That was a server-side thing, though. It's usually obvious after such a thing happens, and doesn't seem probable before 🤷‍♂️

NewGlad commented 10 months ago

If the redirect URL is equal to the one requested by the user, I think there is no point in requesting that URL in a loop until aiohttp.client_exceptions.TooManyRedirects is raised.

I assume that in this case a log warning or another exception could be raised, reporting that the redirect URL is the same as the requested one (or any other in the redirection chain) and suggesting the use of requote_redirect_url=False.
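A rough application-side sketch of that idea (hypothetical; the loop detection and the hint live in user code here, not in aiohttp itself):

```python
import asyncio

import aiohttp

URL = "https://marvel.fandom.com/wiki/Amazing_Spider-Man_%26_Silk:_The_Spider(fly)_Effect_Vol_1"


async def main():
    async with aiohttp.ClientSession() as session:
        try:
            async with session.get(URL) as resp:
                print(resp.status, resp.url)
        except aiohttp.TooManyRedirects as exc:
            # exc.history holds the redirect responses; if every hop requested
            # the same URL, requoting is likely undoing the server's encoding.
            urls = {str(r.url) for r in exc.history}
            if len(urls) == 1:
                print("Redirect loop on", urls.pop(),
                      "- try ClientSession(requote_redirect_url=False)")


asyncio.run(main())
```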

Dreamsorcerer commented 10 months ago

If the redirect URL is equal to the one requested by the user, I think there is no point in requesting that URL in a loop until aiohttp.client_exceptions.TooManyRedirects is raised.

Well, I think this behaviour is actually consistent with browsers, which will try around 20 redirects before giving up, even if they are circular.

Also, the redirect URL is different from the URL we actually requested, it's only after we requote the URL that it becomes the same.

Dreamsorcerer commented 10 months ago

Maybe, if we aren't going to change the default, we could at least add a message to that exception suggesting that requote_redirect_url=False might be what they are looking for. Though this would only help a small portion of people, as most people are probably getting redirected to a wrong page (e.g. 404 or home page) that doesn't redirect again.