aio-libs / yarl

Yet another URL library
https://yarl.aio-libs.org
Apache License 2.0

Germanic problems -- IDNA does not round-trip #148

Closed wumpus closed 6 years ago

wumpus commented 6 years ago
import asyncio
import sys

import aiohttp


async def fetch(session, url):
    # Issue a GET request and return the response body as text.
    async with session.get(url) as response:
        return await response.text()


async def main(url):
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        print(html)


# The URL to fetch is taken from the first command-line argument.
asyncio.run(main(sys.argv[1]))
$ python ./bug.py http://xn--einla-pqa/robots.txt
...
UnicodeError: ('IDNA does not round-trip', b'xn--einla-pqa', b'einlass')
wumpus commented 6 years ago
UnicodeError: ('IDNA does not round-trip', b'xn--schnellimbi-56a', b'schnellimbiss')
UnicodeError: ('IDNA does not round-trip', b'xn--fubett-cta', b'fussbett')
UnicodeError: ('IDNA does not round-trip', b'xn--fleiig-eta', b'fleissig')
UnicodeError: ('IDNA does not round-trip', b'xn--fupflegegeraet-1fb', b'fusspflegegeraet')
UnicodeError: ('IDNA does not round-trip', b'xn--metallverschluee-ulb', b'metallverschluesse')
UnicodeError: ('IDNA does not round-trip', b'xn--einla-pqa', b'einlass')
asvetlov commented 6 years ago

The exception is raised by Python internals on b'xn--einla-pqa'.decode('idna'). The codec doesn't support any error handling mode other than 'strict'. Honestly, I have no idea how to help you.

A quick search turns up resources like https://stackoverflow.com/questions/9806036/idna-does-not-round-trip, which describes the problem but offers no solution.
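
The decode call mentioned above can be reproduced with the standard library alone, without aiohttp or yarl. This is a minimal sketch, assuming a CPython whose built-in idna codec implements IDNA2003; the helper name `try_builtin_idna` is made up for illustration and is not part of any library.

```python
def try_builtin_idna(label: bytes):
    """Decode an IDNA label with the built-in codec; return None on failure."""
    try:
        return label.decode("idna")
    except UnicodeError:
        # IDNA2003 maps 'ß' to 'ss', so re-encoding the decoded label does
        # not give back the original bytes and the codec rejects it.
        return None


# The raw Punycode payload (the part after "xn--") decodes without complaint.
print(b"einla-pqa".decode("punycode"))

# The full IDNA label is rejected by the codec's round-trip check.
print(try_builtin_idna(b"xn--einla-pqa"))
```

The Punycode step itself succeeds; it is the additional IDNA2003 normalization and re-encoding check that fails for labels containing 'ß'.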

socketpair commented 6 years ago

I have created a bug report: https://bugs.python.org/issue32437

asvetlov commented 6 years ago

Well, https://pypi.python.org/pypi/idna supports the IDNA2008 standard. We could just replace the built-in idna codec with the library. The change is trivial. @wumpus would you provide a pull request?

wumpus commented 6 years ago

I can't figure out how to configure the yarl development environment, sorry.

You're correct that it's a fairly trivial change, 3 lines of code and 4 lines of test.

asvetlov commented 6 years ago

Clone yarl, create a virtual environment, and activate it.

$ cd yarl
$ pip install -r requirements/dev.txt
$ make test

If you still have problems, please let me know.

wumpus commented 6 years ago

That did work -- I had been trying to do "python ./setup.py install" before and ended up with my changed tests still running against the old code. Anyway. Now I'm good and I made the pullreq.

wumpus commented 6 years ago

OK I'm ready for a review https://github.com/aio-libs/yarl/pull/149

asvetlov commented 6 years ago

I've published yarl 0.18 with related changes. The main idea is: try the idna library to encode as IDNA2008. If it fails ('_' is not supported by the 2008 edition, for example), use .encode('idna') to fall back to IDNA2003.
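
The fallback described above could look roughly like the following. This is only a sketch of the idea, not yarl's actual code: the function name `encode_host` is hypothetical, and the optional import is there only so the example still runs when the third-party idna package is not installed.

```python
try:
    import idna  # third-party package implementing IDNA2008
except ImportError:  # keep the sketch runnable without the dependency
    idna = None


def encode_host(host: str) -> bytes:
    """Encode a host name, preferring IDNA2008 and falling back to IDNA2003."""
    if idna is not None:
        try:
            return idna.encode(host)
        except idna.IDNAError:
            # For example, '_' is rejected by IDNA2008; fall through to the
            # built-in IDNA2003 codec below.
            pass
    return host.encode("idna")


print(encode_host("example.com"))
print(encode_host("my_host.example"))
```

With the idna package available, a host such as "einlaß" should be encoded to its xn-- form instead of being folded to "einlass", while names that IDNA2008 rejects still go through the older codec.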

asvetlov commented 6 years ago

Fixed, I hope.

wumpus commented 6 years ago

Ha ha ok so

./bug.py https://xn--ho-hia.de

still fails but the error is in python's ssl library. Not your problem. yarl seems fine now for broad web crawls.

I have added this ssl.py failure info to https://bugs.python.org/issue17305

hellysmile commented 6 years ago

https://github.com/aio-libs/idna_ssl is a fix for the Python ssl IDNA bug.