aio-libs / aiohttp

Asynchronous HTTP client/server framework for asyncio and Python
https://docs.aiohttp.org
Other
15.08k stars 2.01k forks source link

Encoding detection can lead to `LookupError` #2732

Closed gilbsgilbs closed 6 years ago

gilbsgilbs commented 6 years ago

Long story short

Some encoding detected by cchardet are unsupported by python (e.g. VISCII). This makes text() function raise a LookupError when such encoding is detected, even when errors parameter is set to 'ignore' (which I would have assumed to be safe).

Expected behaviour

Not sure about this. Maybe get_encoding() should codecs.lookup for the detected encoding to ensure it is known, and if it isn't, fallback to UTF-8. Or document properly that text() function might raise LookupError or that get_encoding() result is not safe to pass to text(encoding) or .decode(encoding) directly.

Actual behaviour

LookupError is thrown.

Steps to reproduce

async def test_text_viscii(loop, session):
    response = ClientResponse('get', URL('http://def-cl-resp.org'))
    response._post_init(loop, session)

    def side_effect(*args, **kwargs):
        fut = loop.create_future()
        body = b'''\
\x43\x68\xe6\x20\x51\x75\xaf\x63\x20\x6e\x67\xe6\x20\x6c\xe0\x20
\x68\xae\x20\x63\x68\xe6\x20\x76\x69\xaa\x74\x20\x74\x68\xaf\x6e
\x67\x20\x6e\x68\xa4\x74\x20\x63\x68\xed\x6e\x68\x20\x74\x68\xd1
\x63\x20\x68\x69\xae\x6e\x20\x6e\x61\x79\x20\x63\xfc\x61\x20\x74
\x69\xaa\x6e\x67\x20\x56\x69\xae\x74\x2c\x20\x73\xd8\x0a\x64\xf8
\x6e\x67\x20\x6b\xfd\x20\x74\xf1\x20\x4c\x61\x20\x54\x69\x6e\x68
\x2c\x20\x64\xf1\x61\x20\x74\x72\xea\x6e\x20\x63\xe1\x63\x20\x62
\xe4\x6e\x67\x20\x63\x68\xe6\x20\x63\xe1\x69\x20\x63\xfc\x61\x20
\x6e\x68\xf3\x6d\x20\x6e\x67\xf4\x6e\x20\x6e\x67\xe6\x20\x52\xf4
\x6d\x61\x6e\x2c\x5b\x31\x5d\x20\xf0\xa3\x63\x0a\x62\x69\xae\x74
\x20\x6c\xe0\x20\x62\xe4\x6e\x67\x20\x63\x68\xe6\x20\x63\xe1\x69
\x20\x42\xb0\x20\xd0\xe0\x6f\x20\x4e\x68\x61\x2c\x5b\x32\x5d\x20
\x76\xbe\x69\x20\x63\xe1\x63\x20\x64\xa4\x75\x20\x70\x68\xf8\x20
\x63\x68\xfc\x20\x79\xaa\x75\x20\x74\xd7\x20\x62\xe4\x6e\x67\x20
\x63\x68\xe6\x20\x63\xe1\x69\x20\x48\x79\x0a\x4c\xd5\x70\x2e\x0a'''
        fut.set_result(body)
        return fut

    response.headers = {
        'Content-Type': 'text/plain'}
    content = response.content = mock.Mock()
    content.read.side_effect = side_effect
    await response.text(errors='ignore')  # crash here

Your environment

asvetlov commented 6 years ago

I think mentioning LookupError in documentation is just fine. Fallback to UTF-8 is not reliable: it produces a garbage or even exception most likely. Would you make a Pull Request?

gilbsgilbs commented 6 years ago

@asvetlov I agree, but that seems inconsistent with the current behavior. If the client passes charset=<some unknown encoding> in content type and cchardet can't detect an encoding, get_encoding will fallback to utf-8. It shouldn't.

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. If you feel like there's important points made in this discussion, please include those exceprts into that new issue.