aio-libs / aiohttp

Asynchronous HTTP client/server framework for asyncio and Python
https://docs.aiohttp.org
Other
15.13k stars 2.02k forks source link

Invalid header value char #7494

Closed glmnet closed 1 year ago

glmnet commented 1 year ago

Describe the bug

I had a script doing some scrapping from a cablemodem, the web server is dodgy but used to work, it started failing now

To Reproduce

You won't be able to reproduce as this is my own cablemodem, I'm providing the capture data below.

Expected behavior

Parse the response somehow

Logs/tracebacks

(most recent call last):
  File "C:\Users\glmne\AppData\Local\Programs\Python\Python311\Lib\site-packages\aiohttp\client_reqrep.py", line 899, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
                       ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\glmne\AppData\Local\Programs\Python\Python311\Lib\site-packages\aiohttp\streams.py", line 616, in read
    await self._waiter
  File "C:\Users\glmne\AppData\Local\Programs\Python\Python311\Lib\site-packages\aiohttp\client_proto.py", line 213, in data_received
    messages, upgraded, tail = self._parser.feed_data(data)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "aiohttp\_http_parser.pyx", line 557, in aiohttp._http_parser.HttpParser.feed_data
aiohttp.http_exceptions.BadHttpMessage: 400, message:
  Invalid header value char:

    bytearray(b'Pragma: no-cache')
                                 ^

Python Version

$ python --version
3.11.4

aiohttp Version

$ python -m pip show aiohttp
Name: aiohttp
Version: 3.8.5

multidict Version

$ python -m pip show multidict
Name: multidict
Version: 6.0.4

yarl Version

$ python -m pip show yarl
Name: yarl
Version: 1.9.2

OS

Windows 11

Related component

Client

Additional context

I'm not sure what the issue really is, the only odd thing to me is cablemodem answering HTTP/1.0 I don't know if this is to be expected.

Here is the request / response captured with Wireshark

00000000  50 4f 53 54 20 2f 67 6f  66 6f 72 6d 2f 6c 6f 67   POST /go form/log
00000010  69 6e 20 48 54 54 50 2f  31 2e 31 0d 0a 48 6f 73   in HTTP/ 1.1..Hos
00000020  74 3a 20 31 39 32 2e 31  36 38 2e 31 30 30 2e 31   t: 192.1 68.100.1
00000030  0d 0a 41 63 63 65 70 74  3a 20 2a 2f 2a 0d 0a 41   ..Accept : */*..A
00000040  63 63 65 70 74 2d 45 6e  63 6f 64 69 6e 67 3a 20   ccept-En coding: 
00000050  67 7a 69 70 2c 20 64 65  66 6c 61 74 65 0d 0a 55   gzip, de flate..U
00000060  73 65 72 2d 41 67 65 6e  74 3a 20 50 79 74 68 6f   ser-Agen t: Pytho
00000070  6e 2f 33 2e 31 31 20 61  69 6f 68 74 74 70 2f 33   n/3.11 a iohttp/3
00000080  2e 38 2e 35 0d 0a 43 6f  6e 74 65 6e 74 2d 4c 65   .8.5..Co ntent-Le
00000090  6e 67 74 68 3a 20 32 36  0d 0a 43 6f 6e 74 65 6e   ngth: 26 ..Conten
000000A0  74 2d 54 79 70 65 3a 20  61 70 70 6c 69 63 61 74   t-Type:  applicat
000000B0  69 6f 6e 2f 78 2d 77 77  77 2d 66 6f 72 6d 2d 75   ion/x-ww w-form-u
000000C0  72 6c 65 6e 63 6f 64 65  64 0d 0a 0d 0a            rlencode d....
000000CD  75 73 65 72 3d 63 75 73  61 64 6d 69 6e 26 70 77   user=cus admin&pw
000000DD  73 3d 70 61 73 73 77 6f  72 64                     s=passwo rd
    00000000  48 54 54 50 2f 31 2e 30  20 32 30 30 20 4f 4b 0a   HTTP/1.0  200 OK.
    00000010  53 65 72 76 65 72 3a 20  47 6f 41 68 65 61 64 2d   Server:  GoAhead-
    00000020  57 65 62 73 0d 0a 50 72  61 67 6d 61 3a 20 6e 6f   Webs..Pr agma: no
    00000030  2d 63 61 63 68 65 0a 43  61 63 68 65 2d 63 6f 6e   -cache.C ache-con
    00000040  74 72 6f 6c 3a 20 6e 6f  2d 63 61 63 68 65 0a 43   trol: no -cache.C
    00000050  6f 6e 74 65 6e 74 2d 54  79 70 65 3a 20 74 65 78   ontent-T ype: tex
    00000060  74 2f 68 74 6d 6c 0a 53  65 74 2d 43 6f 6f 6b 69   t/html.S et-Cooki
    00000070  65 3a 20 75 73 65 72 69  64 3d 31 36 39 31 34 34   e: useri d=169144
    00000080  38 38 37 30 3b 20 70 61  74 68 3d 2f 3b 0a 0a 3c   8870; pa th=/;..<
    00000090  73 63 72 69 70 74 20 6c  61 6e 67 75 61 67 65 3d   script l anguage=
    000000A0  27 4a 61 76 61 53 63 72  69 70 74 27 3e 77 69 6e   'JavaScr ipt'>win
    000000B0  64 6f 77 2e 6c 6f 63 61  74 69 6f 6e 3d 27 2f 61   dow.loca tion='/a
    000000C0  64 6d 69 6e 2f 63 61 62  6c 65 2d 53 79 73 74 65   dmin/cab le-Syste
    000000D0  6d 69 6e 66 6f 2e 61 73  70 27 3b 3c 2f 73 63 72   minfo.as p';</scr
    000000E0  69 70 74 3e                                        ipt>
    000000E4  3c 2f 68 74 6d 6c 3e 0a                            </html>. 

looking for issues here, I found that this might be fixed by adding: AIOHTTP_NO_EXTENSIONS=1

in that case exception changes to:

    res_login = await session.post(server_name + "/goform/login", data=payload)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\glmne\AppData\Local\Programs\Python\Python311\Lib\site-packages\aiohttp\client.py", line 560, in _request
    await resp.start(conn)
  File "C:\Users\glmne\AppData\Local\Programs\Python\Python311\Lib\site-packages\aiohttp\client_reqrep.py", line 899, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
                       ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\glmne\AppData\Local\Programs\Python\Python311\Lib\site-packages\aiohttp\streams.py", line 616, in read
    await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: RawResponseMessage(version=HttpVersion(major=1, minor=0), code=200, reason='OK\nServer: GoAhead-Webs', headers=<CIMultiDictProxy('Pragma': "no-cache\nCache-control: no-cache\nContent-Type: text/html\nSet-Cookie: userid=1691450383; path=/;\n\n<script language='JavaScript'>window.location='/admin/cable-Systeminfo.asp';</script></html>")>, raw_headers=((bytearray(b'Pragma'), bytearray(b"no-cache\nCache-control: no-cache\nContent-Type: text/html\nSet-Cookie: userid=1691450383; path=/;\n\n<script language=\'JavaScript\'>window.location=\'/admin/cable-Systeminfo.asp\';</script></html>")),), should_close=True, compression=None, upgrade=False, chunked=False)

This might actually be another issue? For a quick research I found HTTP/1.0 will always close the connection after sending the response so ServerDisconnectedError should be "expected"?

Code of Conduct

Dreamsorcerer commented 1 year ago

We have a fix for the error output (though I hadn't seen a bytearray before), but it's probably ending the line with \n which is invalid. HTTP lines must be separated with \r\n. We're making the client side code a little more lenient, so it may start working again in the next release, though I think we'll need to enable another option for handling dodgy line breaks.

Dreamsorcerer commented 1 year ago

Actually, I can't get your actual response to parse, regardless of the lenient settings I enable... The byte string for the response you listed is: b"HTTP/1.0 200 OK\nServer: GoAhead-Webs\r\nPragma: no-cache\nCache-control: no-cache\nContent-Type: text/html\nSet-Cookie: userid=1691448870; path=/;\n\n<script language='JavaScript'>window.location='/admin/cable-Systeminfo.asp';</script></html>\n"

If I allow lenient CRLF handling, then I still get:

   aiohttp.http_exceptions.BadHttpMessage: 400, message:
     Invalid header token:

       b"Pragma: no-cache\nCache-control: no-cache\nContent-Type: text/html\nSet-Cookie: userid=1691448870; path=/;\n\n<script language='JavaScript'>window.location='/admin/cable-Systeminfo.asp';</script></html>\n"
E                                                                                                                     ^

So, it still trips up after the headers. If possible, I'd suggest getting the server fixed to use \r\n for line breaks.

Dreamsorcerer commented 1 year ago

For the no extensions version, I think it may be similar, as the reason should be OK, but seems to end up with the entire message, suggesting that it again fails to parse it correctly without the correct linebreaks.

Probably, the server closed the connection because it has sent the full response, but the parser still thinks it's reading the HTTP status line and is waiting for the rest of the response.

Dreamsorcerer commented 1 year ago

Actually, llhttp is fixing it: https://github.com/nodejs/llhttp/issues/236 So, we can get this working again in a future release.

glmnet commented 1 year ago

Thanks for taking care of this, IMHO making the parser stricter will yield in lots of reports like this. I support the cause though. I wish I could get in the middle there to fix the response before the parser handles it.

Dreamsorcerer commented 1 year ago

It's been made stricter due to security vulnerabilities. But, for the client side, we can allow some of the lenient options. We will still be enabling strict parsing when using dev mode (python -X dev) though, to help developers find and fix broken HTTP responses.

glmnet commented 1 year ago

Thanks. I'll close this when I find the parsing handling the line breaks gracefully

Dreamsorcerer commented 1 year ago

Currently, 9.0.1 works if I use \r\n\r\n separating the body, but still fails with the exact response you posted.

pjanuario commented 1 year ago

I believe I have the same issue, I am doing requests to a domotic device implemented in a cgi by the manufacturer, I can get the response properly while using requests library but while using aiohttp it fails with the message:

err 400, message="Invalid header value char:\n\n b'Content-type: text/html'\n ^"

I guess this is likely related with message not being entire well formed?

Is there any workaround I can put in place?

pjanuario commented 1 year ago

I believe I have the same issue, I am doing requests to a domotic device implemented in a cgi by the manufacturer, I can get the response properly while using requests library but while using aiohttp it fails with the message:

err 400, message="Invalid header value char:\n\n b'Content-type: text/html'\n ^"

I guess this is likely related with message not being entire well formed?

Is there any workaround I can put in place?

This might give some hints, I believe it's because of a new line in the headers string?

 $ curl -k -iv --raw .....
* ALPN: offers h2,http/1.1
* using HTTP/1.x
> POST /recepcion_datos_4.cgi HTTP/1.1
> User-Agent: curl/7.88.1
> Accept: */*
> accept-encoding: gzip,deflate
> Content-Length: 16
> Content-Type: application/x-www-form-urlencoded
>
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
HTTP/1.0 200 OK
< Content-type: text/html
Content-type: text/html

<
error_MODO_on_off=0
on_off=0
modo_operacion=0
modo_func=1
estado=0
consigna_potencia=3
consigna_temperatura=20.5
temperatura=24.2
temperatura_ext=---.-
* TLSv1.2 (IN), TLS alert, close notify (256):
* Closing connection 0
* TLSv1.2 (OUT), TLS alert, close notify (256):
Dreamsorcerer commented 1 year ago

That's not the same, and I can't tell exactly from your output, but maybe because the server is erroneously using \n instead of \r\n. Possibly related to: https://github.com/nodejs/llhttp/issues/241

Dreamsorcerer commented 1 year ago

Sorry, mixing it up with another issue. That is the same issue, if that is the case. But, your output doesn't show the bytes, so I can't tell if it is using \r\n or just \n.

pjanuario commented 1 year ago

I believe I have the same issue, I am doing requests to a domotic device implemented in a cgi by the manufacturer, I can get the response properly while using requests library but while using aiohttp it fails with the message:

err 400, message="Invalid header value char:\n\n b'Content-type: text/html'\n ^"

I guess this is likely related with message not being entire well formed?

Is there any workaround I can put in place?

from the error message of the aiohttp exception above I believe it adds \n and empty strings after it. I have seen more people in homeassistant community having troubles with this. In our particular case we are mostly querying servers that run from firmwares unlikely to be updated. :/ Is there a way to workaround this? to make it less strict I guess?

pjanuario commented 1 year ago

I tried to downgrade the module version and test, it doesn't work 3.8.4 but it works with 3.8.3

Dreamsorcerer commented 1 year ago

We didn't touch the parser in 3.8.4: https://github.com/aio-libs/aiohttp/compare/v3.8.3...v3.8.4

pjanuario commented 1 year ago

We didn't touch the parser in 3.8.4: v3.8.3...v3.8.4

That is odd, it doesn't work with 3.8.5...3.8.4, I am not very familiar with python ecosystem, what the way to provide more information? Above I just sent the error message from aiohttp and the curl response.

Dreamsorcerer commented 1 year ago

Maybe if it works with AIOHTTP_NO_EXTENSIONS=1 (envvar), then you can just paste the result from resp.read(). But, curl output is going to be interpreted by the terminal, so it's not helping here. We need the raw binary string in Python, or the hexcodes as in wireshark example above or something similar.

black-snow commented 1 year ago

Same error here on win10. Downgrade to 3.8.3 fixed the issue.

Dreamsorcerer commented 1 year ago

This should be fine in master now, should have a new release shortly.

Dreamsorcerer commented 1 year ago

These kind of messages should parse now in 3.8.6. If you use Python dev mode though, it will go back to strict parsing, to help discover bugs in servers.

glmnet commented 1 year ago

Thanks for the heads up, I've found this while coding for a tiny scrapper which runs inside Home Assistant, they bumped to 3.8.6 already but now they are beta-testing 3.9, anyway I won't be checking this until they release it which will be first Wednesday of November.

glmnet commented 1 year ago

Tested this today on desktop, works great with 3.8.6. Thank you