etianen / aiohttp-wsgi

WSGI adapter for aiohttp.
https://aiohttp-wsgi.readthedocs.io
BSD 3-Clause "New" or "Revised" License
232 stars 20 forks source link

RAW_URI/REQUEST_URI for unquoted path #17

Closed dahlia closed 7 years ago

dahlia commented 7 years ago

This patch adds non-standard fields named RAW_URI and REQUEST_URI to WSGI environ. This solves the following problem.

According to PEP 3333, WSGI requires to unquote PATH_INFO before it's passed to an application, as CGI did. Unfortunately, it makes some unrecoverable loss of its raw request from a client.

For example, if a request was GET /A%2FB, the %2F sequence is unquoted to /A/B.

There's one more difficulty about PATH_INFO when it comes to Python 3. As WSGI's PATH_INFO environ should be a Unicode str, percent-encoded sequence (which is encoding of bytes, rather than Unicode characters) needs to be interpreted in a certain encoding — aiohttp-wsgi indeed interprets them in UTF-8 encoding. It means that if a client requests a path containing some bytes encoded in non-UTF-8 the requested path (e.g. /%C5%D7%BD%BA%C6%AE) will be lost in an unrecoverable way.

Further reading: WSGI Amendments thoughts: the horror of charsets (Web-SIG).

Since this problem has been quite well known to WSGI implementers, several widely-used WSGI servers provide some non-standard workarounds. uWSGI and Apache mod_wsgi provides REQUEST_URI. Gunicorn provides RAW_URI. These fields contain an untouched and uninterpreted path string. Some WSGI frameworks or applications use it (though they fallback to PATH_INFO if these fields don't exist). This patch chose these field names rather than introducing yet another non-standard field.

etianen commented 7 years ago

This seems like a good idea for fitting in with existing workarounds.

FWIW: you could already do this via environ["aiohttp.request"].raw_path.

blueyed commented 5 years ago

I've found this when researching for https://github.com/django/asgiref/issues/87.

It appears to be a bit inconsistent with query info though. The following test passes:

diff --git i/tests/test_environ.py w/tests/test_environ.py
index 3c5ac6d..2499191 100644
--- i/tests/test_environ.py
+++ w/tests/test_environ.py
@@ -89,6 +89,12 @@ def assert_environ_quoted_path_info(environ):
     assert environ['RAW_URI'] == "/%ED%85%8C%2F%EC%8A%A4%2F%ED%8A%B8"
     assert environ['REQUEST_URI'] == "/%ED%85%8C%2F%EC%8A%A4%2F%ED%8A%B8"

+@environ_application
+def assert_environ_quoted_path_info_with_query(environ):
+    assert environ['PATH_INFO'] == "/foo/bar"
+    assert environ['RAW_URI'] == "/foo%2Fbar?query/info"
+    assert environ['REQUEST_URI'] == "/foo%2Fbar?query/info"
+

 class EnvironTest(AsyncTestCase):

@@ -133,3 +139,7 @@ def testEnvironRootSubdirTrailing(self):
     def testQuotedPathInfo(self):
         with self.run_server(assert_environ_quoted_path_info) as client:
             client.assert_response(path="/%ED%85%8C%2F%EC%8A%A4%2F%ED%8A%B8")
+
+    def testQuotedPathInfoWithQuery(self):
+        with self.run_server(assert_environ_quoted_path_info_with_query) as client:
+            client.assert_response(path="/foo%2Fbar?query%2Finfo")

I.e. the query string is percent-decoded here.

This is just a drive-by comment, but I think it would be good to document this via some test at least.

As for ASGI (https://github.com/django/asgiref/issues/87) I think it makes sense to have a raw_path without the additional query string (which is not unquoted there).

etianen commented 5 years ago

So, you're saying that RAW_URI contains the quoted path, but the unquoted query string. That seems pretty weird. I think aiohttp-wsgi should change this behaiour.

Based on your research, would it be more standard to:

  1. Remove the query string from RAW_URI (e.g. /foo%2Fbar).
  2. Quote the query string (e.g. /foo%2Fbar?query%2Finfo).

I'm inclined to think (1.) is correct.