Issue with MultipartReader

thehesiod commented 6 years ago

I'm trying to mock the google API batch endpoint which takes a multipart request. It ends up sending a request like this:

--===============0167281537324890874==
Content-Type: application/http
MIME-Version: 1.0
Content-Transfer-Encoding: binary
Content-ID: <ad845523-bc65-4f77-bfad-2b8709570634+1>

GET /gmail/v1/users/me/messages/660cf5e1784d4d98b8d3adeabebb413e?format=metadata&alt=json HTTP/1.1
Content-Type: application/json
MIME-Version: 1.0
accept: application/json
accept-encoding: gzip, deflate
user-agent: google-api-python-client/1.6.2 (gzip)
Authorization: Bearer TOKEN
Host: 192.168.16.6:63191

--===============0167281537324890874==--

with headers:

{'Host': '192.168.16.6:63191', 'Content-Length': '553', 'content-type': 'multipart/mixed; boundary="===============0167281537324890874=="', 'authorization': 'Bearer TOKEN', 'user-agent': 'Python-httplib2/0.10.3 (gzip)', 'accept-encoding': 'gzip, deflate'}

And with a handler like the following:

    async def gmail_batch(self, req: web.Request):
        reader = await req.multipart()
        async for part in reader:
            async for data in part:
                print()

I get the exception:

  File "/Users/amohr/dev/Operations/email_security/svc_tests/mock_google_svcs.py", line 116, in gmail_batch
    async for data in part:
  File "/Users/amohr/.pyenv/versions/3.6.2/lib/python3.6/site-packages/aiohttp/multipart.py", line 235, in __anext__
    part = yield from self.next()
  File "/Users/amohr/.pyenv/versions/3.6.2/lib/python3.6/site-packages/aiohttp/multipart.py", line 242, in next
    item = yield from self.read()
  File "/Users/amohr/.pyenv/versions/3.6.2/lib/python3.6/site-packages/aiohttp/multipart.py", line 261, in read
    data.extend((yield from self.read_chunk(self.chunk_size)))
  File "/Users/amohr/.pyenv/versions/3.6.2/lib/python3.6/site-packages/aiohttp/multipart.py", line 279, in read_chunk
    chunk = yield from self._read_chunk_from_stream(size)
  File "/Users/amohr/.pyenv/versions/3.6.2/lib/python3.6/site-packages/aiohttp/multipart.py", line 322, in _read_chunk_from_stream
    assert self._content_eof < 3, "Reading after EOF"
AssertionError: Reading after EOF

seems like it's not correctly determining the boundary

thehesiod commented 6 years ago

looks like in my example the lines were separated by \n's, and not \r\n

thehesiod commented 6 years ago

I've traced this back to the fact that the default linesep for the email module in python is '\n': https://github.com/python/cpython/blob/master/Lib/email/_policybase.py#L163 used by https://github.com/google/google-api-python-client/blob/master/googleapiclient/http.py#L1328 since policy=None. So the question is should aiohttp check for '\n' + boundary or `\r\n' + boundary as it currently does.

thehesiod commented 6 years ago

in particular the issue is on this line: https://github.com/aio-libs/aiohttp/blob/master/aiohttp/multipart.py#L324

asvetlov commented 6 years ago

https://tools.ietf.org/html/rfc7578 explicitly states CRLF as a separator. We could support LF also though but I not sure. Guys, opinions?

kxepal commented 6 years ago

Only if optional to be explicitly enable by user.

From Python docs about linesep:

It defaults to ``\n`` because that is the most useful value for Python application code (other library packages expect ``\n`` separated lines).  ``linesep=\r\n`` can be used to generate output with RFC-compliant line separators.

I think we should stay with RFC by default.

thehesiod commented 6 years ago

Well it's working with Google servers. We should check how other servers behave.

On Oct 5, 2017 2:13 AM, "Alexander Shorin" notifications@github.com wrote:

Only if optional to be explicitly enable by user.

From Python docs about linesep:

It defaults to \n because that is the most useful value for Python application code (other library packages expect \n separated lines). linesep=\r\n can be used to generate output with RFC-compliant line separators.

I think we should stay with RFC by default.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/aio-libs/aiohttp/issues/2302#issuecomment-334407530, or mute the thread https://github.com/notifications/unsubscribe-auth/AD0P_f4VmhADU5gqVrpqjSm-Z59OHLpmks5spJ3AgaJpZM4PuS4i .

kxepal commented 6 years ago

I don't think this check will change everything. We must follow the specifications in all the questions. That's the only way to make sure we're doing everything right.

However, we have to admit that life is...complicated and legacy still exists and backward compatibility with ancients still have to be preserved by major providers etc. etc. For such cases, if they are really worth to be supported (and if they cannot fix their code due to same BC), well, some fix may have to be implemented, but as opt-in feature.

Even not all browsers correctly support multipart spec, so I think some room of customization is possible here.

asvetlov commented 6 years ago

The issue is still waiting for a hero. We need to accept separator argument in both multipart reader and writer (\r\n by default). The change is not very hard but requires some amount of work.

thehesiod commented 6 years ago

btw, I'd like to be able to optionally support both styles, so perhaps a boolean instead of specify if should allow for both?

asvetlov commented 6 years ago

Do you mean a heuristic for detection what style should be used by the reader, \r\n or just \n? Writer should know the style on construction.

thehesiod commented 6 years ago

yes, ala splitlines. There should be a mode where the several can accept either way, that's what the google servers do for example. I bet apache as well.

asvetlov commented 6 years ago

@thehesiod would you work on implementation? The first version can be done without autodetection.

thehesiod commented 6 years ago

ya I'll give it a shot

asvetlov commented 6 years ago

Cool!

webknjaz commented 6 years ago

Nota bene

https://www.w3.org/Protocols/rfc2616/rfc2616-sec19.html#sec19.3 (but it's obsolete since June 2014 by RFC7230):

19.3 Tolerant Applications ... The line terminator for message-header fields is the sequence CRLF. However, we recommend that applications, when parsing such headers, recognize a single LF as a line terminator and ignore the leading CR. ...

(https://stackoverflow.com/a/5757349/595220)

https://tools.ietf.org/html/rfc7231#section-3.1.1.3:

3.1.1.3. Canonicalization and Text Defaults ... An HTTP sender MAY generate, and a recipient MUST be able to parse, line breaks in text media that consist of CRLF, bare CR, or bare LF.... This flexibility regarding line breaks applies only to text within a representation that has been assigned a "text" media type; it does not apply to "multipart" types or HTTP elements outside the payload body (e.g., header fields). ... 3.1.1.4. Multipart Types ... The message body is itself a protocol element; a sender MUST generate only CRLF to represent line breaks between body parts.

aio-libs / aiohttp

Issue with MultipartReader #2302

Nota bene